State of the Union Addresses Analysis

This project analyzes the State of the Union Addresses delivered by US presidents. For the code of this project, please visit State of Union Addresses Analysis Code under the Code Samples section.


This project takes the State of the Union Addresses of the presidents of the United States and looks at the patterns of sentences and words across all of them. Using R, we compute the number of sentences per address, the average number of words per sentence, the total number of words per address, and the word stems that are used most frequently across all addresses. By plotting those variables, we can visualize how the length of the addresses varies across presidencies. Finally, by looking at the data, we can speculate about which factors may have contributed to these trends.


The State of the Union Addresses are taken from the website The American Presidency Project. The analysis looks at 223 addresses made by 41 (out of 44) presidents. It breaks each address into sentences and words. With these variables, we are able to look at the number of sentences per speech and the mean number of words per sentence. We then look across all the speeches to see which words appear most frequently.


By splitting each speech on periods, I was able to find the number of sentences per speech (with a maximum of 1745 and a minimum of 25 sentences). From the total number of words and the total number of sentences, I was also able to compute the average number of words per sentence (with a maximum of 49.5 and a minimum of 15.3 words per sentence). In the overall trend (shown above), it seems that while the number of sentences is decreasing, the mean number of words per sentence is increasing.
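The original analysis was done in R; as an illustration only, here is a minimal Python sketch of the same idea. It splits a text on periods, counts the words in each sentence, and computes the mean. The function name `sentence_stats`, the word-matching pattern, and the sample text are assumptions made for this example and are not taken from the project's code.

```python
import re


def sentence_stats(text):
    """Return (sentence count, total words, mean words per sentence)."""
    # Split on periods only, mirroring the period-based splitting described above.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Count words as runs of letters, digits, or apostrophes (an assumption).
    word_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    total_words = sum(word_counts)
    mean_words = total_words / len(sentences) if sentences else 0.0
    return len(sentences), total_words, mean_words


# Made-up sample text, not an excerpt from a real address.
sample = "The state of the Union is strong. We have work to do. Let us begin."
n_sentences, total_words, mean_words = sentence_stats(sample)
print(n_sentences, total_words, mean_words)
```

Splitting on periods alone is simple but imperfect: abbreviations such as "U.S." or decimal numbers would be counted as sentence breaks, which may inflate the sentence counts somewhat.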

Another result is that while speeches before the 1880s remained relatively short (around 2000 words), the total length of the speeches has gradually grown (with Obama's speech at more than 6000 words). This can be expected, since world events such as WWI and the Great Depression occurred in the first decades of the 20th century. Also, during that period, with an advanced industrial sector, the nation went through a great deal of social and economic reform. Presidents therefore had more to address in their speeches, which increased their length.

For the second part of the project, we analyze the pattern of words in all the speeches. We first convert every word in each speech to its word stem. Then we run through all the speeches and collect every stem that appears at least once (stored in all.stems). By measuring its length, I discovered that there are 31975 unique word stems across all the speeches. After counting the number of occurrences of each stem, these are the top 25 most frequent words and their counts:

 #  Word    Count
 1  the     151769
 2  of      100598
 3  to       61354
 4  and      61314
 5  in       38654
 6  a        27773
 7  that     21777
 8  it       20660
 9  be       20572
10  for      19407
11  is       17289
12  our      16690
13  by       15373
14  which    12927
15  have     12751
16  as       12259
17  this     11993
18  not      11975
19  are      10861
20  state     9570
21  I         9259
22  govern    9196
23  their     9082
24  from      8973
25  are       8955
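To make the stemming and counting steps concrete, here is a small Python sketch (the project itself used R). Note that `toy_stem` is a hypothetical, deliberately crude suffix stripper written for this example; it is not the stemmer used in the analysis and is much weaker than a real one such as Porter's. The two sample "speeches" are made up.

```python
import re
from collections import Counter

# Hypothetical toy stemmer: strips a few common suffixes.
# This is an illustrative stand-in, not the stemmer used in the project.
SUFFIXES = ("ments", "ment", "ing", "ed", "s")


def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_counts(speeches):
    """Count how often each stem occurs across all speeches."""
    counts = Counter()
    for text in speeches:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(toy_stem(w) for w in words)
    return counts


# Made-up sample texts, not real addresses.
speeches = [
    "The government governs the states.",
    "The state of the Union and the governing of the nation.",
]
counts = stem_counts(speeches)
all_stems = sorted(counts)      # unique stems, analogous to all.stems
top = counts.most_common(2)     # most frequent stems with their counts
print(len(all_stems), top)
```

Here "government", "governs", and "governing" all collapse to "govern", which is why a stem like "govern" can rank so high in the table above even though it is not itself a common word.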

After making the “Distance Between Presidential Word Frequency Distribution” plot (below), it can be observed that most presidents' addresses cluster around (0,0), with Zachary Taylor and Barack Obama being the only obvious outliers. This can be explained by the fact that Zachary Taylor served only briefly (and made only one State of the Union address), and Barack Obama has just recently started his term (also with only one address). Their pairwise distance information therefore cannot be averaged over several addresses, leaving them as outliers.
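The post does not say which distance measure was used or how the plot coordinates were derived. As a rough illustration, the Python sketch below computes pairwise distances between word frequency distributions, assuming Euclidean distance on relative frequencies. The function names and the labels "A", "B", and "C" are hypothetical, and the word lists are made up.

```python
import math
from collections import Counter


def frequency_distribution(words):
    """Map each word to its relative frequency within one president's words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two distributions (an assumed measure)."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))


def pairwise_distances(distributions):
    """Return a dict mapping (name_a, name_b) to the distance between them."""
    names = sorted(distributions)
    return {
        (a, b): euclidean_distance(distributions[a], distributions[b])
        for a in names
        for b in names
    }


# Hypothetical presidents with made-up word lists.
distributions = {
    "A": frequency_distribution(["the", "the", "state", "union"]),
    "B": frequency_distribution(["union", "the", "state", "the"]),
    "C": frequency_distribution(["our", "our", "nation", "nation"]),
}
distances = pairwise_distances(distributions)
print(distances[("A", "B")], distances[("A", "C")])
```

A president whose distribution is estimated from a single address has a noisier word frequency profile than one estimated from many addresses, which is consistent with the explanation above for why Taylor and Obama sit far from the cluster.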
