For many, social media has become the preferred outlet to profess opinions and express personal endorsements for candidates. Twitter, being public and widely used, offers a potentially powerful way to gauge the political pulse of the electorate. Now that the dust has settled after ‘Super Tuesday’, let’s take a look at the primaries so far.
We collected political tweets by pattern matching for keywords related to each candidate in a 10% sample of Twitter’s streaming API.1 Below, we show daily tweet counts between January 1, 2015 and February 25, 2016, along with the average sentiment of each day’s political tweets (calculated using our LabMT happiness database). Peaks in the frequency during 2015 correspond to televised political debates. The average happiness plot requires some investigation, and we will focus on the sentiment surrounding the main candidates.
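As a rough illustration of the collection step, the sketch below tags tweets by case-insensitive keyword matching and tallies daily counts per candidate. The keyword patterns, the `(day, text)` input format, and the function names are assumptions made for this example; the actual keyword lists used in the study are not given here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword patterns; the real keyword lists are not listed in this post.
CANDIDATE_PATTERNS = {
    "sanders": re.compile(r"\b(bernie|sanders)\b", re.IGNORECASE),
    "clinton": re.compile(r"\b(hillary|clinton)\b", re.IGNORECASE),
    "trump": re.compile(r"\btrump\b", re.IGNORECASE),
}


def tag_candidates(text):
    """Return every candidate whose keywords appear in the tweet text."""
    return [name for name, pattern in CANDIDATE_PATTERNS.items() if pattern.search(text)]


def daily_counts(tweets):
    """Count tagged tweets per candidate per day.

    `tweets` is an iterable of (day, text) pairs (an assumed input format).
    A tweet mentioning several candidates is counted once for each of them.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        for name in tag_candidates(text):
            counts[name][day] += 1
    return counts


sample = [
    ("2016-02-01", "Bernie will win"),
    ("2016-02-01", "TRUMP and Clinton debate tonight"),
    ("2016-02-02", "I love Sanders"),
]
counts = daily_counts(sample)
```

Because a tweet can match more than one pattern, the per-candidate counts overlap, which is why mentions of one candidate can include tweets tagged with others.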
Let’s start with the frequency and happiness time series for the two remaining Democratic candidates, Hillary Clinton and Bernie Sanders. For each candidate, mentions include tweets from both supporters and opponents, as well as tweets that have been tagged with multiple candidates. In the happiness time series, we see that tweets related to Sanders are on average slightly more positive than those related to Clinton (5.85 vs 5.70). Of the frontrunners in both parties, tweets tagged with Sanders’ keywords have the highest computed average word happiness. Later in this post, we will investigate the specific words that are causing these differences.
For the GOP-related tweet counts over the same time frame, Trump is the clear leader in terms of mentions on Twitter. The other candidates are barely visible on the same set of axes. The average computed happiness of tweets surrounding each GOP candidate is comparable. Trump has a slight lead in happiness over Cruz and Rubio, and a commanding lead over Carson. His computed word happiness is slightly more positive than Clinton’s (5.71 vs 5.70), but still less positive than Sanders’ (5.79 vs 5.85).
Knowing the relative happiness values is just a start: we have to look at which words are driving the scores for each candidate. For the word clouds below, words are colored by sentiment (lighter teal for happier, darker purple for sadder) and sized by their weighted tf-idf scores, a combination of raw frequency and the relative ‘surprise’ factor across all tweets.2 (Clicking on an image will show a higher quality version.3)
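To make the sizing rule concrete, here is a minimal tf-idf sketch in which all tweets mentioning a candidate are pooled into one “document”. It uses the textbook form (relative frequency times the log of inverse document frequency); the exact weighting behind the word clouds is not specified in this post, so treat the formula and the function name `tfidf_by_candidate` as assumptions.

```python
import math
from collections import Counter


def tfidf_by_candidate(docs):
    """Score words per candidate with a plain tf-idf.

    `docs` maps a candidate name to the list of tokens from all tweets
    mentioning that candidate (one pooled "document" per candidate).
    """
    n_docs = len(docs)
    # Document frequency: in how many candidate documents does each word appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        term_counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_counts.items()
        }
    return scores


docs = {
    "a": ["vote", "free", "free"],
    "b": ["vote", "wall"],
}
scores = tfidf_by_candidate(docs)
```

In this toy input, ‘vote’ appears in both documents and scores zero, while ‘free’ and ‘wall’ are each unique to one document and score above zero. That is the filtering effect that lets the clouds emphasize the words most distinctive of each candidate.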
Words in Sanders’ cloud are similar in size, suggesting there is a wide range of conversation regarding the Senator’s campaign; his tweets reflect a discussion of a host of issues. The positive words in Sanders’ cloud include ‘endorsement’, ‘truth’, ‘support’, and ‘winning’. Negative words are political in nature and may be referencing his position on fighting ‘greed’ on Wall Street as well as ‘problems’ and ‘battles’ average people are facing financially, a central focus of his campaign.
Clinton has a strong mix of positive and negative words in her cloud. Heavily weighted positive words include ‘experience’, ‘talented’, ‘woman’, and ‘world’. Negative words include ‘criminal’, ‘investigation’ and ‘liar’, probably referring to the email server scandal.
The GOP word clouds reflect similar themes. Most of Trump’s highest weighted words may be from supporters describing his ‘movement’, and current ‘winnings’ in many of the Republican primaries. Negative words, likely authored by citizens opposing him, include ‘assault’, ‘attack’, ‘racist’, ‘liar’, and ‘dangerous’. Senator Cruz has a similar word cloud, with positives describing his supporters’ political ideals, and negatives involving political jargon as well as ‘lying’. Rubio and Carson appear to have the most negative word clouds.
We next investigate these average happiness score differences quantitatively using word shift graphs. Word shift graphs display the words that contribute most to the difference in average happiness between a comparison corpus and a reference corpus. For these graphs, the reference corpus consists of all the political tweets we collected, excluding those tagged with the candidate being compared. These word shifts were created with tweets from January through February 25, 2016.
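The idea behind a word shift can be sketched in a few lines. Each word contributes the product of two terms: how far its happiness score sits from the reference average, and how much its relative frequency changes between the two corpora. Summed over all words, these contributions equal the difference in average happiness. The tiny lexicon and its scores below are invented for illustration and are not LabMT values; the function names are also hypothetical.

```python
from collections import Counter


def word_freqs(tokens, lexicon):
    """Relative frequencies of the tokens that have a happiness score."""
    counts = Counter(t for t in tokens if t in lexicon)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def average_happiness(freqs, lexicon):
    """Frequency-weighted average of word happiness scores."""
    return sum(lexicon[word] * p for word, p in freqs.items())


def word_shift(ref_tokens, comp_tokens, lexicon):
    """Per-word contributions to (comparison average - reference average).

    Returns (word, contribution) pairs sorted by absolute contribution.
    """
    p_ref = word_freqs(ref_tokens, lexicon)
    p_comp = word_freqs(comp_tokens, lexicon)
    h_ref = average_happiness(p_ref, lexicon)
    contributions = {
        word: (lexicon[word] - h_ref) * (p_comp.get(word, 0.0) - p_ref.get(word, 0.0))
        for word in set(p_ref) | set(p_comp)
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Illustrative scores only (not taken from LabMT).
lexicon = {"win": 8.0, "hate": 2.0, "free": 7.0}
reference = ["win", "hate", "hate", "the"]
comparison = ["win", "free", "hate", "the"]
shifts = word_shift(reference, comparison, lexicon)
```

The sign of the first factor says whether a word is relatively positive (+) or negative (-), and the sign of the second says whether it is used more (↑) or less (↓) in the comparison corpus. In the toy data, ‘free’ is a positive word used more and ‘hate’ is a negative word used less, so both raise the comparison average.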
The first word shift compares Sanders’ tweets to the rest of the collected political tweets. Tweets tagged with a Sanders keyword are more positive (5.86 vs 5.75) than both the reference distribution and every other candidate’s tweets. This shift is due to increased (↑) mentions of the positive words (+) ‘free’, ‘win’, ‘health’, ‘we’, ‘young’, ‘democratic’, ‘college’, etc., as well as fewer mentions (↓) of the negative words (-) ‘liar’, ‘lying’, ‘hate’, ‘lies’, ‘no’, ‘loser’, ‘bad’, etc. The highest negative contributors are mentions of ‘arrested’, ‘arrest’, and ‘protest’, which likely reference the Senator being arrested for protesting segregation during the civil rights movement.
Clinton’s sentiment is close to that of the reference distribution (5.76 vs 5.77). Negative words refer to the email investigation and include ‘jail’, ‘criminal’, ‘prison’, ‘scandal’, ‘indicted’, etc. Of note, ‘bill’ is a negatively coded word (interpreted to mean paying bills), but in this context it is evidently a reference to Bill Clinton (pattern matches are case-insensitive). Positive contributions reflect increased mentions of the words ‘she’, ‘women’, ‘thanks’, ‘health’, and fewer mentions of the negative words ‘hate’, ‘sad’, ‘loser’, ‘fraud’, ‘racist’.
For the GOP candidates (below), Trump has the highest computed happiness level (5.79). Positive contributions include more ‘great’, ‘love’, ‘america’, ‘better’, ‘loves’, which clearly connect to his slogan ‘Make America Great Again’. More interestingly, his negative contributions include much more ‘hate’, ‘racist’, ‘died’, ‘loser’, ‘sad’, ‘ban’, and forms of profanity, suggesting that tweets mentioning Trump reflect the sentiments of opponents as well. Two notes: We didn’t specifically analyze tweets authored by Trump or his campaign, and Lexicon Valley devoted a whole podcast to the surprising evolution of the word ‘sad’, inspired by Trump’s usage.
The final three Republican candidates have very negative word shifts, which would seem to speak to the issues they have focused on in the debates and their campaigns as a whole.
We re-emphasize that this preliminary study did not differentiate between tweets that were supportive or critical of each of the candidates. To do so, we are building a binary classifier trained on politically relevant sentiments to classify individual tweets as “for” or “against” each candidate.
Social media is empowering the general populace with the ability to amplify and organize their voices and political opinions, which may significantly shape the outcome of the election. From what we’ve seen so far, if positivity were the only predictor (and of course it’s not), we would expect a Sanders versus Trump Presidential Election.
1. Note that all frequencies reported above reflect a subsample of the 10% Gardenhose API, itself a random subsample of all public tweets. As a result, total mentions of a candidate ought to be roughly a factor of 10 larger than the numbers reported here.
2. One way to visualize the different types of emotionally charged words unique to each candidate is to use term frequency-inverse document frequency (tf-idf) statistics. Using this measure, we combine the set of tweets mentioning each candidate into a single “document” and then weight words by their frequency of appearance and relative uniqueness across all documents. The method allows us to filter out words common to all candidates and focus on the words that most distinguish each.
3. The word clouds were created by modifying Amueller’s open source Python word cloud library.