Sentiment Analysis using Python: Trump and Obama

Using Python to analyse Twitter data

Author: Yusra Farooqui    

23/05/2018

Overview

This project is meant to give an understanding of how sentiment analysis can be run on tweets with Python. Sentiment analysis, or opinion mining, is the “contextual mining of text which identifies and extracts subjective information in source material, helping a business to understand the social sentiment of their brand, product or service while monitoring online conversations.” I will be using the data generated by two presidents of the United States, Donald Trump and Barack Obama, treating them as self-brands. The project will also cover data preparation and cleaning techniques for subsequent analysis.

The syntax is written in Python 3.5.2 using the Spyder IDE. A basic understanding of the Python programming language is required to reuse and understand the script. For the most part the script relies only on standard scientific Python tools (numpy/matplotlib/pandas/re); additionally, you will require the tweepy and textblob libraries.

Before beginning, you will need to install tweepy and textblob via pip. I have consciously made this project comprehensive so that it also serves as a guide to the sheer capacity of Python as a programming language. The project will cover accessing Twitter data, building dataframes with pandas, data preparation and cleaning, visualisations, and sentiment analysis.

All graphs and results created in this project can be found in an Excel dashboard here. The data used for analysis in this project is also found in the same Excel file. The full syntax for the project can be found in my GitHub repository.

Note: The scripts in this article will primarily focus on Donald's tweet data as the exact same methodology & syntax with only minor changes in variable names is used for Obama's tweets. This is not true for results and findings in the article. A copy of the script for Obama's tweets analysis can be found on my GitHub.

Accessing Twitter data with Python

Creating Twitter App

This is one of the easiest steps of the entire process. It requires you to register an app at twitter apps, log in with your twitter account and get the following credentials: a consumer key, a consumer secret, an access token and an access token secret.

Keep your credentials safe and out of reach. They are used to authenticate your app with third-party software.

Accessing Twitter Data

In the following script, copy your credentials into the spaces provided between the quotation marks. The script handles the OAuth authentication behind the scenes, which you don't really need to worry about. We create an object named extract which will be our source for retrieving data from Twitter.

import tweepy

consumer_key = 'your consumer key'

consumer_secret = 'your consumer secret'

access_token = 'your access token'

access_token_secret = 'your access secret'

 

def twitter_Access():
    # OAuthHandler lives in the tweepy namespace
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api

extract = twitter_Access()

Retrieving tweets from other user timelines

In my article Twitter Basics, I have explored different methods of retrieving tweets in Python. Here we will be focusing on the tweets retrieved from different user profiles.  Use the following code to retrieve Donald Trump’s tweets:

Dtweets = extract.user_timeline(screen_name="realDonaldTrump", count=200, tweet_mode='extended')

for tweet in Dtweets[:5]:

    print(tweet.full_text)

    print()

You can subsequently add more parameters to this method, such as the count and tweet_mode arguments used above.

The code does 3 fundamental things: it connects to the user's timeline, retrieves the 200 most recent tweets in extended mode, and prints the full text of the first five.

Output:

...Follow the money! The spy was there early in the campaign and yet never reported Collusion with Russia, because there was no Collusion. He was only there to spy for political reasons and to help Crooked Hillary win - just like they did to Bernie Sanders, who got duped!

If the person placed very early into my campaign wasn’t a SPY put there by the previous Administration for political purposes, how come such a seemingly massive amount of money was paid for services rendered - many times higher than normal...

For the first time since Roe v. Wade, America has a Pro-Life President, a Pro-Life Vice President, a Pro-Life House of Representatives and 25 Pro-Life Republican State Capitals! https://t.co/EfF54tmetT

It was my honor to welcome @NASCAR Cup Series Champion @MartinTruex_Jr and his team to the @WhiteHouse yesterday! https://t.co/5cr2oybrnQ

Today, it was my great honor to welcome President Moon Jae-in of the Republic of Korea to the @WhiteHouse!🇺🇸🇰🇷 https://t.co/yvOxNiA1DM

Similarly, you can retrieve Obama's latest 200 tweets with the following code:

Otweets = extract.user_timeline(screen_name="BarackObama", count=200, tweet_mode='extended')

for tweet in Otweets[:5]:

    print(tweet.full_text)

    print()

Building dataframes with Pandas

When we retrieve tweets from Twitter we collect a lot of data associated with each tweet, which gives more room for subsequent analysis. Run the following command to explore the data we are collecting:

print(dir(Dtweets[0]))

Output:

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'display_text_range', 'entities', 'favorite', 'favorite_count', 'favorited', 'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'truncated', 'user']

The code allows us to see the information associated with the first tweet, indicated by the 0. We do not require most of this information for our analysis, although a lot of interesting analysis could be built on it. We need to create a pandas dataframe to store and view this data methodically; in simple words, we want to store this data in a table. We initially construct our dataframe with the tweet column using the following syntax:

import pandas as pd
import numpy as np

Donald = pd.DataFrame(data=[tweet.full_text for tweet in Dtweets], columns=['Tweets'])

Subsequently, we can add the following columns to the dataframe:

Donald['Tweet_Length'] = np.array([len(tweet.full_text) for tweet in Dtweets])
Donald['ID'] = np.array([tweet.id for tweet in Dtweets])
Donald['Date_Posted'] = np.array([tweet.created_at for tweet in Dtweets])
Donald['OS'] = np.array([tweet.source for tweet in Dtweets])
Donald['Likes'] = np.array([tweet.favorite_count for tweet in Dtweets])
Donald['Retweets'] = np.array([tweet.retweet_count for tweet in Dtweets])

Adding datetime arrays to the dataframe

Run the following syntax to add columns which show the month and weekday name pertaining to the date the tweet was posted. I have utilised Series.dt, which can be used to access the values of the series as datetime objects and return several properties (a Series is just a single column/array in a dataframe). This will be used for an aggregated analysis of tweet frequencies by weekday and month. The column wt is created for sorting purposes in graphs; it contains the weekday in integer form, from 0 to 6, with 0 being Monday.

Donald['Weekday'] = Donald['Date_Posted'].dt.weekday_name
Donald['wt'] = Donald['Date_Posted'].dt.weekday
Donald['Month'] = Donald['Date_Posted'].dt.month
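As a quick, self-contained illustration of what the Series.dt accessor returns (using two invented dates, not tweet data):

```python
import pandas as pd

# 2018-05-21 was a Monday, 2018-05-26 a Saturday
dates = pd.Series(pd.to_datetime(['2018-05-21', '2018-05-26']))

print(dates.dt.weekday.tolist())  # [0, 5] -- 0 is Monday
print(dates.dt.month.tolist())    # [5, 5]
```

The integer weekday is what makes the wt column sort correctly in graphs, since sorting the weekday names alphabetically would scramble the order.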

Lastly, we will reset the index of the dataframe. Although an index is usually already present, this step is recommended when working with loc/iloc and the idx methods, and will save you the agony of indexing errors:

Donald = Donald.reset_index()

Following the same steps, I have created another dataframe named Obama containing data pertaining to Obama's tweets. 

Data types, Nulls and Descriptive analysis

My methodology when pre-processing data is to let the data guide me through the process; I try not to make assumptions or force conclusions on a data-set without hard results. There is a plethora of information on how Twitter data appears in third-party consoles; however, here we will make informed decisions based solely on what the data itself is telling us.

Data Types

One crucial piece of information is knowing the type of data we are working with. This might not seem like a big concern in the general sense, but it immensely affects our syntax when working with different data types. It is also an important aspect to consider when applying comparative and advanced analytics concepts. In Python we have a simple piece of code to get this information:

Donald.dtypes

This will allow us to see the data type such as float, integer, object (string) etc. pertaining to each column in our dataframe.
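For instance, on a tiny invented dataframe the same call distinguishes string and integer columns:

```python
import pandas as pd

df = pd.DataFrame({'Tweets': ['example text'], 'Likes': [10]})
print(df.dtypes)  # Tweets is 'object' (string), Likes is 'int64'
```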

NULLS

The next step is to identify nulls in the data-set and treat them early on. In Python we have a relatively straightforward method to achieve this:

Donald.isnull()

This will give us a multi-dimensional array of records pertaining to our dataframe columns, containing only the Boolean values True or False, with True corresponding to null values. The first step is to find out whether there are any null values in our entire dataset:

Donald.isnull().values.any()

For the above query we will get a Boolean value of either True (has nulls) or False (no nulls). In this case we get a False, stating that there are no nulls in our dataframe. Therefore, we do not need any treatment of data pertaining to null values. However, often that will not be the case. In Python you can also drill down to specific columns containing nulls, even though redundant in this case:

Donald.isnull().any()

And the specific count of nulls as integers in each column:

Donald.isnull().sum()
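To see these methods in action, here is a toy dataframe with invented missing values (our Donald dataframe has none, so the real calls return False and zeros):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Tweets': ['a', 'b', None], 'Likes': [1, np.nan, 3]})

print(toy.isnull().values.any())    # True -- the frame contains nulls
print(toy.isnull().sum().tolist())  # [1, 1] -- one null in each column
```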


Descriptive Analysis

In order to prepare the data, it is vital to explore the basic features of the data in the study. In statistical terms to learn these features we run a descriptive analysis. Python pandas has a describe method to get this information. For pandas dataframe describe generates descriptive statistics that summarise the central tendency, dispersion and shape of a data-set's distribution, excluding NaN values. 

The default condition is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type. The ‘include’ and ‘exclude’ parameters can be used to limit which columns in a DataFrame are analysed for the output. 

This tiny piece of code is particularly useful when working with a large data-set. Note: when working with vast data, I prioritise running describe at quite a few stages of the project to keep a constant check on the attributes of the data during and after data preparation.
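A quick sketch with an invented mixed-type dataframe shows the difference the include parameter makes:

```python
import pandas as pd

df = pd.DataFrame({'Likes': [10, 20, 30], 'OS': ['web', 'web', 'iphone']})

print(df.describe())               # numeric column only: count, mean, std, ...
print(df.describe(include='all'))  # adds object-column stats: unique, top, freq
```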

Use the following syntax on the dataframe of Donald's tweets:

Donald.describe()

There are 3 main findings from the descriptive analysis:

1. We only get results for the numeric columns, as we did not use include='all'.

2. There is a clear error in the Tweet_Length column. The max length is 304, longer than the character limit allowed by Twitter.

3. The minimum value for Likes is 0, which seems strange. This shows that there are tweets with no likes.

Output for describe method for Donald's dataframe:

Data Cleaning and Preparation 

Through descriptive analysis we identified that the tweet length has been incorrectly calculated, showing a tweet with a length of 304 characters, while Twitter restricts a tweet to 280 characters. We will begin by retrieving the tweets with more than 280 characters to dig into the root cause. Additionally, I will use set_option to make sure pandas reveals the entire tweet in the console.

pd.set_option('display.max_colwidth', -1)

Donald.loc[(Donald.Tweet_Length > 280), 'Tweets'].count()

Donald.loc[(Donald.Tweet_Length > 280), 'Tweets'].sample(n=4)

The following tweets are returned:

14    I ask Senator Chuck Schumer, why didn’t President Obama & the Democrats do something about Trade with China, including Theft of Intellectual Property etc.? They did NOTHING! With that being said, Chuck & I have long agreed on this issue! Fair Trade, plus, with China will happen!

19    What ever happened to the Server, at the center of so much Corruption, that the Democratic National Committee REFUSED to hand over to the hard charging (except in the case of Democrats) FBI? They broke into homes & offices early in the morning, but were afraid to take the Server?   

23    Things are really getting ridiculous. The Failing and Crooked (but not as Crooked as Hillary Clinton) @nytimes has done a long & boring story indicating that the World’s most expensive Witch Hunt has found nothing on Russia & me so now they are looking at the rest of the World! 

30    Just met with UN Secretary-General António Guterres who is working hard to “Make the United Nations Great Again.” When the UN does more to solve conflicts around the world, it means the U.S. has less to do and we save money. @NikkiHaley is doing a fantastic job! https://t.co/pqUv6cyH2z

Name: Tweets, dtype: object

We have 17 tweets that go beyond the tweet character limit. On further inspection it is evident that URLs, HTML encoding (&amp;) and @user mentions are inflating the character count. This data is a nuisance when conducting sentiment analysis. We will do a text clean-up using string methods in pandas that support regular expressions.

HTML, URLs and @mentions clean-up with Regex

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialised programming language embedded inside Python and made available through the re module. The great thing about pandas is that its string methods support them. For example, with string methods I can find out how many times Donald has tweeted about Obama:

import re

Donald.Tweets[Donald.Tweets.str.contains('Obama|Barack', flags=re.IGNORECASE)].count()

Basically, I am searching for either Barack or Obama in tweets from Trump. It shows that Donald has tweeted about Obama 7 times. If you want to know the specific tweets with mentions of Obama remove the count() method from the syntax. We can similarly run the same process for Obama:

Obama.Tweets[Obama.Tweets.str.contains('Donald|Trump', flags=re.IGNORECASE)].count()

Not surprisingly, Trump appeared 0 times in Obama's 200 latest tweets.
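The pattern passed to str.contains is a regular expression, so the alternation 'Obama|Barack' matches either name, case-insensitively. A toy series of invented text makes the behaviour clear:

```python
import re
import pandas as pd

s = pd.Series(['Thanks Obama!', 'Infrastructure week', 'barack said so'])

# Rows matching either name, ignoring case
print(s.str.contains('Obama|Barack', flags=re.IGNORECASE).sum())  # 2
```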

I wrote a very simple piece of code to remove all this noise from the data. The first two replace methods search for text beginning with http or @ and delete everything up to the next whitespace. The next method removes every &amp, and lastly rstrip removes any whitespace trailing the last character of the tweet text. The code stores these changes back in our original Tweets column:

Donald['Tweets'] = Donald.Tweets.str.replace(r'http\S+', '').str.replace(r'@\S+', '').str.replace('&amp','').str.rstrip()

We can check our results for discrepancies. First, recalculate the Tweet_Length column from the cleaned Tweets column in our pandas dataframe, then check the tweet length criterion:

Donald['Tweet_Length'] = Donald['Tweets'].apply(len)

(Donald.Tweet_Length > 280).any()

The Boolean result from the last syntax is False; therefore, we no longer have tweets exceeding the actual Twitter limit.
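To see the clean-up chain in isolation, here is the same sequence applied to a single invented tweet (the regex= keyword is spelled out for clarity):

```python
import pandas as pd

s = pd.Series(['Great meeting today! https://t.co/abc123 &amp @SomeUser'])

cleaned = (s.str.replace(r'http\S+', '', regex=True)   # drop URLs
            .str.replace(r'@\S+', '', regex=True)      # drop @mentions
            .str.replace('&amp', '', regex=False)      # drop HTML encoding
            .str.rstrip())                             # trim trailing whitespace

print(cleaned[0])  # 'Great meeting today!'
```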

Zero Likes 

Let's address the elephant in the room. It is hard to fathom that Trump received 0 likes on any of his tweets. I will begin by drilling down into the tweets with 0 likes and checking the date and the tweet text:

Donald.loc[(Donald.Likes == 0), ['Date_Posted', 'Tweets']].sample(n=4)

For some of these tweets the date posted goes back to the 15th of May; therefore, we can rule out recency as the reason for the lack of likes. Apart from original tweets, we also get a set of retweets (tweets beginning with RT) from the user. These retweets show zero likes as the content is not the user's own. A sentiment analysis on retweeted tweets would create ambiguity, so we do not want to include them in our analysis. Unfortunately, we cannot filter them out when extracting the tweets; filtering on different dimensions at extraction time is possible with the premium Twitter API. However, we can remove these retweets from our analysis after collection:

Donald = Donald[Donald.Likes != 0]

The last step is to make sure you save your data. Every time you run the entire syntax the tweets will change in accordance with user activity:

Donald.to_csv('Donald.csv')

Donald.describe()

Visualisations: Weekly Analysis, Time Series & Exploratory Findings

I would like to first get some numbers and figures from the data. We will begin by finding the tweet with the most likes and the time frame of the tweets, from latest to oldest:

#to get oldest tweet data

Donald.loc[Donald['Date_Posted'].idxmin()]

#To get latest tweet date

Donald.loc[Donald['Date_Posted'].idxmax()]

#the most liked tweet

Donald.loc[Donald['Likes'].idxmax()]

You will get the values of all columns corresponding to the conditions in the syntax. Below I have displayed only the relevant results for Donald and Obama:

The first thing we notice is that Trump is more active on Twitter as his most recent 200 tweets only go back to April 2018. However, Obama is more conservative when tweeting, the oldest tweet among his most recent 200 tweets dates back to 2016.

Additionally, it was interesting for me to see whether either of the politicians mentioned other politicians in their tweets:

Donald.Tweets[Donald.Tweets.str.contains('Obama|Barack', flags=re.IGNORECASE)].count()

Donald.Tweets[Donald.Tweets.str.contains('Hillary|Clinton', flags=re.IGNORECASE)].count()

Donald.Tweets[Donald.Tweets.str.contains('Bernie|Sanders', flags=re.IGNORECASE)].count()

Obama had no mentions of any of the three politicians: Bernie, Hillary and Trump. I added this data to a pie chart. Use the information retrieved from the above syntax and add it to the sizes object defined below:

#creating pie chart for politicians mentioned
import matplotlib.pyplot as plt

labels = 'Obama', 'Hillary', 'Bernie'
sizes = [7, 5, 1]
explode = (0.1, 0, 0)  # explode the 1st slice
colors = ['gold', 'lightcoral', 'lightskyblue']

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

Tweets with mentions of other politicians by Donald

I am also interested to see on which day of the week do Obama and Trump tweet the most:

import matplotlib.pyplot as plt

#when using notebook make sure to add

%matplotlib inline  

Weekday_Tweets = Donald.groupby(['wt', 'Weekday']).agg({'Weekday': np.count_nonzero}).plot(colormap='seismic', fontsize=12, kind = 'bar')

Weekday_Tweets.set_ylabel('Count of tweets', fontsize=14)

Weekday_Tweets.set_xlabel('', fontsize=14)

plt.xticks(np.arange(7), ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

plt.title('Trump')

plt.legend(['Count of Tweets']);

Moreover, I will also check which day of the week are Obama's and Trump's tweet on average liked the most:

AvgLikes = Donald.groupby(['wt']).agg({'Likes': np.mean}).plot(colormap='PiYG', fontsize=12, kind='bar')
AvgLikes.set_ylabel('Average Likes on Tweets', fontsize=14)
AvgLikes.set_title('Trump')
AvgLikes.set_xlabel('')
plt.xticks(np.arange(7), ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

Left: Since April 2018 Trump posted the most tweets on Monday

Right: Since September 2016 Obama posted the most tweets on Monday

Left: Average likes on tweets for Donald, for a given weekday

Right: Average likes on tweets for Obama, for a given weekday

The graphs show that even though both presidents tweet the most on working days, the tweets that get liked the most are generally posted on weekends. Several factors can contribute to this, given the subjectivity of the tweets and Twitter users' activity.

To see the pattern of the likes and retweet counts without aggregation we will construct a time series graph. Pandas has its own object for time series. Initially we will create the time series objects and then plot them together. Make sure to run the syntax as one block rather than line by line:

#Time Series, creating time series objects

TTfavD = pd.Series(data=Donald['Likes'].values, index=Donald['Date_Posted'])

TTretD = pd.Series(data=Donald['Retweets'].values, index=Donald['Date_Posted'])

#plotting time series

TTfavD.plot(colormap="coolwarm_r", figsize=(16,4), label="Favourites", legend=True)

TTretD.plot(colormap="copper_r", figsize=(16,4), label="Retweets", legend=True)

plt.xlabel('')

plt.title('Donald')

The results are displayed below:

The figures demonstrate that the retweet count and likes have the same observable pattern over time. Therefore, a tweet with more likes is more likely to be retweeted, and vice versa.
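This visual relationship can also be quantified with a correlation coefficient. A sketch on invented numbers that follow a similar joint pattern (not the real tweet data):

```python
import pandas as pd

df = pd.DataFrame({'Likes':    [100, 250, 90, 400, 180],
                   'Retweets': [30,  80,  25, 130, 55]})

# Pearson correlation; values near 1 indicate strongly linked series
print(df['Likes'].corr(df['Retweets']))
```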

Sentiment analysis

To run sentiment analysis, we need to first install textblob via pip. TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. We will be focusing more on getting the polarity of the tweets from textblob to run sentiment analysis. The objective is to create a simplistic frame for polarity and get an overview of the data. We will run the following syntax to obtain the polarity of the tweets:

from textblob import TextBlob

Donald[['polarity', 'subjectivity']] = Donald['Tweets'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))

We will get the polarity and subjectivity of the tweet text in our dataframe. Polarity ranges from -1 to 1, with -1 being negative, 0 neutral and 1 positive; subjectivity ranges from 0 (objective) to 1 (subjective). I will aggregate this data into 3 different categories:

Donald.loc[(Donald.polarity > 0), 'Tweets'].count()

Donald.loc[(Donald.polarity == 0), 'Tweets'].count()

Donald.loc[(Donald.polarity < 0), 'Tweets'].count()
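The three conditions above partition the polarity scores into positive, neutral and negative buckets. On an invented set of scores:

```python
import pandas as pd

pol = pd.Series([0.5, 0.0, -0.3, 0.8, 0.0, -0.1])

# Counts of positive, neutral and negative scores
print((pol > 0).sum(), (pol == 0).sum(), (pol < 0).sum())  # 2 2 2
```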

After getting an aggregated count of the text polarity, I have plotted it in a pie chart:

#passing polarity to pie chart
labels = ['Positive', 'Negative', 'Neutral']
sizes = [122, 45, 22]
explode = (0, 0, 0.1)  # explode the 3rd slice
colors = ['cyan', 'r', 'orchid']

#pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title('Donald')

We see interesting results. Obama's tweets had more negative connotations compared to Donald's tweets. This has a lot to do with how adjectives are used in the tweets and their subjectivity. For example, the statement 'I am sad' will be assigned negative polarity by TextBlob. This still gives a good general overview of how TextBlob can be used to gauge sentiment in textual data. I drilled down into some of the negative tweets of Obama and Donald:

Donald.loc[(Donald.polarity < 0), 'Tweets'].sample(n=4)

From the sample negative tweets from Obama and Donald, I have selected two tweets at random and have displayed them below:

"Secret Service has just informed me that Senator Jon Tester's statements on Admiral Jackson are not true. There were no such findings. A horrible thing that we in D.C. must live with, just like phony Russian Collusion. Tester should lose race in Montana. Very dishonest and sick!"

"Wildfires in the next few decades could be 'unrecognizable' to previous generations—because of climate change:"

I leave it up to the readers to decide on which tweet pertains to which politician. You may also check for the tweets by analysing the raw data samples here. Further analysis can be done on negative tweets to account for subjectivity and perspective. 

The full syntax is available at my GitHub repository. If there are any questions regarding the topic please feel free to reach out to me via the contact form.