Author: Yusra Farooqui
23/05/2018
This project is meant to give an understanding of how sentiment analysis can be run on tweets with Python. Sentiment analysis, or opinion mining, is "contextual mining of text which identifies and extracts subjective information in source material, helping a business to understand the social sentiment of their brand, product or service while monitoring online conversations." I will be using the data generated by two presidents of the United States, Donald Trump and Barack Obama, treating them as self-brands. The project will also cover data preparation and cleaning techniques for the subsequent analysis.
The syntax is written in Python 3.5.2 using the Spyder IDE. A basic understanding of the Python programming language is required to reuse and understand the script. For the most part the script relies only on standard scientific Python tools (numpy/matplotlib/pandas/re). You will additionally require the following Python libraries:
Tweepy: A well-known Python wrapper for Twitter API.
Textblob: A simple Python library for processing textual data.
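Both libraries can be installed from the command line via pip. If you also plan to use TextBlob's NLP features beyond sentiment scoring, download its corpora as well:

pip install tweepy textblob
python -m textblob.download_corpora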
With both libraries installed, we can begin. I have consciously made this project comprehensive so that it can also serve as a guide to the sheer capacity of Python as a programming language. The project will cover the following topics:
Accessing Twitter Data with Python
Building dataframes with Pandas
Building datetime arrays (weekday and month)
Data types, Nulls and Descriptive analysis
Python Regex for Data Cleaning (HTML, URLs and @mentions)
Pandas Time Series and Weekly Analysis
Graphical Representations of findings
Sentiment analysis
All graphs and results created in this project, along with the data used for the analysis, can be found in an Excel dashboard here. The full syntax for the project can be found on my GitHub repository.
Note: The scripts in this article primarily focus on Donald's tweet data, as the same methodology and syntax, with only minor changes in variable names, is used for Obama's tweets. This does not apply to the results and findings in the article. A copy of the script for the analysis of Obama's tweets can be found on my GitHub.
This is one of the easiest steps of the entire process. It requires you to register an app at Twitter Apps, log in with your Twitter account and get the following credentials:
Consumer Key
Consumer Secret
Access Token
Access Secret
Keep your credentials safe and out of reach. They are used to authenticate your app with the Twitter API.
In the following script, copy your credentials into the space provided between the quotation marks. The script handles the OAuth handshake behind the scenes, which you don't really need to worry about. We create an object named extract which will be our source for retrieving data from Twitter.
import tweepy

consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
access_token = 'your access token'
access_token_secret = 'your access secret'

def twitter_Access():
    #Authenticate with Twitter and return an API object
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api

extract = twitter_Access()
In my article Twitter Basics, I have explored different methods of retrieving tweets in Python. Here we will be focusing on the tweets retrieved from different user profiles. Use the following code to retrieve Donald Trump’s tweets:
Dtweets = extract.user_timeline(screen_name="realDonaldTrump", count=200, tweet_mode='extended')
#Print the five most recent tweets
for tweet in Dtweets[:5]:
    print(tweet.full_text)
    print()
You can subsequently add more parameters to this method:
id – Specifies the ID or screen name of the user.
user_id – Specifies the ID of the user. Helpful for disambiguating when a valid user ID is also a valid screen name.
screen_name – Specifies the screen name of the user. Helpful for disambiguating when a valid screen name is also a user ID.
since_id – Returns only statuses with an ID greater than (that is, more recent than) the specified ID.
max_id – Returns only statuses with an ID less than (that is, older than) or equal to the specified ID.
count – Specifies the number of statuses to retrieve.
page – Specifies the page of results to retrieve. Note: there are pagination limits
The code above does the following:
Gets the full text of tweets with tweet_mode='extended'; retweets, however, are still truncated.
Returns/prints 5 tweets from the latest 200 tweets published by the user as of the time of execution.
Defines an object named Dtweets to hold Donald Trump's tweets.
Output:
...Follow the money! The spy was there early in the campaign and yet never reported Collusion with Russia, because there was no Collusion. He was only there to spy for political reasons and to help Crooked Hillary win - just like they did to Bernie Sanders, who got duped!
If the person placed very early into my campaign wasn’t a SPY put there by the previous Administration for political purposes, how come such a seemingly massive amount of money was paid for services rendered - many times higher than normal...
For the first time since Roe v. Wade, America has a Pro-Life President, a Pro-Life Vice President, a Pro-Life House of Representatives and 25 Pro-Life Republican State Capitals! https://t.co/EfF54tmetT
It was my honor to welcome @NASCAR Cup Series Champion @MartinTruex_Jr and his team to the @WhiteHouse yesterday! https://t.co/5cr2oybrnQ
Today, it was my great honor to welcome President Moon Jae-in of the Republic of Korea to the @WhiteHouse!🇺🇸🇰🇷 https://t.co/yvOxNiA1DM
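A single call to user_timeline returns at most 200 tweets. If you ever need to reach further back, the since_id and max_id parameters listed above can drive a simple paging loop. A minimal sketch, assuming the extract object defined earlier (names are illustrative); note that Twitter caps the accessible history at roughly the most recent 3,200 tweets:

#Page backwards through the timeline in batches of up to 200 tweets
all_tweets = list(extract.user_timeline(screen_name="realDonaldTrump", count=200, tweet_mode='extended'))
while all_tweets:
    batch = extract.user_timeline(screen_name="realDonaldTrump", count=200,
                                  tweet_mode='extended', max_id=all_tweets[-1].id - 1)
    if not batch:
        break
    all_tweets.extend(batch)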
Similarly, you can retrieve Obama's latest 200 tweets with the following code:
Otweets = extract.user_timeline(screen_name="BarackObama", count=200, tweet_mode='extended')
for tweet in Otweets[:5]:
    print(tweet.full_text)
    print()
When we retrieve tweets from Twitter we collect a lot of data associated with each tweet, which gives more room for subsequent analysis. Run the following command to explore the data we are collecting:
print(dir(Dtweets[0]))
Output:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'display_text_range', 'entities', 'favorite', 'favorite_count', 'favorited', 'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'truncated', 'user']
The code lets us see the information associated with the first tweet, indicated by the 0. We do not require most of this information for our analysis, although a lot of interesting analysis could be built on all this available data. We need to create a pandas dataframe to store and view this data methodically; in simple words, we want to store this data in a table. We initially construct our dataframe with the tweet column using the following syntax:
import pandas as pd
import numpy as np

Donald = pd.DataFrame(data=[tweet.full_text for tweet in Dtweets], columns=['Tweets'])
Subsequently, we can add the following columns to the dataframe:
Donald['Tweet_Length'] = np.array([len(tweet.full_text) for tweet in Dtweets])
Donald['ID'] = np.array([tweet.id for tweet in Dtweets])
Donald['Date_Posted'] = np.array([tweet.created_at for tweet in Dtweets])
Donald['OS'] = np.array([tweet.source for tweet in Dtweets])
Donald['Likes'] = np.array([tweet.favorite_count for tweet in Dtweets])
Donald['Retweets'] = np.array([tweet.retweet_count for tweet in Dtweets])
Adding datetime arrays to the dataframe
Run the following syntax to add columns showing the month and weekday name pertaining to each tweet's posted date. I have utilised Series.dt, which can be used to access the values of the series as datetime objects and return several properties (a Series is simply a single column/array of a dataframe). This will be used for an aggregated analysis of tweet frequencies by weekday and month. The column wt is created for sorting purposes in graphs; it contains the weekday as an integer from 0 to 6, with 0 being Monday.
Donald['Weekday'] = Donald['Date_Posted'].dt.weekday_name
Donald['wt'] = Donald['Date_Posted'].dt.weekday
Donald['Month'] = Donald['Date_Posted'].dt.month
Lastly, we will reset the index of the dataframe. Although an index is usually already present, this step is recommended when working with pandas' loc/iloc and the idxmin/idxmax methods, and will save you the agony of indexing errors:
Donald = Donald.reset_index()
Following the same steps, I have created another dataframe named Obama containing data pertaining to Obama's tweets.
My methodology when pre-processing the data is to let the data guide me through the process; therefore, I try not to make assumptions or force conclusions on a data-set without hard results. There may be a plethora of information on how Twitter data appears in third-party consoles, but here we will make informed decisions solely based on what the data is telling us.
One crucial piece of information is knowing the type of data we are working with. This might not seem like a big concern in the general sense, but it immensely affects our syntax when working with different data types. It is also an important aspect to consider when applying comparative and advanced analytics concepts. In Python we have a simple piece of code to get this information:
Donald.dtypes
This will allow us to see the data type such as float, integer, object (string) etc. pertaining to each column in our dataframe.
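For instance, if Date_Posted had come through as a plain object column instead of datetime64, the .dt accessors used earlier would fail. A hedged one-liner to coerce it (not needed here, since tweepy returns proper datetime objects):

#Coerce the column to datetime64 if dtypes reports it as object
Donald['Date_Posted'] = pd.to_datetime(Donald['Date_Posted'])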
The next step is to identify nulls in the data-set and treat them early on. In python we have a relatively straightforward method to achieve this:
Donald.isnull()
This will give us a multi-dimensional array of records pertaining to our dataframe columns, containing only the Boolean values True or False, with True corresponding to null values. The first step is to find out whether there are any null values in our entire dataset:
Donald.isnull().values.any()
For the above query we will get a Boolean value of either True (has nulls) or False (no nulls). In this case we get False, stating that there are no nulls in our dataframe; therefore, no treatment of null values is needed. However, often that will not be the case. In Python you can also drill down to the specific columns containing nulls, even though it is redundant in this case:
Donald.isnull().any()
And the specific count of nulls as integers in each column:
Donald.isnull().sum()
Figure: Left, the Boolean output per column; Right, the integer count of nulls per column.
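Had any nulls shown up, pandas makes the common treatments equally brief. A sketch of two typical options, illustrative only since this dataframe has none:

#Option 1: drop rows whose tweet text is missing
Donald = Donald.dropna(subset=['Tweets'])
#Option 2: fill missing numeric values with a neutral default
Donald['Likes'] = Donald['Likes'].fillna(0)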
In order to prepare the data, it is vital to explore the basic features of the data in the study. In statistical terms to learn these features we run a descriptive analysis. Python pandas has a describe method to get this information. For pandas dataframe describe generates descriptive statistics that summarise the central tendency, dispersion and shape of a data-set's distribution, excluding NaN values.
The default condition is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type. The ‘include’ and ‘exclude’ parameters can be used to limit which columns in a DataFrame are analysed for the output.
This tiny piece of code is particularly useful when working with a large data-set. Note: when working with vast data, I make a point of running describe at several stages of the project to keep a constant check on the attributes of the data during and after preparation.
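For example, to pull the object columns into the summary alongside the numeric ones:

#Summarise every column, not just the numeric ones
Donald.describe(include='all')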
Use the following syntax on Donald tweets’ dataframe:
Donald.describe()
1. We only get results from the numeric columns, as we did not use include='all'.
2. There is a clear error in the Tweet_Length column: the max length is 304, longer than the character limit allowed by Twitter.
3. The minimum value for Likes is 0, which seems strange; it shows that there are tweets with no likes.
Output of the describe method for Donald's dataframe:
Through descriptive analysis we identified that the tweet length has been incorrectly calculated, showing a tweet with a length of 304 characters even though Twitter restricts tweets to 280 characters. We will begin by retrieving the tweets with more than 280 characters to dig into the root cause. Additionally, I will use set_option to make sure pandas reveals the entire tweet in the console.
pd.set_option('display.max_colwidth', -1)
Donald.loc[(Donald.Tweet_Length > 280), 'Tweets'].count()
Donald.loc[(Donald.Tweet_Length > 280), 'Tweets'].sample(n=4)
The following tweets are returned:
14 I ask Senator Chuck Schumer, why didn’t President Obama & the Democrats do something about Trade with China, including Theft of Intellectual Property etc.? They did NOTHING! With that being said, Chuck & I have long agreed on this issue! Fair Trade, plus, with China will happen!
19 What ever happened to the Server, at the center of so much Corruption, that the Democratic National Committee REFUSED to hand over to the hard charging (except in the case of Democrats) FBI? They broke into homes & offices early in the morning, but were afraid to take the Server?
23 Things are really getting ridiculous. The Failing and Crooked (but not as Crooked as Hillary Clinton) @nytimes has done a long & boring story indicating that the World’s most expensive Witch Hunt has found nothing on Russia & me so now they are looking at the rest of the World!
30 Just met with UN Secretary-General António Guterres who is working hard to “Make the United Nations Great Again.” When the UN does more to solve conflicts around the world, it means the U.S. has less to do and we save money. @NikkiHaley is doing a fantastic job! https://t.co/pqUv6cyH2z
Name: Tweets, dtype: object
We have 17 tweets that go beyond the tweet character limit. On further inspection it is clear that URLs, HTML encoding (&amp;) and @user mentions are inflating the character count. This data is a nuisance when conducting sentiment analysis, so we will clean the text using the string methods in pandas that support regular expressions.
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialised programming language embedded inside Python and made available through the re module. The great thing about the pandas package is that its string methods support regular expressions. For example, with string methods I can find out how many times Donald has tweeted about Obama:
import re

Donald.Tweets[Donald.Tweets.str.contains('Obama|Barack', flags=re.IGNORECASE)].count()
Basically, I am searching for either Barack or Obama in tweets from Trump; it shows that Donald has tweeted about Obama 7 times. If you want to see the specific tweets mentioning Obama, remove the count() method from the syntax. We can run the same process for Obama:
Obama.Tweets[Obama.Tweets.str.contains('Donald|Trump', flags=re.IGNORECASE)].count()
Not surprisingly, Trump appeared 0 times in Obama's latest 200 tweets.
I wrote a very simple piece of code to remove all this noise from the data. The first two replace methods search for text beginning with http or @ and delete everything up to the next whitespace. The next method removes all &amp; entities, and lastly we use rstrip to remove any whitespace trailing the last character of the tweet. The code stores these changes back in our original Tweets column:
Donald['Tweets'] = Donald.Tweets.str.replace(r'http\S+', '').str.replace(r'@\S+', '').str.replace('&amp;', '').str.rstrip()
We can check our results for discrepancies. Recreate the Tweet_Length column from the cleaned Tweets column in our dataframe, then re-check the tweet length criterion:
Donald['Tweet_Length'] = Donald['Tweets'].apply(len)
(Donald.Tweet_Length > 280).any()
The Boolean result from the last line is False; therefore, we no longer have tweets exceeding the actual Twitter limit.
Zero Likes
Let's address the elephant in the room: it is hard to fathom that Trump received 0 likes on any of his tweets. I will begin by drilling down into the tweets with 0 likes and checking the posted date and tweet text:
Donald.loc[(Donald.Likes == 0), ['Date_Posted', 'Tweets']].sample(n=4)
For some of these tweets the posted date goes back to the 15th of May, so we can rule out tweet freshness as the reason for the missing likes. Apart from original tweets we also get a set of retweets (tweets beginning with RT) from the user. These retweets show zero likes because the content is not the user's own. A sentiment analysis on retweeted tweets would create ambiguity, so we do not want to include them in our analysis. Unfortunately, we cannot filter them out while extracting the tweets; filtering on different dimensions at extraction time is only possible with the premium Twitter API. However, we can remove these retweets from our analysis after collection:
Donald = Donald[Donald.Likes != 0]
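An alternative, since retweets always begin with RT, is to filter on the text prefix directly. A sketch that should give the same outcome here, with the advantage of keeping any original tweets that genuinely have zero likes:

#Drop retweets by their 'RT' prefix instead of their zero like-count
Donald = Donald[~Donald.Tweets.str.startswith('RT')]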
The last step is to make sure you save your data, as every time you run the entire syntax the tweets will change according to the user's activity:
Donald.to_csv('Donald.csv')
Donald.describe()
I would like to first get some numbers and figures from the data. We will begin by finding the tweet with the most likes and the time frame of the tweets from latest to oldest:
#To get the oldest tweet date
Donald.loc[Donald['Date_Posted'].idxmin()]
#To get the latest tweet date
Donald.loc[Donald['Date_Posted'].idxmax()]
#To get the most liked tweet
Donald.loc[Donald['Likes'].idxmax()]
You will get the values of all columns corresponding to the conditions in the syntax. Below I have displayed only the relevant results for Donald and Obama:
The first thing we notice is that Trump is more active on Twitter, as his most recent 200 tweets only go back to April 2018. Obama is more conservative when tweeting: the oldest tweet among his most recent 200 dates back to 2016.
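To put a rough number on that difference in activity, you can divide the tweet count by the span of the timeline. A quick sketch:

#Approximate tweeting rate over the window covered by these tweets
span_days = (Donald['Date_Posted'].max() - Donald['Date_Posted'].min()).days
print(len(Donald) / span_days, 'tweets per day')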
Additionally, it was interesting for me to see whether either of the politicians mentioned other politicians in their tweets:
Donald.Tweets[Donald.Tweets.str.contains('Obama|Barack', flags=re.IGNORECASE)].count()
Donald.Tweets[Donald.Tweets.str.contains('Hillary|Clinton', flags=re.IGNORECASE)].count()
Donald.Tweets[Donald.Tweets.str.contains('Bernie|Sanders', flags=re.IGNORECASE)].count()
Obama had no mentions of any of the three politicians: Bernie, Hillary and Trump. I added this data to a pie chart. Use the information retrieved from the above syntax and add it to the sizes object defined below:
#Creating a pie chart for politicians mentioned by Donald
import matplotlib.pyplot as plt

labels = 'Obama', 'Hillary', 'Bernie'
sizes = [7, 5, 1]
explode = (0.1, 0, 0)  # explode 1st slice
colors = ['gold', 'lightcoral', 'lightskyblue']

#Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Tweets with mentions of other politicians by Donald
I am also interested to see on which day of the week do Obama and Trump tweet the most:
import matplotlib.pyplot as plt
#when using notebook make sure to add
%matplotlib inline
Weekday_Tweets = Donald.groupby(['wt', 'Weekday']).agg({'Weekday': np.count_nonzero}).plot(colormap='seismic', fontsize=12, kind = 'bar')
Weekday_Tweets.set_ylabel('Count of tweets', fontsize=14)
Weekday_Tweets.set_xlabel('', fontsize=14)
plt.xticks(np.arange(7), ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
plt.title('Trump')
plt.legend(['Count of Tweets']);
Moreover, I will also check which day of the week are Obama's and Trump's tweet on average liked the most:
AvgLikes = Donald.groupby(['wt']).agg({'Likes': np.mean}).plot(colormap='PiYG', fontsize=12, kind='bar')
AvgLikes.set_ylabel('Average Likes on Tweets', fontsize=14)
AvgLikes.set_xlabel('')
AvgLikes.set_title('Trump')
plt.xticks(np.arange(7), ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
Left: Since April 2018 Trump posted the most tweets on Monday
Right: Since September 2016 Obama posted the most tweets on Monday
Left: Average likes on tweets for Donald, for a given weekday
Right: Average likes on tweets for Obama, for a given weekday
The graphs show that even though both presidents tweet the most on working days, the tweets that get liked the most are generally posted on weekends. Several factors could contribute to this, given the subjectivity of the tweets and Twitter users' activity.
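One quick way to check this numerically is to compare average likes on weekdays against weekends, using the wt column created earlier (5 and 6 being Saturday and Sunday):

#Mean likes for weekday (False) versus weekend (True) tweets
print(Donald.groupby(Donald['wt'] >= 5)['Likes'].mean())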
To see the pattern of likes and retweet counts without aggregation, we will construct a time series graph. Pandas has its own object for time series: we first create the time series objects and then plot them in one figure. Make sure to run the syntax as one block rather than line by line:
#Time series: creating the time series objects
TTfavD = pd.Series(data=Donald['Likes'].values, index=Donald['Date_Posted'])
TTretD = pd.Series(data=Donald['Retweets'].values, index=Donald['Date_Posted'])

#Plotting the time series
TTfavD.plot(colormap="coolwarm_r", figsize=(16,4), label="Favourites", legend=True)
TTretD.plot(colormap="copper_r", figsize=(16,4), label="Retweets", legend=True)
plt.xlabel('')
plt.title('Donald')
The results are displayed below:
The figures demonstrate that the retweet count and likes follow the same observable pattern over time; therefore, a tweet with more likes is more likely to be retweeted, and vice versa.
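That visual impression can be backed with a single statistic, the Pearson correlation between the two columns; a value close to 1 would support the claim:

#Correlation between likes and retweets
print(Donald['Likes'].corr(Donald['Retweets']))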
To run sentiment analysis we will use TextBlob, the library installed via pip at the start. TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. We will focus on getting the polarity of the tweets from TextBlob to run sentiment analysis. The objective is to create a simple frame for polarity and get an overview of the data. Run the following syntax to obtain the polarity of the tweets:
from textblob import TextBlob
Donald[['polarity', 'subjectivity']] = Donald['Tweets'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))
We get the polarity and subjectivity of the tweet text in our dataframe. Polarity ranges from -1 to 1, with -1 being negative, 0 neutral and 1 positive (subjectivity ranges from 0, objective, to 1, subjective). I will aggregate this data into 3 different categories:
Donald.loc[(Donald.polarity > 0), 'Tweets'].count()
Donald.loc[(Donald.polarity == 0), 'Tweets'].count()
Donald.loc[(Donald.polarity < 0), 'Tweets'].count()
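Rather than reading these three counts off the console and typing them into the chart by hand, you could also label each tweet and count the labels. A small sketch using numpy (imported earlier as np):

#Label tweets by the sign of their polarity, then count each label
Donald['Sentiment'] = np.select(
    [Donald.polarity > 0, Donald.polarity < 0],
    ['Positive', 'Negative'], default='Neutral')
print(Donald['Sentiment'].value_counts())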
After getting an aggregated count of the text polarity, I have plotted it in a pie chart:
#Passing polarity counts to a pie chart
labels = ['Positive', 'Negative', 'Neutral']
sizes = [122, 45, 22]
explode = (0, 0, 0.1)  # explode the 3rd (Neutral) slice
colors = ['cyan', 'r', 'orchid']

#Pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title('Donald')
We see interesting results: Obama's tweets had more negative connotations compared to Donald's tweets. This has a lot to do with how adjectives are used in the tweets and the subjectivity of the tweets. For example, stating 'I am sad' will be assigned negative polarity by TextBlob. This still gives a good general sense of how TextBlob can be used to provide an overview of sentiment in textual data.
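You can verify the 'I am sad' example on a toy input directly (TextBlob is already imported above); the exact numbers depend on TextBlob's default analyzer and lexicon:

#A toy check: with the default PatternAnalyzer this prints something like
#Sentiment(polarity=-0.5, subjectivity=1.0)
print(TextBlob("I am sad").sentiment)

I then drilled down on some of the negative tweets of Obama and Donald: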
Donald.loc[(Donald.polarity < 0), 'Tweets'].sample(n=4)
From the sample negative tweets from Obama and Donald, I have selected two tweets at random and have displayed them below:
"Secret Service has just informed me that Senator Jon Tester's statements on Admiral Jackson are not true. There were no such findings. A horrible thing that we in D.C. must live with, just like phony Russian Collusion. Tester should lose race in Montana. Very dishonest and sick!"
"Wildfires in the next few decades could be "unrecognizable" to previous generations—because of climate change:"
I leave it up to the readers to decide on which tweet pertains to which politician. You may also check for the tweets by analysing the raw data samples here. Further analysis can be done on negative tweets to account for subjectivity and perspective.
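As a starting point for that further analysis, the subjectivity column we already computed can be used to separate opinionated negative tweets from more factual ones. A sketch with an illustrative 0.5 cut-off:

#Negative tweets that TextBlob also rates as highly subjective
Donald.loc[(Donald.polarity < 0) & (Donald.subjectivity > 0.5), 'Tweets']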
The full syntax is available at my GitHub repository. If there are any questions regarding the topic please feel free to reach out to me via the contact form.