Author: Yusra Farooqui
01/06/2018
This project uncovers the development of a TV advertising campaign using simple statistical and graphical tools in Python to showcase the campaign's viability and performance. It processes and analyses the data and finally represents it in graphs and figures to create meaning out of it. I have a data set of shape (2112, 6) containing data on a TV campaign. To be precise, the data set includes the date-time, the different channels the advertisement was run on, the TV viewership corresponding to each time and channel, and the visits (web traffic) on the company's website recorded at the same time the advertisement was run. This is quite a lot of interesting data, and my objective here is to create visualisations to see how the campaign develops over time. Kathleen McKeown, the director of the Data Science Institute at Columbia University, states that visualisations are important for exploring data and for developing intuitions about how to go about solving a problem. In some fields, such as clinical research, data interpretation relies heavily on visualisations. This is usually a preliminary step before tackling mathematical models and algorithms. McKeown emphasises that, as we are visual beings, scientists should have the ability to correctly visualise their data.
In this case we are interested in understanding the times of the day and days of the week when our audience is most active and engaged on a given channel. This allows us to see whether further relationships can be established to tap into our audience reach and engagement. We will be covering the following topics in this project:
Processing of data
Analysis of data
Visualisations
We will be using only three basic Python libraries: pandas, numpy and matplotlib. An advanced understanding of the Python programming language is required to reuse this code. The script was written in the Spyder IDE with Python 3.5.5.
The data set for this project was randomly generated using set parameters that mimic actual campaign data. Pandas was developed primarily for financial modelling, so, as you might expect, it contains a fairly extensive set of tools for working with dates, times and time-indexed data. The date-time column was created using this functionality of pandas, while the channel, views and traffic values were randomly generated using numpy.
DISCLAIMER: The data set used in this project is completely randomly generated with no real meaning or emphasis on the traffic and views generated on any channels. The channels are randomly selected and the data was randomly generated for a given range.
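For illustration only, a comparable dummy data set could be put together roughly as follows; the channel list, value ranges and random seed below are placeholders, not the actual parameters used:
import numpy as np
import pandas as pd

np.random.seed(0)  # purely for reproducibility of this illustration
n = 2112
# random air times within July 2017, rounded to the minute
start = pd.Timestamp("2017-07-01")
aired = start + pd.to_timedelta(np.sort(np.random.randint(0, 31 * 24 * 60, size=n)), unit="m")
dummy = pd.DataFrame({
    "Date": aired.normalize(),                     # date part only
    "Time": aired.strftime("%H:%M"),               # time of day as a string, as in the real file
    "Channel": np.random.choice(["Rai1", "MBC", "Comedy", "Eurosport"], size=n),
    "WebTraffic": np.random.randint(0, 650, size=n),
    "Views": np.random.randint(1, 1261, size=n) * 1300,  # views in the real file appear in multiples of 1300
})
dummy.to_csv("campaign.csv", sep=";", index=False)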
After importing my data set and storing it in pandas, I will begin by understanding what my data set entails. Below I have written some simple queries with their outputs that allow me to see the kind of data I am dealing with. I learn that I have no nulls in my data set and that the data types include dates, objects and integers. The descriptive statistics from the describe method show us that the maximum generated traffic on the website is 648 visits and the maximum TV audience of the campaign corresponds to 1,638,000 views. However, these findings are not dependent on each other: it is not conclusive whether the maximum Views correspond to the maximum Traffic.
import pandas as pd
import numpy as np
campaign = pd.read_csv("campaign.csv", sep=";", parse_dates=["Date"], dayfirst=True)
campaign.columns
Out[1]: Index(['Date', 'Time', 'Channel', 'WebTraffic', 'Views'], dtype='object')
campaign.isnull().values.any()
Out[2]: False
campaign.dtypes
Out[3]:
Date datetime64[ns]
Time object
Channel object
WebTraffic int64
Views int64
dtype: object
campaign.describe()
Out[4]:
WebTraffic Views
count 2112.000000 2.112000e+03
mean 33.932292 1.066172e+05
std 60.201105 1.831076e+05
min 0.000000 1.300000e+03
25% 6.000000 3.640000e+04
50% 15.000000 5.460000e+04
75% 33.000000 1.092000e+05
max 648.000000 1.638000e+06
campaign.head()
Out[5]:
Date Time Channel Traffic Views
0 2017-07-01 10:45 Rai1 3 36400
1 2017-07-01 10:56 10th 143 637000
2 2017-07-01 11:30 Eurosport 6 18200
3 2017-07-01 11:38 Rai1 57 72800
4 2017-07-01 11:43 E! 17 36400
I will analyse the rows corresponding to the highest Traffic and Views using pandas' loc indexer. Additionally, I would like to know the spread of my data over time, so I will also look at the unique month and year present in my data by using the dt accessor for Series:
campaign.loc[campaign['WebTraffic'].idxmax()]
Out[6]:
Date 2017-07-15 00:00:00
Time 21:27
Channel MBC
Traffic 648
Views 1328600
Name: 1019, dtype: object
campaign.loc[campaign['Views'].idxmax()]
Out[7]:
Date 2017-07-28 00:00:00
Time 18:37
Channel MBC
Traffic 486
Views 1638000
Name: 1801, dtype: object
#yearly and monthly spread of data
campaign.Date.dt.year.value_counts()
Out[8]:
2017 2112
Name: Date, dtype: int64
campaign.Date.dt.month.value_counts()
Out[9]:
7 2112
Name: Date, dtype: int64
I notice that web traffic has nothing to do with the channel; it is independent of this information but still has an associated value because of the way the data is stored. It simply refers to the traffic generated on the website at the same time the advertisement was run. Additionally, I learn that the data is spread over only one year, 2017, and only one month, 7, which corresponds to July. The value_counts() method aggregates this result for us. I want to learn more about how dates and times appear in the data set. One way to do this is to find out the unique values in each array. Subsequently, we can run the same command for Channels:
#gives list of unique values in column/array
campaign.Date.unique()
array(['2017-07-01T00:00:00.000000000', '2017-07-02T00:00:00.000000000',
'2017-07-03T00:00:00.000000000', '2017-07-04T00:00:00.000000000',
'2017-07-05T00:00:00.000000000', '2017-07-06T00:00:00.000000000',
'2017-07-07T00:00:00.000000000', '2017-07-08T00:00:00.000000000',
'2017-07-09T00:00:00.000000000', '2017-07-10T00:00:00.000000000',
'2017-07-11T00:00:00.000000000', '2017-07-12T00:00:00.000000000',
'2017-07-13T00:00:00.000000000', '2017-07-14T00:00:00.000000000',
'2017-07-15T00:00:00.000000000', '2017-07-16T00:00:00.000000000',
'2017-07-17T00:00:00.000000000', '2017-07-18T00:00:00.000000000',
'2017-07-19T00:00:00.000000000', '2017-07-20T00:00:00.000000000',
'2017-07-21T00:00:00.000000000', '2017-07-22T00:00:00.000000000',
'2017-07-23T00:00:00.000000000', '2017-07-24T00:00:00.000000000',
'2017-07-25T00:00:00.000000000', '2017-07-26T00:00:00.000000000',
'2017-07-27T00:00:00.000000000', '2017-07-28T00:00:00.000000000',
'2017-07-29T00:00:00.000000000', '2017-07-30T00:00:00.000000000',
'2017-07-31T00:00:00.000000000'], dtype='datetime64[ns]')
#gives count of unique values in column/array
campaign.Date.nunique()
Out[11]: 31
#gives count of unique values in column/array
campaign.Time.nunique()
Out[13]: 796
#gives count of unique values in column/array
campaign.Channel.nunique()
Out[16]: 20
We see that out of the 1,440 (24*60) unique times available in a day we have only 796 unique times in our data, and 20 unique channels on which our advertisement was run. Furthermore, I would like to know the number of times the advertisement was run on each date. I can use the groupby method to get this information. Instead of the nunique method I will use the count method, as I am interested in the count of all times (the number of times the campaign was run) rather than the number of unique times pertaining to each date. I can further drill down to find the times for a specific date:
campaign.groupby('Date')['Time'].count()
Out[17]:
2017-07-01 79
2017-07-02 52
2017-07-03 56
2017-07-04 64
2017-07-05 67
2017-07-06 8
2017-07-07 85
2017-07-08 120
2017-07-09 84
2017-07-10 102
2017-07-11 79
2017-07-12 87
2017-07-13 6
2017-07-14 64
2017-07-15 79
2017-07-16 66
2017-07-17 71
2017-07-18 52
2017-07-19 65
2017-07-20 7
2017-07-21 79
2017-07-22 66
2017-07-23 83
2017-07-24 79
2017-07-25 79
2017-07-26 90
2017-07-27 2
2017-07-28 63
2017-07-29 84
2017-07-30 104
2017-07-31 90
Name: Time, dtype: int64
campaign.loc[(campaign.Date == '2017-07-20'), ['Time']]
Out[18]:
Time
1286 01:40
1287 01:44
1288 02:04
1289 02:11
1290 02:44
1291 03:12
1292 03:21
Similarly, I can group by channels and see whether there is a visible pattern. Additionally, we can use other variables to find out how they are spread across channel and date. For instance, I have also added the aggregate of Views and Website Traffic generated for each channel, as well as the count of times the campaign was run on each channel. We will be using numpy to assist us with this syntax. Make sure to use sum rather than count when dealing with views and web traffic:
#number of times the advertisement was run on each channel
campaign.groupby(['Channel'])['Time'].count()
Out[19]:
Channel
10th 179
Animal 52
Comedy 295
Discovery Science 382
E! 20
Erediv 30
Erediv 2 1
Eurosport 57
MBC 159
Ned2 109
Ned3 6
Net5 124
RTL Lounge 20
RTL5 21
RTL8 16
Rai1 342
SBS 6 137
Sport1 16
TLC 132
Veronica 14
Name: Time, dtype: int64
#web traffic and views corresponding to each channel, with the number of times the advertisement was run, regardless of date and time
campaign.groupby(['Channel']).agg({'WebTraffic': np.sum, 'Views':np.sum, 'Time':np.count_nonzero})
Out[21]:
Views WebTraffic Time
Channel
10th 6721000 2254 179
Animal 2130700 855 52
Comedy 14944800 5600 295
Discovery Science 17330300 8385 382
E! 2277600 597 20
Erediv 2220400 921 30
Erediv 2 1300 6 1
Eurosport 5406700 1738 57
MBC 55091400 12843 159
Ned2 16689400 3079 109
Ned3 1146600 605 6
Net5 12669800 5380 124
RTL Lounge 855400 378 20
RTL5 3767400 2673 21
RTL8 618800 374 16
Rai1 48413300 16059 342
SBS 6 24334700 5814 137
Sport1 1019200 479 16
TLC 6406400 2137 132
Veronica 3130400 1488 14
We learn that the advertisement was run more on certain channels, such as Discovery Science, compared to others, such as Erediv, and that on some dates more channels were used to run the ad. This analysis would have been straightforward if we were only looking at one channel; however, we are considering 20 channels and the information for all these channels is unevenly spread out. In short, we have learnt that the advertisement was run at different times on each day and at different times and dates for each channel. For example, it was run 104 times on the 30th of July and 79 times on the 25th of July. We can safely assume that our data is incongruous, having no logical or discernible pattern.
We still have not figured out how to communicate everything that we have just discovered and, more importantly, how to interpret this data. To understand and communicate our data successfully we will make use of visualisations.
If you are well-versed in Python, you would intuitively want to see the development of views and web traffic over time by using a pandas time series. However, as we will see in a minute, that is not suitable in this case. Pandas has built-in time series functionality, which makes the job relatively easy for us, but if we properly understood our data in the previous section we know that these graphs will not communicate much about our data set. Below, I have created the time series for both Website Traffic and Channel Views and then plotted them together on separate axes, for both date and time.
TSViews = pd.Series(data=campaign['Views'].values, index=campaign['Date'])
TSTraffic = pd.Series(data=campaign['WebTraffic'].values, index=campaign['Date'])
TSViews.plot(color='blue', marker ='.', grid =True, label='Channel Views', legend=True, figsize=(16,4))
TSTraffic.plot(color='red', grid=True, marker='*', secondary_y=True, label='Web Traffic', legend =True, figsize=(16, 4))
We end up with two vibrant graphs that unfortunately do not communicate anything about our data efficiently and hinder us from making any meaningful inferences. We have far too many data points for a single date or time, which corresponds to the vertical lines on the graphs and makes it difficult to draw any conclusions. Additionally, we can see that the time graph is not chronologically correct: for every date there are several times at which the advertisement was run, so the plot ignores the date the advertisement was run on. We can fix this by converting the date from a datetime object to a string and concatenating it with the time; we can then plot the same graph taking both Date and Time into account. Usually, in most date- and time-based data sets these measures are enough to learn the development of a campaign over time, but in this case we need to tweak and process our data to make more discernible visualisations.
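As a quick sketch of that fix (using the column names from our data set), the date can be formatted as a string, joined with the time-of-day string and parsed back into a single datetime index:
#combine the date (as a string) with the time string into one datetime column
campaign['DateTime'] = pd.to_datetime(campaign['Date'].dt.strftime('%Y-%m-%d') + ' ' + campaign['Time'], format='%Y-%m-%d %H:%M')
TSViews = pd.Series(data=campaign['Views'].values, index=campaign['DateTime']).sort_index()
TSViews.plot(color='blue', marker='.', grid=True, label='Channel Views', legend=True, figsize=(16, 4))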
When we deal with a problem like this we make use of aggregation. If we aggregate our data we will be able to communicate our findings more efficiently with graphs. If you want to communicate which dates or times are better for your advertisement, you certainly cannot rely on the graphs above. A good place to start is plotting matrix graphs to see how channel views develop over the hours of the day. We will be using numpy to aggregate and calculate the mean, median, minimum and maximum views plotted against hourly bins. Of course, such measures always carry assumptions. For example, by plotting the graph in hourly bins we are assuming that it does not matter at which minute within the hour the advertisement was run; the results are approximately the same. We are also assuming that it does not matter which channel the advertisement was run on. In our initial analysis we make more assumptions than in our later analysis, because we allow this time to learn from the data. We will debunk some of these assumptions at a later stage, but for now we will stick to our initial assumptions.
#hour and time series calculated by datetime module
campaign['Aired_T'] = pd.to_datetime(campaign['Time'],format= "%H:%M" ).dt.time
campaign['Aired_H'] = pd.to_datetime(campaign['Time'], format='%H:%M').dt.hour
import matplotlib.pyplot as p
#subplots: a 2x2 matrix for the mean, median, minimum and maximum
fig, ((ax1, ax2), (ax3, ax4)) = p.subplots(2, 2, figsize=(15, 10))
#plotting bar chart for average views
Time_Traffic=campaign.groupby(['Aired_H']).agg({'Views': np.mean}).plot(colormap="coolwarm", fontsize=12, kind = 'bar', ax=ax1)
Time_Traffic.set_title("Hourly Average Channel Views\n", fontsize = 18)
Time_Traffic.set_ylabel('Average Views', fontsize=12)
Time_Traffic.set_xlabel('\nTime', fontsize=12)
Time_Traffic.xaxis.grid(True)
Time_Traffic.yaxis.grid(True)
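The remaining three panels of the matrix can be filled in along the same lines; here is a minimal sketch reusing the axes created above:
#median, minimum and maximum views per hourly bin on the remaining axes
for ax, func, label in [(ax2, np.median, 'Median'), (ax3, np.min, 'Minimum'), (ax4, np.max, 'Maximum')]:
    g = campaign.groupby(['Aired_H']).agg({'Views': func}).plot(colormap="coolwarm", fontsize=12, kind='bar', ax=ax, legend=False)
    g.set_title("Hourly " + label + " Channel Views\n", fontsize=18)
    g.set_ylabel(label + ' Views', fontsize=12)
    g.set_xlabel('\nTime', fontsize=12)
    g.xaxis.grid(True)
    g.yaxis.grid(True)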
We see that between 04:00 and 06:00 the advertisement was not run on any channels. Assuming our initial assumptions hold, we can further establish that we can garner more views for our campaign if we run the advertisement from 18:00 onward, irrespective of the channel. Most people will not analyse the maximum, minimum and median values; however, this is necessary if you are planning to build a machine learning model to automatically determine the best time to run the campaign. Is your strategy more aggressive, where you would like to get the maximum views for your campaign, or more conservative, where you would like to guarantee at least a minimum number of views? Where would you draw the line in accomplishing your strategy, and how will you account for the assumption of the hourly bins? Another mistake most people make is to base such hourly models on averages. In this case it is better to base the model on median values, as the median reflects our assumption more accurately and is not affected by outliers. Below you will see the box plot of views for 20 hourly bins.
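A sketch of how such a box plot can be produced directly from pandas, with one box per hourly bin:
#box plot of channel views for each hourly bin
fig, ax = p.subplots(figsize=(16, 6))
campaign.boxplot(column='Views', by='Aired_H', ax=ax)
ax.set_xlabel('\nHour of day')
ax.set_ylabel('Channel Views')
ax.set_title('Hourly Distribution of Channel Views\n')
p.suptitle('')   #suppress the automatic 'Boxplot grouped by ...' super title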
If you were asked which time was the best to run the advertisement to get the most views, the most common answer would be based on the average, maximum and minimum views; in this case, 19:00. There is absolutely nothing wrong with that answer; it is just biased. However, if you want to be more certain you would base your answer on the box plot; in this case, 23:00. The rule of thumb is that when you are grouping your data for convenience, it is always best to look at how your data is distributed on a box plot rather than at the average alone. Whenever the data follows a normal distribution, using the mean is a good choice; but if your data has outliers you will need to look at the median, because it is far more representative of your data.
To learn the development of website traffic we can plot the exact same graphs, but this time with Website Traffic against date and time, as sketched below.
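For example, the hourly average and hourly total web traffic can be produced with the same pattern (a sketch reusing the Aired_H column created above):
#hourly average and hourly total website traffic
fig, (ax1, ax2) = p.subplots(1, 2, figsize=(15, 5))
campaign.groupby(['Aired_H']).agg({'WebTraffic': np.mean}).plot(kind='bar', ax=ax1, legend=False, title='Hourly Average Web Traffic\n')
campaign.groupby(['Aired_H']).agg({'WebTraffic': np.sum}).plot(kind='bar', ax=ax2, legend=False, title='Hourly Total Web Traffic\n')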
We learn that we have zeroes for the minimum web traffic, and quite often such zeroes need to be treated. In this case, however, we know for certain that at times zero people visit the website, so the zeroes are of value to us and we will not change our data. How we interpret our traffic data is fundamentally different from how we treat channel views. For starters, we have no control over the channels and have to bid to run our ad campaign. On the other hand, we have complete control over the website, and by studying the times people visit it we can add promotional offers and set the prices for running external ads. The sum of hourly web traffic is also useful in this scenario; this data is of particular importance when we are determining CPC for our external providers. However, we do not have that information, so we will only try relating the traffic to the views at a later stage. Therefore, we will focus more on channel views for the time being.
We previously explored the data grouped by hours, and we can investigate further by looking at the data grouped by days. This becomes very important if you are creating separate predictive models for each day. As we are only dealing with 7 bars, I also added the corresponding y-value as a percentage above each bar. We first create the weekday series using the dt accessor, as we did before for hours. I have written the syntax for the average views grouped by weekday below:
campaign['weekday'] = campaign['Date'].dt.weekday_name
campaign['wt'] = campaign['Date'].dt.weekday
campaign = campaign.sort_values(['wt'])
#subplots
fig, ((ax1, ax2), (ax3, ax4)) = p.subplots(2, 2, figsize=(15, 10))
#graph for average views grouped by day names
Day_Traffic=campaign.groupby(['wt','weekday']).agg({'Views': np.mean}).plot(color="coral", fontsize=12, kind = 'bar', ax=ax1, legend=False)
Day_Traffic.set_title("Weekday Average Views\n", fontsize = 18)
Day_Traffic.set_ylabel('Average Views', fontsize=12)
Day_Traffic.set_xlabel('Weekday', fontsize=12)
totals = []
for i in Day_Traffic.patches:
    totals.append(i.get_height())
total = sum(totals)
for i in Day_Traffic.patches:
    Day_Traffic.text(i.get_x() - .03, i.get_height() + 1000,
                     str(round((i.get_height() / total) * 100, 2)) + '%', fontsize=10, color='black')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
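The median, minimum and maximum panels for the weekday grouping follow the same pattern as the hourly matrix; a brief sketch using the remaining axes:
#median, minimum and maximum views per weekday on the remaining axes
for ax, func, label in [(ax2, np.median, 'Median'), (ax3, np.min, 'Minimum'), (ax4, np.max, 'Maximum')]:
    g = campaign.groupby(['wt', 'weekday']).agg({'Views': func}).plot(color="coral", fontsize=12, kind='bar', ax=ax, legend=False)
    g.set_title("Weekday " + label + " Views\n", fontsize=18)
    g.set_ylabel(label + ' Views', fontsize=12)
    g.set_xlabel('Weekday', fontsize=12)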
When the data is grouped by weekdays we can observe that the median is almost the same for all days, although we do see some fluctuations in the averages. We can also observe that the range determined by the minimum and maximum is most extreme for Friday, Saturday and Sunday. This could be because of the times the advertisement was run on each day. We can validate this finding further by checking the number of times the advertisement was run on each day.
campaign.groupby('weekday')['Time'].count()
Out[37]:
weekday
Friday 291
Monday 398
Saturday 428
Sunday 389
Thursday 23
Tuesday 274
Wednesday 309
Name: Time, dtype: int64
The advertisement was run 428 times on Saturdays in July and only 23 times on Thursdays. These are not comparable, as we have far fewer data points for Thursday. Exploratory analysis allows us to learn about the data and helps us determine the logical pattern and treatment needed for later analysis. For example, a model built around Mondays, Saturdays and Sundays will be prone to fewer errors than one built around Thursdays.
We can drill down further by checking whether this has anything to do with how the days are spread out across the month.
#week number derived from the date
campaign['week'] = campaign['Date'].dt.week
campaign.groupby(['week','wt','weekday'])['Time'].count()
Out[38]:
week wt weekday
26 5 Saturday 79
6 Sunday 52
27 0 Monday 56
1 Tuesday 64
2 Wednesday 67
3 Thursday 8
4 Friday 85
5 Saturday 120
6 Sunday 84
28 0 Monday 102
1 Tuesday 79
2 Wednesday 87
3 Thursday 6
4 Friday 64
5 Saturday 79
6 Sunday 66
29 0 Monday 71
1 Tuesday 52
2 Wednesday 65
3 Thursday 7
4 Friday 79
5 Saturday 66
6 Sunday 83
30 0 Monday 79
1 Tuesday 79
2 Wednesday 90
3 Thursday 2
4 Friday 63
5 Saturday 84
6 Sunday 104
31 0 Monday 90
Name: Time, dtype: int64
We have determined how the data is spread over hours, weeks and days. Even though all this information is necessary for the analyst and the data scientist to study, we usually need to convey our story in one graph. Grouping and unstacking is a great way to take multiple categories and plot them on one graph. Pandas' unstack pivots a level of the (necessarily hierarchical) index labels, returning a DataFrame with a new level of column labels whose innermost level consists of the pivoted index labels. In simpler words, it takes the categories present in a column and does its magic by pivoting them into column labels. None of this magic is really visible, but if you are learning Python it is important to know how it works. I have written the code below, which unstacks the day names and then plots the result.
fig, ax = p.subplots(figsize=(20, 10), squeeze=True)
campaign = campaign.sort_values(['wt'])
campaign.groupby(['Aired_H', 'weekday']).mean()['Views'].unstack().plot(ax=ax, kind='line', marker='o', ls=':')
fig.set_tight_layout(False)
ax.set_title('Daily Average Views by Hour\n')
p.tick_params(axis='both', which='major', labelsize=22)
p.legend(fontsize=18)
import matplotlib.ticker as mticker
loc = mticker.MultipleLocator(base=1.0)
ax.xaxis.set_major_locator(loc)
p.grid(which='major', axis='both')
Let me recap what we have accomplished so far. First we studied the times irrespective of the dates and days the advertisement was run on and determined the better times for running it. Then we created pandas series for the week numbers and weekday names. We leveraged this information to determine patterns in how views change for each day at hourly intervals. The graphs above depict the patterns of average and median channel views and how they develop over time. These graphs are commonly used in practice, but are usually interactive in nature. I will do a tutorial on how to create interactive graphs in Python and embed them on external sites, but for now we will focus on simple visualisations.
We will now delve into the information we have regarding specific channels. I will be combining the channels and time at a later stage. We want to determine which channels are worth investigating, meaning the ones that garner the most views regardless of other parameters. I will be discarding the channels that do not create any value for our campaign.
We will use the same code as we did previously to plot views against hours, replacing the hours with channels, to create a matrix of the maximum, minimum, median and average, as sketched below.
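A minimal sketch of that matrix, grouping on Channel instead of the hourly bin; titles and styling can be tuned as before:
#2x2 matrix of average, median, minimum and maximum views per channel
fig, axes = p.subplots(2, 2, figsize=(18, 12))
for ax, func, label in zip(axes.ravel(), [np.mean, np.median, np.min, np.max], ['Average', 'Median', 'Minimum', 'Maximum']):
    campaign.groupby(['Channel']).agg({'Views': func}).plot(kind='bar', ax=ax, legend=False)
    ax.set_title(label + " Views per Channel\n", fontsize=14)
    ax.set_xlabel('')
fig.tight_layout()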
We can observe that the median views for some channels are below 25,000, and we can further establish that the campaign was not run on all channels with the same frequency. For better inferences it is better to divide the data into a low tier and a high tier; this does not mean deleting data. We can investigate the channels further by adding the number of times the campaign was run on the secondary axis of our median graph, shown below. We immediately see some discrepancies. Even though the campaign was run more than 300 times on Comedy and Discovery Science, the median, or central tendency, of the views remains low; the minimum, maximum and average are also very low for these channels compared to the others, as shown in the graphs above. Conversely, we can observe that Ned3, Ned2, RTL5, MBC, SBS 6 and Veronica garner more median views relative to the number of times the campaign was run on them.
Slicing in pandas is simple; however, we want to slice the median views pivoted on the channels. The easiest way to do this is to create another dataframe and select the data we want there. I have shown below how easily this can be accomplished:
n = campaign.set_index(['Channel'])
n = n.loc[n.index.isin(['MBC', 'SBS 6', 'RTL5', 'Ned2', 'Ned3', 'Veronica'])]
n.reset_index(level=0, inplace=True)
And, as I showed before with weekdays, we can unstack the data for channels, as sketched below. We can observe that not all time slots were used to run the campaign on each channel. We can see patterns in how time influences the views for specific channels and can adjust how we want to bid on the time slots.
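A sketch of that unstacking, restricted to the high-tier channels selected above in n:
#average views by hour, one line per high-tier channel
fig, ax = p.subplots(figsize=(20, 10))
n.groupby(['Aired_H', 'Channel'])['Views'].mean().unstack().plot(ax=ax, kind='line', marker='o', ls=':')
ax.set_title('Hourly Average Views per Channel\n')
p.grid(which='major', axis='both')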
Visualisations are also great tools for exploring the possible relationships between two parameters. We will look at the relationship between channel views and website traffic. Many people argue that comparing these two parameters is intuitively incorrect. However, much has changed in the way consumers absorb information today; according to a report from Accenture using data from 2015, 87% of consumers use a second-screen device while watching TV. People see something advertised that piques their interest and then use a smartphone, tablet or desktop to go online and learn more, often by visiting the advertiser's website.
Below, I have plotted a simple scatter plot to study our variables, alongside a more enhanced graph in which I randomly manipulated the size of the markers. Even though this is not a recommended practice when communicating more detailed and comprehensive data, it is a great way to hold the attention of the user. A typical user will spend more time studying the graph on the right and will dismiss the graph on the left within seconds. Therefore, it is important to know who your audience is. For example, the graph on the right would be perfect for a website, while the graph on the left is more suitable for reports.
#simple
p.figure(figsize=(16, 8), dpi=80)
p.scatter(campaign.Views, campaign.WebTraffic)
p.xlabel("\nChannel Views")
p.ylabel("Website Traffic\n")
p.title('Website Traffic vs Views\n')
p.grid(which='major', axis='both')
#enhanced
p.figure(figsize=(13, 8))
N = 2112
#randomly generated point radii and colours
area = (5 * np.random.rand(N))**3
colors = np.random.rand(N)
p.scatter(campaign.Views, campaign.WebTraffic, s=area, c=colors, alpha=0.5)
p.xlabel("\nChannel Views")
p.ylabel("Website Traffic\n")
p.title('Website Traffic vs Views\n')
p.grid(which='major', axis='both')
p.tight_layout()
Additionally, we can plot a straight regression line superimposed on the scatter graph. The simplest way to do this is to use seaborn's lmplot; however, you can also use the straight-line equation and plot it with matplotlib. Below you will see how I have combined both matplotlib and seaborn to get the graph on the left.
import seaborn as sns
sns.set_style('whitegrid')
sns.despine(offset=10, trim=True)
g = sns.lmplot(x='Views',y='WebTraffic',data=campaign, fit_reg=True)
g.set_xticklabels(rotation=30)
p.xlabel("\nChannel Views")
p.ylabel("Website Traffic\n")
p.title('Website Traffic vs Views\n')
This is a neat way of plotting the regression line with confidence intervals, in just a few lines of code.
Another interesting way to plot our data is by using a colourmap (cmap). Previously, I randomly generated the size of each point; on the colourmap graph the sizes are determined by the cost of running the campaign: the bigger the marker, the more the campaign cost. Additionally, I added a colorbar, which allows us to encode another parameter; in this case I have added the time in hours, so the colour indicates at which time the campaign was run. The results look sporadic, but they give interesting insights into how our data relates across all these different features.
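A sketch of how such a plot can be built. Note that the data set used in this project has no cost column, so the cost values below are hypothetical placeholders used only to scale the marker sizes:
#colour-mapped scatter: colour = hour aired, size = (hypothetical) campaign cost
fig, ax = p.subplots(figsize=(13, 8))
cost = np.random.rand(len(campaign)) * 100          #placeholder for a real cost column
sc = ax.scatter(campaign.Views, campaign.WebTraffic, s=cost, c=campaign.Aired_H, cmap='viridis', alpha=0.5)
cbar = fig.colorbar(sc, ax=ax)
cbar.set_label('Hour aired')
ax.set_xlabel("\nChannel Views")
ax.set_ylabel("Website Traffic\n")
ax.set_title('Website Traffic vs Views, coloured by hour aired\n')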
The aim of this article was to provide a variety of simple ways to use visualisations with pandas and Python. It is important to know how to communicate your data, as that is one of the most important aspects of management. Once we understand our data we can apply machine learning methodologies for forecasting. In another project I will use a textbook example to elaborate on a complete machine learning project and how it should be used in practice.
Reach out if you have any questions!