Time Series Analysis 

Analyzing Regional Temporal Data for Weather Changes in R

Using R for Exploratory Analysis

Author: Yusra Farooqui    

23/12/2021

Introduction

With the uproar from climate activists and scientists over the last decade, and prominent figures such as Greta Thunberg making headlines every week, it is hard to ignore the unprecedented challenge the world currently faces: climate change. Scholarly articles in the field show compelling evidence that global temperatures are rising. According to NASA, global sea levels have risen more than 80 millimeters (3.15 inches) over the past three decades. NASA has also conducted research on ice sheets and found that the Greenland ice sheet's mass has declined rapidly in recent years due to surface melting and iceberg calving. Research based on satellite data indicates that between 2002 and 2020, Greenland shed an average of 279 billion metric tons of ice per year, adding to global sea level rise. There is a plethora of similar research by NASA and other notable research institutes, all pointing to worrying outcomes for the future of the planet. Yet the population as a whole is not convinced of the harm our rapid consumption of Earth's resources is causing. This research therefore focuses on the regional implications of climate change, to investigate how it is affecting our society in a more tangible fashion.

For this research, a data set from the Weather Underground API is used to study weather changes in the city of Delhi, India. The article investigates temperature, humidity, pressure and wind speed recorded in Delhi over roughly four and a half years, and looks at the tangible impact of climate change, for example warmer average temperatures, which directly cause discomfort to the average person. Most available data on climate change is global, whereas the effects are felt more at regional levels. It is on this premise that this study investigates regional evidence of climate change, using the city of Delhi as a case study. We are interested in understanding how mean temperature, pressure, humidity and wind speed have changed in Delhi and whether they are associated with each other. Additionally, we check whether this data is suitable for making future weather forecasts for Delhi.

The article centres on using R for graphical exploration of the data. Programming and statistical systems, even though fully capable of assisting humans in decision making and producing predictions or recommendations, often do a poor job of conveying a story or supporting inference. Human decision makers therefore rely on visualizations to draw inferences about data before exploiting it further with machine/deep learning and statistical tools. Kathleen McKeown, the director of the Data Science Institute at Columbia University, states that visualizations are important for exploring data and developing intuitions about how to go about solving a problem. In some fields, such as clinical research, data interpretation relies heavily on visualization, usually as a preliminary step before tackling mathematical models and algorithms. McKeown emphasizes that, as we are visual beings, scientists should be able to visualize their data correctly. The exploratory analysis here is intended as groundwork for further studies and mathematical modelling of climate data. The article therefore narrates how the data behaves across the different parameters through various graphical representations. Furthermore, as the case study revolves around time-series data, it would be incomplete without an analysis of autocorrelation, seasonality and stationarity, and an investigation of whether time alone can be used to predict these weather metrics. Fortunately, all three properties can be studied graphically.

Data Source

The data is collected from the Weather Underground API. The data set provides daily weather data from 1 January 2013 to 24 April 2017 for the city of Delhi, India. The four parameters provided are: mean temperature (meantemp) in degrees Celsius, humidity (humidity) in percent, wind speed (wind_speed) in metres per second and mean atmospheric pressure (meanpressure) in millibars. The original data contains two sets, one for training a model and one for testing it. The data set was developed as part of the Data Analytics Course, 2019, at PES University, Bangalore, and its original purpose is to investigate climate change regionally.

The data can be downloaded via the following link:

https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data?select=DailyDelhiClimateTrain.csv

Data Cleaning

There are 1,576 observations in the data set. First, the data is checked for missing values and incomprehensible values, such as a character entry in a numeric column. These discrepancies are checked through summary statistics and the data types of the columns. The data set is consistent with respect to data types and contains no null values.
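
These checks come down to a handful of base R calls. A minimal sketch follows; since the study works with all 1,576 observations, it assumes the train and test CSVs from the Kaggle page have been downloaded into the working directory and combined (the test file name is an assumption):

# load and combine the two Kaggle files (the test file name is assumed)
train <- read.csv("DailyDelhiClimateTrain.csv", stringsAsFactors = FALSE)
test  <- read.csv("DailyDelhiClimateTest.csv",  stringsAsFactors = FALSE)
df <- rbind(train, test)

# column data types and summary statistics
str(df)
summary(df)

# missing values per column
colSums(is.na(df))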

Second, checks are made on the logical validity of the data points. For example, the study checks for unique dates: as the metrics are daily averages of temperature, pressure, humidity and wind speed, no date should appear twice. The metrics for 2017-01-01 appear twice, and the first of the two rows records extreme values for the day, for example a wind speed of zero, the minimum of that metric, and a humidity of 100, its maximum. These values were compared against the summary statistics in R. It is therefore safe to assume that data point 1462 is flawed and can be removed from the study.

Additionally, checks are made on the plausibility of each value, to ensure no flawed entries remain in the data set. According to Guinness, the highest barometric pressure ever recorded was 1083.8 mb (32 in) at Agata, Siberia, Russia (alt. 262 m or 862 ft) on 31 December 1968, so the data set is checked for any mean pressure exceeding this value. Three observations have a pressure greater than 1083.8 mb, which are clearly data entry errors; these three points are dropped, as there is no way to corroborate whether the rest of those entries are accurate either. Similarly, according to Guinness the lowest pressure ever recorded is 870 mb, and five values recorded below 870 mb are removed from the data set. Other unusual values include a wind speed of 0 m/s, which is physically acceptable and is therefore kept. The humidity levels and mean temperatures are also within acceptable ranges. The study is now limited to 1,567 observations.

library(dplyr)

# find duplicated values in date
df[duplicated(df$date), ]

x <- df %>%
  group_by(date) %>%
  count(date)
filter(x, n > 1)

# inspect the two rows recorded for 2017-01-01
with(df, df[(date == "2017-01-01"), ])

# remove the duplicated (flawed) row and confirm only one row remains
df <- df[-c(1462), ]
with(df, df[(date == "2017-01-01"), ])

# find abnormal pressure values
df[(df$meanpressure >= 1083) | (df$meanpressure <= 870), ]

# drop abnormal values and confirm none remain
df <- df[!(df$meanpressure >= 1083 | df$meanpressure <= 870), ]
df[(df$meanpressure >= 1083) | (df$meanpressure <= 870), ]

Monthly and Yearly Weather Data Analysis

To work with the time series, the date column is used to create month and year columns, so that we can discern how monthly temperature changes over time. These changes can be visualized clearly in bar charts and scatter plots.

# creating month and year columns from the date
class(df$date)

# converting the character column to a Date object
# (as.Date is used rather than strptime, whose POSIXlt output does not mix well with dplyr)
df[['date']] <- as.Date(df[['date']], format = "%Y-%m-%d")

# creating the month and converting it to numeric
df$month <- format(df$date, "%m")
class(df$month)
df$month <- as.numeric(as.character(df$month))

# creating the month name and counting observations per month
df$monthname <- month.name[df$month]
z <- df %>%
  group_by(monthname) %>%
  count(monthname)

# ordering the month names chronologically
df$monthname <- factor(df$monthname, levels = month.name)

# extracting a numeric year from the date
df$year <- format(df$date, "%Y")
df$year <- as.numeric(as.character(df$year))

Figure 1 shows how temperature changes from year to year in Delhi and sets the expectations for the rest of the analysis. The code used to obtain Figure 1 is shown in the snippet below.

# exploratory analysis
library(ggplot2)

ggplot(df, aes(x = year, y = meantemp)) +
  geom_jitter(alpha = 1/10) +
  stat_summary(geom = "line", fun = "mean", color = "orange", size = 1) +
  stat_summary(geom = "line", fun = "quantile", fun.args = list(probs = .9),
               linetype = 2, color = "red") +
  labs(y = "Mean Temperature", x = "Year") +
  ggtitle("Yearly Mean Temperature Fluctuations")

Figure 1 provides some interesting insights and trends. It can immediately be observed that the mean temperature drops steeply for 2017, but this is mainly because the data points for 2017 only run up to April. The graph also reveals that up to 2016 the average temperature rises gradually, with the highest mean yearly temperature recorded in 2016. The orange line is the mean temperature for each year and the red dashed line is the 90th percentile of temperature for each year. A sharper picture emerges from the mean temperature recorded for each month, shown in Figure 2.

Replacing year with the month-name column in the snippet above, with slight changes (one possible variant is sketched below), yields Figures 2 and 3. These show that the most extreme temperatures were observed in January, July and December, evidence that temperature is changing regionally. The highest temperatures are observed in June and the lowest in January. Similar yearly plots can be produced for pressure, humidity and wind speed, as shown in Figure 4.
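
The exact tweaks behind Figures 2 and 3 are not shown; one possible variant of the snippet, swapping the year axis for the ordered month factor, is:

# monthly version of the Figure 1 plot (one possible form of Figure 2);
# group = 1 lets the summary lines connect across the factor levels
ggplot(df, aes(x = monthname, y = meantemp, group = 1)) +
  geom_jitter(alpha = 1/10) +
  stat_summary(geom = "line", fun = "mean", color = "orange", size = 1) +
  stat_summary(geom = "line", fun = "quantile", fun.args = list(probs = .9),
               linetype = 2, color = "red") +
  labs(y = "Mean Temperature", x = "Month") +
  ggtitle("Monthly Mean Temperature Fluctuations")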

Figure 1

Figure 2

Figure 3

Additionally, the other metrics can be explored in a similar fashion (a sketch follows below). Figure 4 shows that all three metrics gradually increase with time, while mean pressure has two perplexing outliers, which can be assumed to be either incorrect data or stormy weather. The latter is plausible given that the 2016 North Indian Ocean cyclone season was the deadliest since 2010, killing more than 400 people. According to National Geographic, when a low-pressure system moves into an area it usually brings cloudiness, wind and precipitation, which is one possible explanation for the outliers in the 2016 Delhi pressure and wind data.
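
The original code behind Figure 4 is not shown; a minimal sketch of one way to plot the remaining metrics by year, assuming the tidyr package for reshaping:

library(tidyr)

# reshape humidity, wind speed and pressure into long format and facet by metric
df_long <- pivot_longer(df, cols = c(humidity, wind_speed, meanpressure),
                        names_to = "metric", values_to = "value")

ggplot(df_long, aes(x = year, y = value)) +
  geom_jitter(alpha = 1/10) +
  stat_summary(geom = "line", fun = "mean", color = "orange", size = 1) +
  facet_wrap(~ metric, scales = "free_y", ncol = 1) +
  labs(x = "Year", y = NULL) +
  ggtitle("Yearly Humidity, Wind Speed and Pressure Fluctuations")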

Figure 4

Correlation Analysis of Metrics

Next, we investigate whether there is a relationship between the metrics. For this I have used the corrplot package to see visually how the metrics correlate. Mean temperature and mean pressure appear negatively correlated, indicating that an increase in temperature is associated with a decrease in pressure. The correlation matrix is plotted in Figure 5. Pairwise correlations are admittedly a crude way to assess relationships when there are more than two variables, and eigenvalues or variance inflation factors (VIFs) should be used instead, but they are still acceptable for exploratory analysis. Additionally, the scatter-plot matrix in Figure 6 visualizes how all the metrics relate to the date and to each other.

library(corrplot)

# keep only the four numeric weather metrics for the correlation matrix
# (the date, month and factor columns cannot go into cor())
df3 <- df[, c("meantemp", "humidity", "wind_speed", "meanpressure")]

M <- cor(df3)

# plot the correlation matrix as colours and as numbers side by side
x11(width = 10, height = 5)
par(mfrow = c(1, 2))
corrplot(M, method = "color")
box("figure", lty = "solid", col = "black")
corrplot(M, method = "number")
box("figure", lty = "solid", col = "black")
dev.off()
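
As noted above, pairwise correlations are a blunt instrument when more than two variables interact; a minimal sketch of a VIF check, assuming the car package is available (the regression below is only a vehicle for computing the VIFs):

library(car)

# variance inflation factors among the weather metrics;
# meantemp is used as the response purely to define a model for vif()
vif_model <- lm(meantemp ~ humidity + wind_speed + meanpressure, data = df3)
vif(vif_model)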

Figure 5

Figure 6
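
The code behind Figure 6 is not shown; a base R scatter-plot matrix over the date and the four metrics is one way to produce something similar:

# scatter-plot matrix of date and the four weather metrics (one way to get Figure 6)
pairs(df[, c("date", "meantemp", "humidity", "wind_speed", "meanpressure")],
      pch = 20, col = rgb(0, 0, 0, 0.2))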

We can further investigate the relationship between pressure and temperature by plotting them together with year and month in the 3D plots of Figures 7 and 8. For every year there is a broadly linear association between temperature and pressure, so pressure can plausibly be used to predict temperature. This also suggests that the overall state of the atmosphere shifts with even slight changes in temperature. The monthly graph presents interesting findings as well, showing how temperature and pressure change across the months, with a curved trend for both values.

library(plotly)

# 3D scatter plot of year, mean pressure and mean temperature
plot_ly(df, x = ~year, y = ~meanpressure, z = ~meantemp,
        colors = c('#FFE1A1', '#683531'), size = I(3)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Year'),
                      yaxis = list(title = 'Mean Pressure'),
                      zaxis = list(title = 'Mean Temperature')),
         title = "3D Scatter plot: Year vs Pressure vs Temperature",
         showlegend = FALSE)

# 3D scatter plot of month, mean pressure and mean temperature
plot_ly(df, x = ~monthname, y = ~meanpressure, z = ~meantemp,
        colors = c('#FFE1A1', '#683531'), size = I(3)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Month'),
                      yaxis = list(title = 'Mean Pressure'),
                      zaxis = list(title = 'Mean Temperature')),
         title = "3D Scatter plot: Month vs Pressure vs Temperature",
         showlegend = FALSE)

Figure 7

Figure 8
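
The roughly linear association seen in Figures 7 and 8 can be quantified with a simple linear model; a minimal sketch:

# simple linear regression of mean temperature on mean pressure
fit <- lm(meantemp ~ meanpressure, data = df)
summary(fit)

# the sign of the slope should match the negative correlation seen in Figure 5
coef(fit)["meanpressure"]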

Autocorrelation, Seasonality and Stationarity

Autocorrelation is a mathematical representation of the degree of similarity between a time series and a lagged version of itself over successive time intervals. It is conceptually similar to the correlation between two different time series, except that autocorrelation uses the same series twice: once in its original form and once lagged by one or more periods. Time series analysis is interesting because with most models the goal is to predict a value from a set of other predictor variables, no matter how the observations are ordered, and we usually assume explicitly that there is no autocorrelation, i.e. that the sequence of observations does not matter. With time series we instead assume that previous observations help predict future ones, which is what makes the analysis unique. Our second objective is to understand whether the data set is sufficient to make future predictions as a time series, and autocorrelation is an important part of that: it tells us how each observation relates to its recent past, and when autocorrelation is high, future observations become easier to predict.
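
To make the idea concrete, the lag-1 autocorrelation is just the correlation between the series and a copy of itself shifted by one day; a minimal illustration:

# lag-1 autocorrelation of mean temperature, computed "by hand"
x <- df$meantemp
cor(x[-1], x[-length(x)])

# acf() reports essentially the same quantity (it uses a slightly different
# normalisation); lag 0 is always 1, so the lag-1 value is the second element
acf(x, plot = FALSE)$acf[2]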

# autocorrelation check for mean temperature
library(zoo)

dfa <- df[, c("date", "meantemp")]
class(dfa$date)

# acf()/pacf() need a numeric series, so the date column is excluded
acf(dfa$meantemp)
pacf(dfa$meantemp)

# fit an autoregressive model and plot a 24-step-ahead forecast
dfa.ar <- ar(dfa$meantemp)
ts.plot(dfa$meantemp, predict(dfa.ar, n.ahead = 24)$pred, lty = c(1:2))

# time series of all metrics for the seasonality check:
# dfold holds only the four weather metrics
drops <- c("date", "month", "monthname", "year")
dfold <- df[, !(names(df) %in% drops)]

dcp <- zoo(as.matrix(dfold), df$date)
head(dcp)

par(mar = c(1, 1, 1, 1))

# all four metrics overlaid on one panel
plot(dcp, screens = 1)

xlab <- "Date"
ylab <- c("Temp", "Humidity", "Wind Speed", "Pressure")
main <- "Average Weather Metrics Fluctuations by Year"
lty  <- c("solid", "dashed", "dotted", "dotdash")

# the four time series in four separate panels
plot(dcp, screens = c(1, 2, 3, 4), lty = lty, main = main, xlab = xlab, ylab = ylab)
dev.off()

First, I investigated the autocorrelation and partial autocorrelation of temperature, as shown in Figure 9. Each spike that rises above or falls below the dashed lines is considered statistically significant: if a spike is significantly different from zero there is evidence of autocorrelation, while a spike close to zero is evidence against it. For temperature, as Figure 9 shows, the spikes are not statistically significant for most lags, indicating that when temperature rises it does not simply continue that trend. The panel at the top right of Figure 9 illustrates this.

Figure 9


This indicates seasonality in the data set. A seasonal plot is similar to a time plot except that the data are plotted against the individual seasons, in this case months. Below, I have plotted all the metrics to visualize seasonality. Figure 10 shows that humidity and temperature depend strongly on the season, whereas pressure and wind speed behave more sporadically. A series with cyclic behaviour but no trend, such as wind speed and pressure here, is stationary, and in general a stationary time series has no predictable pattern in the long term. We can therefore conclude that the temperature and humidity series can be used for forecasting, but the pressure and wind series cannot.
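
The visual impression of stationarity for pressure and wind speed can be backed by a formal unit-root test; a minimal sketch, assuming the tseries package:

library(tseries)

# augmented Dickey-Fuller test: the null hypothesis is non-stationarity,
# so small p-values favour a stationary series
adf.test(df$meanpressure)
adf.test(df$wind_speed)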

Time-Series Calendar Heat-map

I have created a heat map to visually discern the temperature on every day in the data set. For this particular heat map I had to write a function that computes the week of the month and to create weekday variables (a sketch of both follows below). The heat map in Figure 11 shows changes in temperature that would not have been possible to observe otherwise. Consider 2016: we can see that March, April and May were hotter in 2016 than in the previous years. We can also see that January was milder in both 2016 and 2017, and that December was milder in 2016. This is further evidence that, regionally, Delhi has become warmer over successive years.
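
The week-of-month helper and the weekday factor are not shown in the original snippet; the reconstruction below is one possible implementation (nth_week here simply numbers 7-day blocks within the month, and the weekdays() abbreviations assume an English locale):

# week of the month: days 1-7 are week 1, days 8-14 are week 2, and so on
nth_week <- function(dates) {
  (as.integer(format(dates, "%d")) - 1L) %/% 7L + 1L
}

# ordered weekday factor for the y-axis of the heat map (English locale assumed)
df$weekdayf <- factor(weekdays(df$date, abbreviate = TRUE),
                      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))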

# week-of-month column and calendar heat map of daily mean temperature
df$wk = nth_week(df$date)

ggplot(df, aes(wk, weekdayf, fill = meantemp)) +
  geom_tile(colour = "white") +
  facet_grid(year ~ monthname) +
  scale_fill_gradient(low = "green", high = "red") +
  labs(x = "Week of Month",
       y = "",
       title = "Time-Series Calendar Heatmap",
       subtitle = "Delhi Daily Average Temperatures",
       fill = "Temp")

Summary

Our analysis gives compelling evidence that the weather is changing regionally as a result of climate change. It shows that the weather became considerably hotter in successive years: 2016 and 2017 showed the highest peak temperatures, while 2013 showed the coldest temperatures in the cold season. Furthermore, it was established that pressure and temperature are significantly correlated. Seasonality was established for temperature and humidity, while pressure and wind speed behaved as stationary time series, which makes logical sense, as atmospheric pressure does not vary greatly over time.

It was interesting to corroborate the data with real events; for example, we were able to link anomalous pressure observations to the 2016 North Indian Ocean cyclone season. Another interesting finding was how the temperature is changing day by day, with notable changes visible in the heat map over successive years. This was also visible in the initial scatter plot of temperature against date, but that plot did not do visual justice to the temperature increase. It is also interesting that temperature, which varies greatly, is negatively correlated with pressure, which was mostly stationary: a large change in temperature does change the pressure, which is what makes the variables correlated. The research also established how these variables can be used for forecasting weather, either as time series or through the metrics themselves.

Even though the research does not have a direct consumer, it serves as an example that temperatures are rising and could have a detrimental impact on people. For example, higher temperatures mean more use of air conditioning, which in turn contributes to global warming. Higher temperatures and heat waves are also a cause of heat stroke, which is detrimental to health and in severe cases takes lives [6]. Heat waves are also responsible for melting roads, and a warm phase of surface temperatures is likely to lead to an above-average number of typhoons and super-typhoons, AccuWeather reports. Therefore, people as a whole should make an effort to consume less, reduce their waste and be mindful of the products they use and their electricity consumption; this may not reverse global warming, but it could slow its effects.