Machine Learning with Scikit-Learn 

Author: Yusra Farooqui

23/02/2019  

Introduction

The project aims to cover simple machine learning techniques to predict California housing prices. It requires a good working knowledge of Python 3 and a sound understanding of regression, decision trees and random forests. For the most part the script relies on the standard scientific Python tools (NumPy/Matplotlib/pandas/scikit-learn). In addition to these standard libraries you will require a geohash library and Yellowbrick, both of which are introduced later in the project.

The project covers exploratory analysis, generating training and test sets, data pre-processing with a scikit-learn transformation pipeline, performance measures, linear, Lasso and Ridge regression, random forests, and fine-tuning with GridSearchCV.

Most data scientists, especially those who are just getting into data science, are very eager to find the 'correct' solution with machine learning techniques. Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from data without being explicitly programmed. It is an iterative process with no one-step solution: we can use applicable ML methodologies but cannot prove with certainty that any of them is the correct methodology. As George Edward Pelham Box, a renowned statistician, said, 'All models are wrong; some models are useful'. This quote from the book 'Statistics for Experimenters' is one of the most quoted lines in data science. He further states that, 'Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so over-elaboration and over-parameterization is often the mark of mediocrity.' The latter quote is taken from the Journal of the American Statistical Association, Vol. 71, No. 356, page 792.

The project aims to uncover which model is useful and what determines its usefulness.   

The project is written in the Spyder IDE with Python 3.5.5. Any questions regarding this project can be forwarded using the contact form.

PS: I am transitioning to a more organised style of writing my Python code. I am currently following PEP 8, a style guide that aims to improve the readability of code.

Exploratory Analysis

I will be using the California Housing Prices data-set from the StatLib repository. It serves as an excellent introduction to implementing machine learning algorithms and is a classic textbook example for non-data-science professionals to follow.
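As a minimal sketch, and assuming the data has been downloaded from the repository as a local CSV file named housing.csv (the path is illustrative), the data-set can be loaded with pandas:

#load the data (the file name is an assumption)
import pandas as pd

housing = pd.read_csv('housing.csv')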

The first step is to check what our data entails. This can be accomplished with pandas' describe method; a very simple way of gathering descriptive statistics. I have added the include='all' parameter as I am interested in spotting missing values and discrepancies early on.

housing.describe(include='all')


           longitude      latitude  housing_median_age   total_rooms  \
count   20640.000000  20640.000000        20640.000000  20640.000000
unique           NaN           NaN                 NaN           NaN
top              NaN           NaN                 NaN           NaN
freq             NaN           NaN                 NaN           NaN
mean     -119.569704     35.631861           28.639486   2635.763081
std         2.003532      2.135952           12.585558   2181.615252
min      -124.350000     32.540000            1.000000      2.000000
25%      -121.800000     33.930000           18.000000   1447.750000
50%      -118.490000     34.260000           29.000000   2127.000000
75%      -118.010000     37.710000           37.000000   3148.000000
max      -114.310000     41.950000           52.000000  39320.000000

        total_bedrooms    population    households  median_income  \
count     20433.000000  20640.000000  20640.000000   20640.000000
unique             NaN           NaN           NaN            NaN
top                NaN           NaN           NaN            NaN
freq               NaN           NaN           NaN            NaN
mean        537.870553   1425.476744    499.539680       3.870671
std         421.385070   1132.462122    382.329753       1.899822
min           1.000000      3.000000      1.000000       0.499900
25%                NaN    787.000000    280.000000       2.563400
50%                NaN   1166.000000    409.000000       3.534800
75%                NaN   1725.000000    605.000000       4.743250
max        6445.000000  35682.000000   6082.000000      15.000100

        median_house_value ocean_proximity
count         20640.000000           20640
unique                 NaN               5
top                    NaN       <1H OCEAN
freq                   NaN            9136
mean         206855.816909             NaN
std          115395.615874             NaN
min           14999.000000             NaN
25%          119600.000000             NaN
50%          179700.000000             NaN
75%          264725.000000             NaN
max          500001.000000             NaN



housing.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')



housing.head(5) 

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY

There are 9 features, or predictors, in the data-set, along with the target attribute median_house_value. The descriptive analysis shows that the median_income variable has been scaled and capped at 15. It also reveals that total_bedrooms has missing values (20,433 non-null entries against 20,640 districts).

Below I have plotted histograms for each numerical attribute. The histograms reveal that median house value and median age are capped. Interestingly, none of the features follows a normal distribution. A skewed predictor will not necessarily produce a skewed distribution for the response, but it is common for a skewed predictor not to have a linear relationship with the response. We could apply a log transformation here; however, most linear models make no underlying assumption that the independent variables are normally distributed. I will explain more of that in later sections.

housing.hist(bins=50, figsize=(20, 15))

The descriptive statistics revealed that there are missing values for total_bedrooms. I want to explore how this variable is correlated to the other features representing room data, to consider different possibilities for handling missing values during data preparation. To do so we make use of the correlation matrix in pandas. The correlation matrix shows high correlation of this feature with population, households and total_rooms. The pandas correlation matrix uses Pearson correlation by default, but this can be changed through the 'method' parameter; both Kendall's tau and Spearman correlation are available.

corr_matrix = housing.corr()

corr_matrix['total_bedrooms']

 

longitude             0.069608
latitude             -0.066983
housing_median_age   -0.320451
total_rooms           0.930380
total_bedrooms        1.000000
population            0.877747
households            0.979728
median_income        -0.007723
median_house_value    0.049686
Name: total_bedrooms, dtype: float64
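As mentioned above, the method parameter switches to a rank-based correlation. A minimal sketch (in recent pandas versions you may also need numeric_only=True to exclude the categorical column):

#Spearman rank correlation instead of the default Pearson
housing.corr(method='spearman')['total_bedrooms']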

 

#scatter plots of total_bedrooms against households and total_rooms
import matplotlib.pyplot as p  # pyplot is imported as p throughout this project

fig, (ax1, ax2) = p.subplots(1, 2, sharey=True)
housing.plot(kind='scatter', x='households', y='total_bedrooms', ax=ax1)
housing.plot(kind='scatter', x='total_rooms', y='total_bedrooms', ax=ax2)

The feature is positively and linearly correlated with total_rooms and households. Pairwise correlations, however, are an inaccurate way of detecting collinearity when multiple variables are involved; they are only effective when just two features are present. Further analysis with eigenvalues and variance inflation factors (VIF) is necessary in this case. Dropping and changing features is known as feature selection. Feature selection, like most things in data science, is highly contextual and data dependent, with no one-stop solution. There are a few techniques that can be utilised to account for this, which I will cover in more detail in the data preparation section.

Additionally, total_bedrooms is poorly correlated with median_house_value, indicating that there is no linear relationship between this feature and our target attribute. This does not rule out a non-linear relationship.

Next, is to check how the response variable correlates with other predictors. The correlation matrix below shows us the linear relationship of the target attribute with different features. 

corr_matrix['median_house_value']

longitude            -0.045967
latitude             -0.144160
housing_median_age    0.105623
total_rooms           0.134153
total_bedrooms        0.049686
population           -0.024650
households            0.065843
median_income         0.688075
median_house_value    1.000000
Name: median_house_value, dtype: float64

With the geographical data available we can plot the longitude and the latitude in a scatter plot along with different parameters to reveal patterns. In the scatter plot below, the radius of each circle represents the district's population and the colour map shows the median house value. The yellow circles mark the costlier areas, while the purple circles mark the cheaper ones. The map reveals that properties closer to the sea are valued higher, especially in the vicinity of Los Angeles and San Francisco.

Note: There is a bug in the pandas plot method when choosing a colour map via cmap, causing the x-axis ticks to disappear. A quick workaround is to use subplots, as shown in the syntax below.

fig, ax = p.subplots()
housing.plot(kind="scatter",
             x="longitude",
             y="latitude",
             alpha=0.4,
             s=housing["population"]/100,
             label="Population", figsize=(10, 7),
             c="median_house_value", cmap='viridis',
             ax=ax)

Generating Training and Testing Data sets

Once we have gathered sufficient insights through the exploratory analysis phase, the data needs to be sliced into a training set and a test set. Many data scientists apply this step even before exploratory analysis, depending on the amount of decision making required during the process. However, it certainly must be done before any data processing has taken place. It is a precautionary step to make sure that the training data does not learn from the test data in any way. For example, when we impute the missing values in the training data, we have to rely solely on the training data and cannot allow it to derive values from the test data. Let's assume the strategy is to fill the missing values with median values. The median generated by the entire data set will be different from the median generated by the training set alone, so using the latter ensures that the training data has not learned from the entire data set.

We initially saw the uneven distribution of our only strongly correlated predictor, median income. The observations are scaled into 15 divisions. There is a high density of observations between 2 and 6, while the other bins are poorly represented. The box plot below shows how the data is spread across this attribute. As we learnt from the descriptive statistics, 75% of the values fall below 4.74, while the maximum is capped at 15.

housing.boxplot(column=['median_income'])

As the data set is quite small, the ideal scenario is for the model to learn from all house values. Random sampling may result in a training set that is not representative of the population; in large data sets this usually matters less. It is therefore better to opt for stratified sampling. Instead of stratifying by median income I chose to stratify by the house value, our target attribute.

#dividing data into strata
import numpy as np

housing["value_cat"] = np.ceil(housing["median_house_value"] / 100000)
housing["value_cat"].where(housing["value_cat"] < 5, 5.0, inplace=True)

#stratified sampling with scikit-learn (a single 80/20 split)
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["value_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

#drop value_cat
for set_ in (strat_train_set, strat_test_set):
    set_.drop("value_cat", axis=1, inplace=True)

The syntax reserves 20% of the data for the test sample. Additionally, I set a random seed, which means that the sample is not truly random but generated by a deterministic algorithm. This is usually done for reproducibility: if you were to run the exact same syntax and compare your solution to mine, the split would be identical every time the code is run.

Data Pre-processing - SK-Learn Transformation Pipeline

Before we dive into the data for processing, we need to analyse why we are following these processing steps. One of the main reasons is to ensure data quality, as most ML methodologies CANNOT work with categorical data and missing values. Additionally, when applying ML techniques, we need to be careful about the assumptions of each technique. Therefore, if I were to take the log of the long-tailed features to make them more normally distributed, it is beneficial to know whether that step is necessary and, if so, the reasoning behind it. Even though treating the data to meet such assumptions falls under data processing, to keep this project more organised I will add the processing related to each ML method in the section pertaining to that method. Additionally, I have restricted this project to two models, or estimators, to focus more on the theoretical part. The project is therefore specifically designed, and the data specifically transformed, considering only regression and random forest.

For this entire process I will be using a scikit-learn transformation pipeline. The steps pertaining to the pipeline should be studied with caution, as the official scikit-learn documentation has issued warnings that several classes are in an experimental phase and their APIs are subject to change.

Before I get into the details of how to write the syntax for transformation pipelines, I will go through the reasoning behind each step of the pipeline.

Discretizing Longitude & Latitude Features

In the exploratory analysis I plotted the data along the latitude and longitude to see how location relates to the other features. By visual analysis, it was discerned that position on the map has some impact on price. The ability of humans to visually discern patterns is what makes us superior to machines. It can therefore be concluded that longitude and latitude are of some significance to this study; however, these features cannot be used in their raw form, as they are spatial coordinates describing positions on the Earth's surface rather than ordinary numerical predictors.

One common way to treat these features is by discretizing them. Discretization refers to the process of converting or partitioning continuous variables into discrete bins. A useful library for this purpose is Geohash, which is available in Python. Partitioning of geospatial data is commonly known as geohashing. It creates grids on the map that divide the latitude and longitude into boxes relevant to the geospatial position. It essentially bins the data by its position on the map and gives a unique code to each position that can be decoded back to the longitude and latitude coordinates.

Firstly, the process transforms longitude and latitude into categorical data, and unfortunately categorical data cannot be fed directly into scikit-learn ML algorithms. Secondly, as there is no method in scikit-learn to achieve this transformation, we cannot add it directly to the transformation pipeline; it could, however, be added by writing a custom transformer. As this is a non-statistical transformation which only concerns encoding data and does not affect the variance or distribution of the features, I have excluded this process from the pipeline. The syntax therefore has to be applied to both train and test data. The encoding will not differ whether it is done separately or on the entire data set. Essentially, I am creating another variable in the pandas dataframe, determined not by statistics but by visual discernment. The precision in the syntax below refers to the length of the geohash code.

#geohash encoding (assuming the python-geohash library, imported here as gh)
import geohash as gh

strat_train_set['geohash'] = strat_train_set.apply(
    lambda x: gh.encode(x.latitude, x.longitude, precision=3), axis=1)
strat_test_set['geohash'] = strat_test_set.apply(
    lambda x: gh.encode(x.latitude, x.longitude, precision=3), axis=1)

#check the number of unique geohashes for the training set only
strat_train_set['geohash'].nunique()

Treating Categorical Data for ML

Most Python libraries for ML cannot deal with categorical data as of yet. Therefore, we have to give a numerical representation to the categorical features ocean_proximity and geohash. One way to do this is label encoding; however, that leads to an incorrect assumption about the variable. It assigns a number, in ascending order, to every unique category in the feature, thereby assigning a value which essentially means nothing. That value would nevertheless impact the machine learning model, as it would be treated as a numerical attribute.

Therefore, one-hot encoding is used instead. One-hot encoding is a process by which categorical variables are converted into a form that can be fed to ML algorithms to improve prediction. It is essentially a binary assignment of 1 to every category when it is present and 0 when it is not. The data is pivoted by category from rows to columns, which allows distinction between them. To understand this better, I have attached a table that showcases how the categorical data changes with these methodologies. However, to apply regression to this model we need to be careful not to fall into the dummy trap, which makes the VIF infinite and leaves us unable to invert the Gram matrix, making it impossible to estimate the coefficients. It is therefore advisable to drop one of the dummy variables.

The syntax to perform this step can be added to the transformation pipeline by using scikit-learn's OneHotEncoder, as sketched below.
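A minimal sketch of one-hot encoding on its own, applied to ocean_proximity; drop='first' removes one dummy column to avoid the dummy trap (this parameter is available from scikit-learn 0.21, and very recent versions use sparse_output in place of sparse):

#one-hot encoding sketch for ocean_proximity
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(drop='first', sparse=False)
ocean_1hot = cat_encoder.fit_transform(strat_train_set[['ocean_proximity']])
print(cat_encoder.categories_)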

Missing Value Imputations

Most ML libraries cannot handle missing data. Therefore, we have to decide how we are going to account for missing values in our sample. The simplest method that retains all the data is to replace missing values with the median of all values in the training set. This method works reasonably well but can distort the findings. There are a few other methods that offer higher precision on erratic and large data-sets.

A method that is becoming increasingly popular is multiple imputation. It is a process where the missing values are filled in multiple times to create "complete" data-sets. Filling any missing value comes with a level of uncertainty, as there is no way of knowing that the assigned value is true. Multiple imputation is useful because it accounts for this uncertainty. Multiple Imputation by Chained Equations (MICE) works under the assumption that the missing data are Missing at Random (MAR). MAR means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data. Whether an observation is missing has nothing to do with the missing values themselves, but it does have to do with the values of the individual's observed variables. For example, if men are more likely to tell you their age than women, then age is MAR. Additionally, the method assumes Gaussian (output) variables.

For k-Nearest Neighbour imputation, the missing values are estimated with a kNN algorithm. These values are obtained using similarity-based methods that rely on distance metrics (Euclidean distance, Jaccard similarity, Minkowski norm, etc.). They can be used to predict both discrete and continuous attributes. The main disadvantage of kNN imputation is that it becomes time-consuming on large data-sets, because it searches for similar instances across the whole data-set. Scikit-learn offers mean, median and most-frequent imputation through the SimpleImputer transformer (renamed from Imputer, depending on the scikit-learn version being used), and, interestingly, a MICE-style approach through the experimental IterativeImputer. There is also a library known as fancyimpute which offers more sophisticated imputations such as kNN, decision trees, et cetera.

If we are absolutely certain that the values are missing completely at random, it is also reasonable to simply discard the observations with missing values.

The data-set is relatively straightforward, and following Box's advice it is best to adopt the simplest approach. In this case that is to fill the missing values with the median. When using a pipeline, the median generated from the training data-set will also be used on the test data-set. If done without a pipeline, imputation on the test data-set would generate a different value, and thus a different result.
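A minimal sketch of median imputation on its own, assuming SimpleImputer is available (scikit-learn 0.20 or later); fitting on the training data only means the test set reuses the training median:

#median imputation sketch for total_bedrooms
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(strat_train_set[['total_bedrooms']])             #learns the training median
train_bedrooms = imputer.transform(strat_train_set[['total_bedrooms']])
test_bedrooms = imputer.transform(strat_test_set[['total_bedrooms']])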

Feature Scaling - Standardisation

As we are dealing with data measured in different units, as well as features which have already been scaled (in the case of median income), it is better to standardise the data to make sure the weights of the variables are comparable. Standardisation converts the original measurements to unit-less variables. However, this is only relevant for the regression part; monotonic transformations do not affect random forest.

The easiest way to achieve this is by using the StandardScaler method from scikit-learn. StandardScaler performs z-score normalisation: it transforms each feature so that it has a mean of 0 and a standard deviation of 1.
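A minimal sketch of the idea outside the pipeline, illustrated on two of the numeric columns; the scaler is fitted on the training data only so that the same mean and standard deviation are reused for the test set:

#z-score standardisation sketch: z = (x - mean) / std
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_train = scaler.fit_transform(strat_train_set[['median_income', 'total_rooms']])
scaled_test = scaler.transform(strat_test_set[['median_income', 'total_rooms']])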

Transformation Pipeline

Before setting up the transformation pipeline, the first thing we need to do is separate the predictors from the target, setting aside our X_train, y_train, X_test and y_test dataframes:

#separating predictors from target values
X_train = strat_train_set.drop("median_house_value", axis=1)
y_train = strat_train_set["median_house_value"].copy()
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

Transformation pipelines are extremely useful for combining multiple transformers and estimators into one process. They make the code easier to use and edit, and give a clearer picture of how the whole ML workflow was developed. Traditionally you would write many separate pieces of code to transform your data-set; with pipelines you can combine all the transformations into one. Furthermore, you can use FeatureUnion to run several transformers in parallel and concatenate their outputs into a single feature set, which is handy when different groups of columns need different treatment, so you avoid running the script multiple times to get the desired result.

Let’s say you want to implement a transformation that is not in scikit-learn. Thankfully, scikit-learn provides a FunctionTransformer that allows you to write your transformation and apply it to the entire dataset. Or you can write your own custom transformer! In the case of changing our categorical variables into a numerical form that is acceptable to our model, we can opt for a custom transformer. I have already explained in the previous section why this is important. Even though scikit-learn offers the DictVectorizer module, it does not have the input and output that we want.

The biggest issue you will face with scikit-learn is that it does not work with non-numeric data and cannot handle nulls. Pandas data frames have labelled rows and columns, which makes it easier for us to make inferences. Scikit-learn uses NumPy arrays, which will basically just tell you that feature 3 was very important, and then you end up scratching your head trying to figure out what on earth feature 3 was. Similarly, scikit-learn transformers will downcast this data and strip out the labels, making it a numbers-only territory.

Now there are a few ways to go about solving this. Either you can change everything to a NumPy array and then transform it back to a dataframe, the approach Aurélien Géron uses, or you can choose a more interesting and complex path and write all your custom transformers so that they return a dataframe, keeping the entire process within pandas. Pipelines are pandas-friendly only if your custom transformers know how to return a dataframe. The advantage of the former is that most transformations you will be using are already part of scikit-learn, and you will often need only a line or two to call them. But in real-world scenarios you will almost always have to write a custom transformer for fancier functions.
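The classes used in the pipeline below (PDFeatureUnion, ColumnExtractor, DummyTransformer, DFImputer, PDStandardScaler) are custom transformers written in this dataframe-returning style; they are not part of scikit-learn. As an illustration only, here is a minimal sketch of what a ColumnExtractor-style transformer could look like:

#illustrative pandas-friendly custom transformer (an assumption, not scikit-learn's API)
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Select a list of columns and return them as a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self                      #nothing to learn

    def transform(self, X):
        return X[self.columns].copy()    #keep the labels: return a DataFrame, not an array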

And finally, we can construct our pipeline. The next step is to fit the data to the train set ONLY and then transform both train and test datasets.

#pipeline for transformations
pipeline = Pipeline([
    ('features', PDFeatureUnion([
        ('cat', Pipeline([
            ('extract', ColumnExtractor(cat_feature)),
            ('dummy', DummyTransformer())
        ])),
        ('numerics', Pipeline([
            ('extract', ColumnExtractor(num_feature)),
            ('imputer', DFImputer()),
        ]))
    ])),
    ('scale', PDStandardScaler())
])

#fit only to training data
pipeline.fit(X_train)

#transform train data
X_train_transformed = pipeline.transform(X_train)

#transform test data
X_test_transformed = pipeline.transform(X_test)

The logic can become quite complex here, but the reason most data scientists prefer writing their own functions is to have greater control and flexibility over the decision making of the project. If this becomes a little complicated to follow, you can always opt for the standard functions of the library.

Great! Now we can finally start applying this transformed data to our machine learning algorithms. 

Performance Measures

Before we jump into passing the transformed data to ML models, we must decide which performance measure will be the governing factor in deciding which model performs best. In all honesty I started writing this piece to address some of the misconceptions out there concerning regression; however, I ended up elaborating on more steps after encountering data scientists who are struggling to identify themselves in this role.

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. 
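For reference, for m observations with predicted values ŷ_i and observed values y_i, the RMSE can be written as:

\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^{2}}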

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For now, we will stick with RMSE as the performance measure for our models. 

Regression

Linear Regression

I will try to address some of the questions that are scattered all over the internet regarding the assumptions of linear regression. Just by doing so I learnt about various techniques people have adopted to solve regression problems. My advice is that when we are concerned with ML algorithms which rely on statistical inference, it is best to stick to books written by statisticians or mathematicians. However, in most scenarios, given the flexibility of a project, you do not need to restrain yourself too strictly with all these mathematical constraints. I have highlighted a few assumptions of linear regression that carry some weight in how we transform our data. There is, however, no full agreement amongst statisticians regarding these assumptions, and you will always come across differing suggestions for treating bias in your results.

Let’s begin by applying the linear regression model without considering any assumptions of the OLS and study the results.

#applying linear regression
from sklearn.linear_model import LinearRegression

model_OLS = LinearRegression()
model_OLS.fit(X_train_transformed, y_train)

As we are still in the decision-making phase, we will keep the test data aside and restrict our evaluations to the training data set. The model can be evaluated by calculating the RMSE as shown below.

#evaluating OLS on the training data
from sklearn.metrics import mean_squared_error

OLS_Predictions = model_OLS.predict(X_train_transformed)
OLS_mse = mean_squared_error(y_train, OLS_Predictions, multioutput='raw_values')
OLS_rmse = np.sqrt(OLS_mse)
print(OLS_mse)
print(OLS_rmse)

[4.29430075e+09]
[65530.91446157]

The result indicates a prediction error of roughly $65,530, which, considering our range of house values, seems way off. These results were generated without accounting for the dummy trap, as this project is geared more towards variable selection than precision. The results of this regression should therefore be taken loosely.

I have not seen much Python content focusing particularly on visualisations to support decision making. The best library I have come across for including visualizers in a machine learning workflow is Yellowbrick. The Yellowbrick API is specifically designed to play nicely with scikit-learn. Its primary interface is the Visualizer, an object that learns from data to produce a visualization. Just like scikit-learn estimators, visualizers are estimator objects with a similar interface along with methods for drawing. While considering the assumptions of linear regression and studying the errors, I chose to visualise the residuals and predictions to explain the assumptions of linear regression from a new perspective.

How do we visualize the prediction error of our model? We use yellowbrick. The logic is to plot the predicted values against the actual values and create inferences regarding our model. The scoring in this case is also done on the training data.

We already know through the RMSE that OLS is performing rather poorly but it is better to see where the problems are lying. 

# prediction error plot for linear regression
from yellowbrick.regressor import PredictionError

model_OLS_Error = PredictionError(LinearRegression(), identity=False, shared_limits=False)
model_OLS_Error.fit(X_train_transformed, y_train)
model_OLS_Error.score(X_train_transformed, y_train)
model_OLS_Error.poof()

The figure demonstrates that the model performs poorly, especially for data points lying at the 500,000 mark, where there is a wide range of predicted values. I previously identified this as a problem while plotting the histograms, and mentioned that it suggests a linear relationship is not recommended here.

I hope by now it is starting to make more sense why data pre-processing and exploratory analysis are fundamental to any machine learning practice. Additionally, the regression explains only 67.7% of the variance, which is not really optimal in this case.

I will not be making any changes to the model before covering the assumptions of the OLS. Below I have identified a few assumptions that are considered in making appropriate choices with transformations and selecting the regression model. 

Assumptions of Linear Regression

There are a few assumptions behind every ML technique, and sometimes we need to make sure that we are adhering to them before choosing the given technique. If this step is ignored it results in serious bias. There is limited literature when it comes to the assumptions of the linear regression model; therefore, I will draw my conclusions from three freely available research papers, with brief citations from statistics books. Even though regression is the most commonly used ML technique in data science, the literature on the assumptions of linear regression has been quite contradictory.

One of the most cited articles regarding the assumptions of linear regression is by Jason Osborne and Elaine Waters. The paper, titled 'Four Assumptions Of Multiple Regression That Researchers Should Always Test', explores the assumptions of linearity, reliability of measurement, homoscedasticity, and normality. In fact, the search terms 'assumptions of multiple regression' show the Osborne and Waters article among the top results on Google. On a side note, if you search for 'Obama and Trump sentiment analysis', my Github project on the topic is currently amongst the top 4 results on Google.

1. Normal Distributions of Independent Variables and Disturbance Term

If you are well versed in the topic, you must have heard that the residuals of a regression model should follow a normal distribution, i.e. the normal distribution of the disturbance term, but this assumption is only required for trustworthy significance tests and confidence intervals in small samples.

Osborne and Waters begin the paper by emphasising that it is the variables that need to follow normal distributions (the overall frequency distribution should be normal) and that highly kurtotic variables can distort relationships and significance tests. I found another interesting paper written to debunk these assumptions. To an extent, Osborne and Waters are not wrong in suggesting that the normality of the variables should be checked and that it will have an impact on significance tests; HOWEVER, they do not mention in which case this is TRUE. The linked article which opposes these assumptions also fails to mention when the normality of the variables should specifically be of concern, and dismisses the assumption as a misconception.

So, there are two assumptions which are quite often confused with each other: the distribution of the variables and the distribution of the disturbance terms. Both assumptions are valid under separate circumstances and should not be conflated. The normal distribution of the disturbance terms is required when considering fixed effects models, and the normal distribution of the independent and dependent variables is required when considering the random effects model.

According to Poole and O'Farrell, in inferential statistics the two significance tests, Student's t-test and Snedecor's F-test, require the disturbance term to be normally distributed; however, normality does not need to be assumed for point estimation. They further elaborate that even for inferential problems this assumption is not binding if the sample is very large. When the sample size is sufficiently large (>200), the normality assumption is not needed, as the Central Limit Theorem ensures that the distribution of the disturbance term will approximate normality. Interval estimates are most often made when computing confidence intervals for individual regression coefficients, and when calculating interval estimates for predicted values of Y corresponding to specific values of X; in both cases Student's t is used. They go into much more detail on the different scenarios in which these assumptions need to be met, which you can read in their paper, 'The assumptions of the linear regression model'.

Unlike the internet sources that deem both assumptions misconceptions, the books and papers suggest that both are in fact valid under certain circumstances. In summary, OLS does not require the error term to follow a normal distribution; however, satisfying this assumption allows you to perform statistical hypothesis testing and to generate reliable confidence and prediction intervals. The data in scope is cross-sectional, and given the requirements of these assumptions we do not need to follow this rule religiously. However, when discussing homoscedasticity, I will show how this can be visualized with Yellowbrick as well.

2. Multi-collinearity  

The presence of correlations between the predictors is termed collinearity (for a relationship between two predictor variables) or multicollinearity (for relationships between more than two predictors). A perfect correlation between two or more predictors can mean that no unique least squares solution to a regression analysis can be computed. More commonly, less severe multicollinearity leads to unstable estimates of the coefficients for individual predictors: that is, the standard errors and confidence intervals of the coefficient estimates will be inflated.

The extent to which multicollinearity is a concern depends on the aims of the analysis. If prediction is the main objective, as in our case, multicollinearity is not a significant obstacle, as prediction of the response variable (including prediction intervals) will not be affected. If the aim is inference about population parameters, multicollinearity becomes more of a problem, as it becomes difficult to decipher which variables are responsible for the result.

3. Other Assumptions of Error Terms - It is all about the Errors!

Apart from the normal distribution of the errors, there are a few other assumptions of OLS that revolve around the residuals. Regression also assumes a zero conditional mean of the error, which is one of the fundamental conditions for the regression coefficients to be unbiased. The error term accounts for the variation in the dependent variable that the independent variables do not explain, and random chance should determine its values. For the model to be unbiased, the average value of the error term must equal zero. Violations occur for two main reasons: omitting an important explanatory variable, and a mis-specified functional relationship between the dependent and independent variables (e.g. an omitted squared term, or using level instead of log or log instead of level). Violation of this assumption can be avoided by including a constant in your regression model, because it forces the mean of the residuals to equal zero.

Additionally, regression assumes the errors to be independent. This means that residuals should be uncorrelated.

And lastly regression assumes, homoscedasticity (constant variance) of errors. This basically means that the model errors are generally assumed to have an unknown but finite variance that is constant across all levels of the predictor variables. This assumption is also known as the homogeneity of variance assumption. If the errors have a variance that is finite but not constant across different levels of the predictor/s (i.e., heteroscedasticity is present), ordinary least squares estimates will be unbiased and consistent if the errors are independent but will not be efficient.

Residual Plot

Residuals, in the context of regression models, are the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. Below I have plotted the residual plot for OLS with Yellowbrick:

from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(model_OLS)
visualizer.fit(X_train_transformed, y_train)
visualizer.poof()

A common use of the residuals plot is to analyse the variance of the error of the regressor. If the points are randomly dispersed around the horizontal axis, a linear regression model is usually appropriate for the data. In the case above, we do not see a random, uniform distribution of the residuals against the target. The residual plot is an indicator of three things:

1. The residuals spread randomly around the 0-line, indicating that the relationship is linear.

2. The residuals form an approximately horizontal band around the 0-line, indicating homogeneity of the error variance.

3. No single residual stands visibly away from the random pattern of the residuals, indicating that there are no outliers.

All these 3 indicators are absent from the model. The residuals are not spread randomly and are following a visible pattern. The histogram however reveals that our error is normally distributed around zero, which generally indicates a well fitted model.

Even though we have a good R-squared indicator, the OLS model is not applicable to this data set. You can therefore run a polynomial regression and see whether the results change!
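A minimal sketch of that experiment with scikit-learn's PolynomialFeatures; degree=2 is an illustrative choice, and with roughly 45 columns the expanded feature space makes the fit noticeably slower:

#quadratic polynomial regression sketch
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('ols', LinearRegression())
])
poly_model.fit(X_train_transformed, y_train)
poly_rmse = np.sqrt(mean_squared_error(y_train, poly_model.predict(X_train_transformed)))
print(poly_rmse)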

Lasso Regression

Even though the data set at this stage might not be suitable for OLS, I wanted to give a brief background on the different linear regression techniques available in scikit-learn. I have worked with data that had more than 100 features at a time; after encoding our categorical data, our data-set stands at a strong 45. Instead of going through the tedious task of eliminating features manually, we can adopt approaches that do this for us. One such approach is Lasso regression. Lasso regression is a shrinkage and variable selection method for linear regression models. The goal of Lasso regression is to obtain the subset of predictors that minimises prediction error for a quantitative response variable. Lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink towards zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model, while variables with non-zero regression coefficients are the most strongly associated with the response variable.

Let’s evaluate the features before we run a Lasso regression. The feature engineering process involves selecting the minimum features required to produce a valid model, because the more features a model contains, the more complex it is (and the sparser the data), and therefore the more sensitive the model is to errors due to variance. We can visualise this by ranking features with Yellowbrick. I am going to use Pearson correlation as my method of evaluation.

X_train_transformed.columns

all_features = ['geohash=9mg', 'geohash=9mu', 'geohash=9mv', 'geohash=9my',
       'geohash=9nz', 'geohash=9pp', 'geohash=9pr', 'geohash=9q3',
       'geohash=9q4', 'geohash=9q5', 'geohash=9q6', 'geohash=9q7',
       'geohash=9q8', 'geohash=9q9', 'geohash=9qb', 'geohash=9qc',
       'geohash=9qd', 'geohash=9qe', 'geohash=9qf', 'geohash=9qg',
       'geohash=9qh', 'geohash=9qj', 'geohash=9qk', 'geohash=9qm',
       'geohash=9qn', 'geohash=9qs', 'geohash=9r0', 'geohash=9r1',
       'geohash=9r2', 'geohash=9r3', 'geohash=9r4', 'geohash=9r6',
       'ocean_proximity=<1H OCEAN', 'ocean_proximity=INLAND',
       'ocean_proximity=ISLAND', 'ocean_proximity=NEAR BAY',
       'ocean_proximity=NEAR OCEAN', 'longitude', 'latitude',
       'housing_median_age', 'total_rooms', 'total_bedrooms', 'population',
       'households', 'median_income']

#pairwise Pearson correlation ranking with Yellowbrick
from yellowbrick.features import Rank2D

features_ranking = Rank2D(features=all_features, algorithm='pearson')
features_ranking.fit(X_train_transformed, y_train)
features_ranking.transform(X_train_transformed)
features_ranking.poof()

This is basically a correlation matrix showing the linear correlation between two variables. We do see some strong correlations between variables in the lower end of the matrix. 

I have applied Lasso regression to the dataset in the same way as OLS.

#Lasso regression
from sklearn.linear_model import Lasso

model_lasso = Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
                    normalize=False, tol=0.0001, warm_start=False)
model_lasso.fit(X_train_transformed, y_train)
Lasso_Predictions = model_lasso.predict(X_train_transformed)
Lasso_mse = mean_squared_error(y_train, Lasso_Predictions)
Lasso_rmse = np.sqrt(Lasso_mse)
print(Lasso_rmse)

65529.72432518464

#prediction error plot for Lasso
model_lasso_Error = PredictionError(Lasso(alpha=0.1, copy_X=True, fit_intercept=True,
                                          max_iter=1000, normalize=False, tol=0.0001,
                                          warm_start=False))
model_lasso_Error.fit(X_train_transformed, y_train)
model_lasso_Error.score(X_train_transformed, y_train)
model_lasso_Error.poof()

We get almost exactly the same result as linear regression, because Lasso is also a linear model and rests on the same assumptions as OLS. However, I would like to point out how this approach changes feature importance. The feature importances obtained with OLS and Lasso are shown graphically below:

Remember how we evaluate linear regression: the coefficients tell us which features are important, and the larger the coefficient, the more important the feature. With plain linear regression we see features that should have little impact on the model receiving very high coefficients, while median income, our main feature, is pushed towards 0. With Lasso, however, median income is the feature with the highest coefficient. For feature relevance and importance with a large number of features, Lasso is the better approach among the linear models.
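A minimal sketch of how such a comparison can be produced from the two fitted models (the bar chart itself is only illustrative):

#compare OLS and Lasso coefficients side by side
import pandas as pd

coef_compare = pd.DataFrame({
    'OLS': model_OLS.coef_,
    'Lasso': model_lasso.coef_
}, index=X_train_transformed.columns)
coef_compare.plot(kind='barh', figsize=(8, 12))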

On a side note, scikit-learn also offers a LassoCV object that automatically tunes the regularisation strength (alpha) for you through cross-validation!
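A minimal sketch of that, assuming 5-fold cross-validation:

#LassoCV picks alpha by cross-validation
from sklearn.linear_model import LassoCV

model_lasso_cv = LassoCV(cv=5).fit(X_train_transformed, y_train)
print(model_lasso_cv.alpha_)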

Ridge Regression

Ridge regression is an older approach than Lasso. The key difference is that where Lasso shrinks the coefficients of low-importance features all the way to 0, Ridge only forces the β coefficients to be smaller; it does not force them to be exactly zero. One of the fundamental reasons we use these two approaches is that a large number of features can cause overfitting. Additionally, we have a noisy sample, with features such as total_bedrooms containing missing values. These inaccuracies can lead to a low-quality model if it is not trained carefully: the model might end up memorising the noise instead of learning the trend of the data.

Try running Ridge regression on the same data and see how the results change!
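A minimal sketch of that experiment, using scikit-learn's default alpha=1.0 purely for illustration and evaluating it in the same way as OLS and Lasso:

#Ridge regression sketch
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

model_ridge = Ridge(alpha=1.0)
model_ridge.fit(X_train_transformed, y_train)
Ridge_rmse = np.sqrt(mean_squared_error(y_train, model_ridge.predict(X_train_transformed)))
print(Ridge_rmse)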

Random Forest

Let’s try the model that I have had my bets on since the exploratory phase: the RandomForestRegressor! Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further. Let’s begin by applying the Random Forest model!

from sklearn.ensemble import RandomForestRegressor

model_RFR = RandomForestRegressor()
model_RFR.fit(X_train_transformed, y_train)
RFR_Predictions = model_RFR.predict(X_train_transformed)
RFR_mse = mean_squared_error(y_train, RFR_Predictions)
RFR_rmse = np.sqrt(RFR_mse)
print(RFR_rmse)

22043.15446425278

The model now has an error of 22,043, which is much better than what we were producing with regression. The prediction error plot also shows massive improvements, with an R-squared value of 79.7%. Note that half of our predicted values were below 0 with regression! Additionally, the capped values at 500,000 are not having as large an impact on the model as they did before.
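Since random forests expose how much each feature contributes to the splits, here is a minimal sketch of inspecting the fitted forest (this assumes the pandas-friendly pipeline, so X_train_transformed still carries column labels):

#inspect which features the forest relies on most
import pandas as pd

importances = pd.Series(model_RFR.feature_importances_,
                        index=X_train_transformed.columns)
print(importances.sort_values(ascending=False).head(10))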

Fine Tune your Model - GridSearchCV

Currently, we have not tested our model on the test set. Therefore, we still do not have a clear picture of how our model will perform. We can further improve the model with cross-validation approaches. Note that running this example may take a long time.

To understand what our CV score should be let’s plot the cross validation curve with some basic inputs. 

from yellowbrick.model_selection import ValidationCurve

viz = ValidationCurve(
    RandomForestRegressor(), param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="r2"
)

# Fit and poof the visualizer
viz.fit(X_train_transformed, y_train)
viz.poof()

Model validation is used to determine how effective an estimator is on data that it has been trained on as well as how generalizable it is to new input. 

In order to maximize the score, the hyperparameters of the model must be selected which best allow the model to operate in the specified feature space. Most models have multiple hyperparameters and the best way to choose a combination of those parameters is with a grid search. However, it is sometimes useful to plot the influence of a single hyperparameter on the training and test data to determine if the estimator is underfitting or overfitting for some hyperparameter values.

We can see in the resulting visualization that a depth limit of less than 7 levels severely underfits the model, because the training and testing scores climb together in this range and because of the high variability of the cross-validation test scores. After a depth of 8, the training and test scores diverge, because deeper trees begin to overfit the training data. Since the cross-validation score does not necessarily decrease, we can assume that the model is not suffering from high error due to variance.

With this we can apply our findings to grid search CV and evaluate the prediction error plot.

from sklearn.model_selection import GridSearchCV

param_grid = [{'n_estimators': [5, 10, 30], 'max_features': [2, 4, 6]}]
grid_search = GridSearchCV(model_RFR, param_grid, cv=8, scoring='r2')
grid_search.fit(X_train_transformed, y_train)
grid_search.best_params_

final_model = grid_search.best_estimator_
final_model_Error = PredictionError(final_model)
final_model_Error.fit(X_train_transformed, y_train)
final_model_Error.score(X_test_transformed, y_test)
final_model_Error.poof()
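As a final sanity check, a minimal sketch of computing the tuned model's RMSE on the held-out test set, reusing the names defined above:

#final RMSE on the test set
from sklearn.metrics import mean_squared_error
import numpy as np

final_predictions = final_model.predict(X_test_transformed)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)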

We have massively improved our results by incorporating a few techniques, using visualisations to guide us through the process along with data science knowledge. We could use other ML models such as SVMs and try different approaches, such as taking a log1p transformation or using polynomial features. We could further use ensemble models, combining the best models, to get the most out of our results.

Machine learning is an iterative process: you learn from the data in order to make decisions, and quite often you have to go back and adjust step 1 of the entire process to get better results. This project is meant to give an understanding of what an ML project usually looks like with Scikit-Learn, using the easiest possible example. There are usually more steps to incorporate, with several other adjustments, to choose the right model for your data set.

It is meant to give an overview of how you can go about making those decisions. Reach out if you have any questions!