
KING COUNTY HOUSE PRICES

In [1]:

## THE LIBRARIES USED IN THIS NOTEBOOK

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics
import seaborn as sns
from sklearn.ensemble import AdaBoostRegressor


We load the file into pandas to see which features the dataset has

This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. Here you have the Kaggle competition. I start with this dataset because I followed the ones used in the University of Washington course on Coursera. If you want to take a look at the contents, check out the course Machine Learning Foundations.

Here you can find a brief version of the features in the dataset:

  • Price
  • Bedrooms
  • Bathrooms
  • Sqft living
  • Sqft_lot
  • Floors

and so on

In [2]:

dataset_link = 'https://bit.ly/2GxNbuV'
houses_df = pd.read_csv(dataset_link)

minimum_y = houses_df['price'].min()
# s sets the marker size
# alpha --> 0.0 transparent through 1.0 opaque
plt.scatter(x = houses_df.sqft_living, y = houses_df.price/minimum_y, s = 1, alpha = 1)
plt.xlabel("SQFT_LIVING")
plt.ylabel("PRICE / MINIMUM PRICE")
plt.title("SEATTLE HOME PRICES")
plt.yscale('log')

# Let's see what we have in the dataset
houses_df.head()


                 0                1                2                3                4
ID               7129300520       6414100192       5631500400       2487200875       1954400510
DATE             20141013T000000  20141209T000000  20150225T000000  20141209T000000  20150218T000000
PRICE            221900           538000           180000           604000           510000
BEDROOMS         3                3                2                4                3
BATHROOMS        1.00             2.25             1.00             3.00             2.00
SQFT_LIVING      1180             2570             770              1960             1680
SQFT_LOT         5650             7242             10000            5000             8080
FLOORS           1.0              2.0              1.0              1.0              1.0
WATERFRONT       0                0                0                0                0
VIEW             0                0                0                0                0
GRADE            7                7                6                7                8
SQFT_ABOVE       1180             2170             770              1050             1680
SQFT_BASEMENT    0                400              0                910              0
YEAR_BUILT       1955             1951             1933             1965             1987
YEAR_RENOVATED   0                1991             0                0                0
ZIPCODE          98178            98125            98028            98136            98074
LAT              47.5112          47.7210          47.7379          47.5208          47.6168
LONG             -122.257         -122.319         -122.233         -122.393         -122.045
SQFT_LIVING15    1340             1690             2720             1360             1800
SQFT_LOT15       5650             7639             8062             5000             7503

We use a log scale because it allows a large range of values to be displayed without the small values being compressed into the bottom of the graph. If you want to see the difference between a linear and a log scale, check out this question on Stack Overflow.
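To see the difference for yourself, here is a quick illustrative sketch, separate from the main analysis, that draws the same scatter plot twice, once with a linear y-axis and once with a log y-axis, reusing houses_df from above:

# Illustration only: the same scatter plot with a linear and a log y-axis
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, scale in zip(axes, ['linear', 'log']):
    ax.scatter(houses_df.sqft_living, houses_df.price, s=1, alpha=0.5)
    ax.set_yscale(scale)           # switch the y-axis scale
    ax.set_title(scale + ' scale')
    ax.set_xlabel('sqft_living')
    ax.set_ylabel('price')
plt.tight_layout()
plt.show()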

Explore the data


In [3]:

pd.set_option('display.float_format', lambda x: '%.3f' % x) # suppress scientific notation
houses_df.describe().iloc[:,1:].drop(['yr_built','yr_renovated','zipcode'],axis=1)
               COUNT  MEAN     STD      MIN    25%     75%     MAX
Price          21613  540088   367127   75000  321950  645000  7700000
Bedrooms       21613  3.371    0.930    0.000  3.000   4.000   33.000
Bathrooms      21613  2.115    0.770    0.000  1.750   2.500   8.000
Sqft_Living    21613  2079     918      290    1427    2550    13540
Sqft_Lot       21613  15106    41420    520    5040    10688   1651359
Floors         21613  1.494    0.540    1.000  1.000   2.000   3.500
Sqft_Above     21613  1788     828      290    1190    2210    9410
Sqft_Basement  21613  291.509  442.575  0      0       560     4820

The average sale price of a house in our dataset is close to $540,088, with most of the values falling within the $321,950 to $645,000 range.


Pearson Correlation

To see how each variable is correlated with the others we are going to use the Pearson correlation coefficient. This is a measure of the linear correlation between two variables X and Y. It takes a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

For more information, refer to Pearson Correlation Coefficient
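Before relying on pandas' built-in .corr(), here is a small illustrative sketch that computes the coefficient from its definition for one pair of columns (price and sqft_living are simply the pair chosen for the example):

# Pearson r = covariance(x, y) / (std(x) * std(y)), computed by hand
x = houses_df['sqft_living'].values
y = houses_df['price'].values
x_c, y_c = x - x.mean(), y - y.mean()                      # centered values
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())
print('manual Pearson r:', r_manual)                       # ~0.70
print('numpy  Pearson r:', np.corrcoef(x, y)[0, 1])        # same value for comparison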


In [4]:

correlation = houses_df.iloc[:,2:].corr(method='pearson')
correlation.style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)

In [5]:

correlation.price.sort_values(ascending=False)[1:]
# we drop the first value as it is the correlation of price with itself.
VARIABLES VALUES
Sqft_Living 0.702
Grade 0.667
Sqft_above 0.606
Sqft_Living15 0.585
Bathrooms 0.525
View 0.397
Sqft_Basement 0.324
Bedrooms 0.308
Lat 0.307
Waterfront 0.266
Floors 0.257
Yr_renovated 0.126
Sqft_lot 0.090
Sqft_lot15 0.082
Yr_built 0.054
Condition 0.036
Long 0.022
Zipcode -0.053

All the features are positively correlated with the house price, except zipcode. The correlation of price with sqft_living is the greatest, at 0.702. A negative correlation between two variables means that one variable increases whenever the other decreases. Right below we can see, for each variable, the most negative correlation it has with any other column.

In [6]:

correlated_variables = correlation.idxmin()
correlation_values = correlation.min().values

correlation_dict = {'First Variable':correlated_variables.index, 'Second Variable':correlated_variables.values, 'Values':correlation_values}
pd.DataFrame(correlation_dict)
FIRST VARIABLE SECOND VARIABLE VALUES
0 price zipcode -0.053
1 bedrooms zipcode -0.153
2 bathrooms zipcode -0.204
3 sqft_living zipcode -0.199
4 sqft_lot zipcode -0.130
5 floors condition -0.264
6 waterfront long -0.042
7 view long -0.078
8 condition yr_built -0.361
9 grade zipcode -0.185
10 sqft_above zipcode -0.261
11 sqft_basement floors -0.246
12 yr_built condition -0.361
13 yr_renovated yr_built -0.225
14 zipcode long -0.564
15 lat yr_built -0.148
16 long zipcode -0.564
17 sqft_living15 zipcode -0.279
18 sqft_lot15 zipcode -0.147
Data Visualization

Now we are going to pick the most interesting columns, in this case price, bedrooms, bathrooms, sqft_living, sqft_lot and yr_built, to see how they are correlated with one another.

In [7]:

sns.set(style = "ticks", color_codes=True)
correlation_features = ['price','bedrooms','bathrooms','sqft_living','sqft_lot','yr_built']
sns.set_style("darkgrid")
sns.pairplot(houses_df[correlation_features], size = 2.75,diag_kind="kde",dropna=True)
#diag_kind:Use kernel density estimates for univariate plots:
#kind:Fit linear regression models to the scatter plots

By looking at the scatter plots you can observe the following:

  • Price → strong correlation with sqft_living; very weak correlation with yr_built
  • Bedrooms → moderate correlation with sqft_living; very weak correlation with yr_built
  • Bathrooms → strong correlation with sqft_living; very weak correlation with sqft_lot
  • Sqft_living → strong correlation with bathrooms; very weak correlation with sqft_lot
  • Sqft_lot → very weak correlation with sqft_living; very weak correlation with yr_built
  • Yr_built → moderate correlation with bathrooms; very weak correlation with sqft_lot

Here you can find the intervals used to classify the correlation strength. I think the most difficult plot to analyze is the one for yr_built, because almost all of its panels look the same. That's when it is useful to calculate the Pearson coefficient alongside the plots.

Let's divide the dataset into training and test sets

In [8]:

dataset_train, dataset_test, price_train, price_test = train_test_split(houses_df,houses_df['price'],test_size=0.2,random_state=3)
Building a Linear Regressor

Regression is the process of estimating the relationship between input data and the continuous-valued output data. This data is usually in the form of real numbers, and our goal is to estimate the underlying function that governs the mapping from the input to the output.

Ordinary Least Squares

The first method we use is ordinary least squares, and the idea behind it is to find the line that best fits the data.

The error term (also called the loss) \(\epsilon_i\) is the difference between the observed value \(y_i\) and the predicted value \(\hat{y}_i\). The sum of these squared differences is called the residual sum of squares; for more information see RSS.

The linear model is written as:

\[y_i = a + bx_i + \epsilon_i\]

Ordinary least squares (OLS) seeks the coefficients \(a\) and \(b\). The goal is to find the values of \(a\) and \(b\) that minimize the error. We write the error with the following formula:

\[\epsilon(a,b)=\sum_{i=1}^n (y_i−\hat{y}_i)^2= \sum_{i=1}^n(y_i−(a+bx_i))^2\]

This requires us to find the values of \((a, b)\) such that the gradient of \(\epsilon\) with respect to our variables (which are \(a\) and \(b\)) vanishes; that is, we require:

\[\frac {∂\epsilon}{∂a}=0\] \[\frac {∂\epsilon}{∂b}=0\]

Differentiating \(\epsilon(𝑎, 𝑏)\) yields:

\[\frac {∂\epsilon}{∂a}= 2\sum_{i=1}^n (y_i-a-bx_i)(-1)\] \[\frac {∂\epsilon}{∂b}= 2\sum_{i=1}^n (y_i-a-bx_i)(-x_i)\]

To solve these equations, remember to use:

\[\bar{x} =\frac{1}{n}\sum_{i=1}^n x_i\]

So we will end up with the following coefficients:

\[a=\bar{y}−b\bar{x}\] \[b=\frac {\sum_{i=1}^n (x_i−\bar{x})(y_i−\bar{y})}{\sum_{i=1}^n(x_i−\bar{x})^2}\]
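As a sanity check on these formulas, here is a small illustrative sketch that computes \(a\) and \(b\) directly with numpy for price against sqft_living; the values should be close to what sklearn reports below (not identical, since the sklearn model is fit on the training split only):

# Closed-form OLS: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b * x_bar
x = houses_df['sqft_living'].values
y = houses_df['price'].values
x_bar, y_bar = x.mean(), y.mean()
b = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
a = y_bar - b * x_bar
print('slope b    :', b)
print('intercept a:', a)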

RMSE → Root Mean Square Error

It indicates how close the observed data points are to the model's predicted values. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and is the most important criterion for fit if the main purpose of the model is prediction.
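Because the formula is so simple, it is worth computing it once by hand; the arrays below are toy values made up purely for illustration:

# RMSE = sqrt(mean of squared residuals); y_true and y_pred are toy values
y_true = np.array([250000., 500000., 750000.])
y_pred = np.array([260000., 480000., 800000.])
print(np.sqrt(np.mean((y_true - y_pred) ** 2)))               # by hand
print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))    # same value via sklearn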

The main advantages of using Least Squares are:

  • Applicability: there are hardly any applications where least squares doesn't work
  • Calculations are very fast
  • It has no parameters to tune

Disadvantages:

  • Sensitivity to outliers
  • Tendency to overfit data. If we have many features the learned hypothesis may fit the training set very well but fail to generalize to new examples


Let's train the model taking into account only one feature from the dataset; in this case we pick square feet of living space (sqft_living)

In [9]:

#Build the regression model using only sqft_living as a feature
# Create linear regression object
regression_ols = linear_model.LinearRegression()
# We convert the column sqft_living to a numpy array to make it easier to work with
living_train = np.asarray(dataset_train.sqft_living)
living_train = living_train.reshape(-1,1)
# Train the model using the training sets
# Here price is the "target" data in this model; the other features are the independent variables
ols_model = regression_ols.fit(living_train, price_train)
living_test = np.asarray(dataset_test.sqft_living)
living_test = living_test.reshape(-1,1)
# With the trained model we make a prediction for the test dataset
prediction_test_ols = ols_model.predict(living_test)

print ('Ordinary Least Squares')
#Coefficient
print('Coefficient:',ols_model.coef_[0])
print ('Intercept', ols_model.intercept_)
# Apply the model we created using the training data to the test data, and calculate the RSS.
print('RSS',((price_test - prediction_test_ols) **2).sum())
# Calculate the RMSE ( Root Mean Squared Error)
print('RMSE', np.sqrt(metrics.mean_squared_error(price_test,prediction_test_ols)))
#The model's performance on test set is:
print('The model\'s performance is %.2f\n'% ols_model.score(living_test, price_test))



living_test_sort = np.sort(living_test.reshape(-1))
plt.scatter(living_test, price_test, color='blue', alpha=0.25, label='Real Price')
# When plotting the regression line we have to sort the square-feet-living values from the test set;
# if we don't do this, the plot looks weird
plt.plot(living_test_sort, ols_model.predict(living_test_sort.reshape(-1,1)),'r--',linewidth=3, label='Ordinary Least Squares Regression')

plt.xlabel('Square_feet_living')
plt.ylabel('Price')
plt.legend()
plt.yscale('log')

# Blue dots are the original data; the red line is the prediction from least squares
plt.show()

Ordinary Least Squares
Coefficient: 282.24681417145496
Intercept -47235.80881852331
RSS 279538022220474.28
RMSE 254289.1477693324
The model's performance is: 0.50

In [10]:

actual_predicted_data_ols = pd.DataFrame({'Actual': price_test, 'Predicted': np.round(prediction_test_ols,decimals=3)})
actual_predicted_data_ols.head()
Actual Predicted
4131 525000 404359.094
17459 1870000 1225697.323
2192 750000 853131.528
12418 244900 127757.216
15773 275000 356377.135
Lasso Regression

It's a shrinkage and variable selection method. LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. The Lasso imposes a constraint on the sum of the absolute values of the model parameters, where the sum has a specified constant as an upper bound. This constraint causes the regression coefficients for some variables to shrink towards zero. The shrinkage process identifies the variables most strongly associated with the response variable. The goal is to obtain the subset of predictors that minimizes the prediction error. You should use this method when you have more than two features.

\[Y= \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + b\]

\(\beta_1 , \beta_2 , \beta_3\) are coefficients of regression

\(X_1, X_2, X_3\) are features

The Lasso method uses \(L_1\) regularization. What is that? It’s a way of avoiding overfitting.

\[\|X\|_1 = \sum_{i=1}^n|x_i|\]

\(L_1\) norm is the sum of the absolute value of the coefficients.

The cost function in Lasso is the following formula:

\[\epsilon = Error + Penalty\] \[\epsilon(a,b)=\sum_{i=1}^n (y_i−ŷ )^2 + \lambda\sum_{j=1}^p |\beta_j|\] \[\epsilon(a,b)=\sum_{i=1}^n(y_i−(\sum_{j=1}^p x_{ij}\beta_j))^2 +\lambda\sum_{j=1}^p|\beta_j|\]
Tuning parameter λ:

It controls the strength of the penalty.

  • As \(\lambda\) increases, more coefficients are reduced to zero
  • If \(\lambda\) is zero, then it's OLS regression
  • As \(\lambda \rightarrow \infty\) we get \(\beta=0\): all coefficients are eliminated
  • As \(\lambda\) increases, bias increases
  • As \(\lambda\) decreases, variance increases
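To see the shrinkage in action, here is a small illustrative sketch that fits Lasso on a handful of features for several values of \(\lambda\) (called alpha in sklearn) and counts how many coefficients are driven exactly to zero. The choice of features and alpha values is arbitrary, and the features are standardized because the penalty acts on the raw size of the coefficients:

# Illustration only: more coefficients hit zero as alpha (lambda) grows
from sklearn.preprocessing import StandardScaler

demo_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'yr_built']
X_demo = StandardScaler().fit_transform(houses_df[demo_features])
y_demo = houses_df['price'].values

for alpha in [1, 1000, 50000, 200000]:
    lasso_demo = linear_model.Lasso(alpha=alpha).fit(X_demo, y_demo)
    print('alpha =', alpha, '-> coefficients at zero:', int(np.sum(lasso_demo.coef_ == 0)))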

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). For more information, read about the Bias–Variance Tradeoff

Advantages

  • Greater prediction accuracy
  • Increased model interpretability: it reduces variance without a substantial increase in bias.
  • The regression coefficients of unimportant variables are reduced to zero, producing a simple model that selects only the most important predictors.

Disadvantages

  • If features are correlated, Lasso arbitrarily keeps only one of them.
  • Estimating p-values is not very straightforward

In [11]:

regression_lasso = linear_model.Lasso(alpha=.1)
lasso_model = regression_lasso.fit(living_train, price_train)
prediction_test_lasso = lasso_model.predict(living_test)

print ('Lasso Regression')
#Intercept
print ('Intercept', lasso_model.intercept_)
# Coefficient
print('Coefficient:', lasso_model.coef_[0])
# Apply the model we created using the training data to the test data, and calculate the RSS.
print('RSS',((price_test - prediction_test_lasso) **2).sum())
# Calculate the RMSE (Root Mean Squared Error)
print('RMSE', np.sqrt(metrics.mean_squared_error(price_test,prediction_test_lasso)))
# Coefficient of determination R^2 of the prediction
print('The model\'s performance is %.2f\n' % lasso_model.score(living_test, price_test))
# Plot
plt.scatter(living_test, price_test, color='green', alpha=0.25,label='Real Price')
plt.plot(living_test_sort, lasso_model.predict(living_test_sort.reshape(-1,1)),'b--',linewidth=3, label='Lasso Regression')
plt.xlabel('Square_feet_living')
plt.ylabel('Price')
plt.legend()
plt.yscale('log')


plt.show()

Lasso Regression
Intercept -47235.808571451926
Coefficient: 282.24681405273867
RSS 279538022213446.22
RMSE 254289.14776613575
The model's performance is : 0.50

In [12]:

actual_predicted_data_lasso = pd.DataFrame({'Actual': price_test, 'Predicted': np.round(prediction_test_lasso,decimals=3)})

actual_predicted_data_lasso.head()
Actual Predicted
4131 525000 404359.094
17459 1870000 1225697.323
2192 750000 853131.528
12418 244900 127757.216
15773 275000 356377.136
Ridge Regression

Ridge regression aims to avoid overfitting by adding a penalty to the RSS term of OLS. A tuning parameter \(\lambda\) controls the strength of the penalty. The \(\lambda\) parameter is a scalar that should be chosen using cross validation. The penalty uses the \(L_2\) (Euclidean) length of the coefficient vector.

The Ridge method uses \(L_2\) regularization. What is that? It’s a way of avoiding overfitting.

\[\|X\|_2^2 =\sum_{i=1}^n|x_i|^2\]

The squared \(L_2\) norm is the sum of the squared values of the coefficients.

The cost function in Ridge is the following formula:

\[\epsilon = Error + Penalty\] \[\epsilon(a,b)=\sum_{i=1}^n (y_i−ŷ )^2 + \lambda\sum_{j=1}^p |\beta_j|^2\] \[\epsilon(a,b)=\sum_{i=1}^n(y_i−( \sum_{j=1}^p x_{ij}\beta_j))^2 +\lambda\sum_{j=1}^p|\beta_j| ^2\]
Tuning parameter λ :
  • When \(\lambda = 0\), we get the linear regression estimate
  • When \(\lambda\rightarrow \infty\),we get \(\beta_{j} =0\)
  • For \(\lambda\) in between, we are balancing two ideas: fitting a linear model of y on X, and shrinking the coefficients

As with Lasso regression, the bias increases as \(\lambda\) (the amount of shrinkage) increases, and the variance decreases as \(\lambda\) increases. The amount of shrinkage is controlled by \(\lambda\), the tuning parameter that multiplies the ridge penalty. A large λ means more shrinkage, so we get different coefficient estimates for different values of λ. Choosing an appropriate value of λ is important, and also difficult.
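As a companion to the Lasso sketch above, this small illustration (again with arbitrary features and alpha values) shows that ridge shrinks the coefficients towards zero as \(\lambda\) grows but, unlike Lasso, does not set them exactly to zero:

# Illustration only: ridge coefficients shrink smoothly but stay non-zero
from sklearn.preprocessing import StandardScaler

demo_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'yr_built']
X_demo = StandardScaler().fit_transform(houses_df[demo_features])
y_demo = houses_df['price'].values

for alpha in [0.1, 1000, 100000]:
    ridge_demo = linear_model.Ridge(alpha=alpha).fit(X_demo, y_demo)
    print('alpha =', alpha,
          '| largest |coef|:', round(float(np.abs(ridge_demo.coef_).max()), 1),
          '| coefficients at zero:', int(np.sum(ridge_demo.coef_ == 0)))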

Advantages

  • Ridge regression performs particularly well when there is a subset of true coefficients that are small or even zero.
  • Sparsity (Doesn’t produce sparse results i.e. it does not shrink coefficients all the way to zero)

Disadvantages

  • It doesn’t do as well when all of the true coefficients are moderately large

In [13]:

regression_ridge = linear_model.Ridge(alpha=.1)
ridge_model = regression_ridge.fit(living_train, price_train)
prediction_test_ridge = ridge_model.predict(living_test)

print ('Ridge Regression')
#Intercept
print ('Intercept', ridge_model.intercept_)
# Coefficient
print('Coefficient:', ridge_model.coef_[0])
# Apply the model we created using the training data to the test data, and calculate the RSS.
print('RSS',((price_test - prediction_test_ridge) **2).sum())
# Calculate the RMSE (Root Mean Squared Error)
print('RMSE', np.sqrt(metrics.mean_squared_error(price_test,prediction_test_ridge)))
# Coefficient of determination R^2 of the prediction
print('The model\'s performance is %.2f\n' % ridge_model.score(living_test, price_test))
# Plot
plt.scatter(living_test, price_test, color='brown', alpha=0.25,label='Real Price')
plt.plot(living_test_sort, ridge_model.predict(living_test_sort.reshape(-1,1)),'g--',linewidth=3, label='Ridge Regression')
plt.xlabel('Square_feet_living')
plt.ylabel('Price')
plt.legend()
plt.yscale('log')

plt.show()

Ridge Regression
Intercept -47235.808814489865
Coefficient: 282.2468141695169
RSS 279538022220359.56
RMSE 254289.14776928021
The model's performance is: 0.50

In [14]:

actual_predicted_data_ridge = pd.DataFrame({'Actual': price_test, 'Predicted': np.round(prediction_test_ridge,decimals=3)})
actual_predicted_data_ridge.head()
Actual Predicted
4131 525000 404359.094
17459 1870000 1225697.323
2192 750000 853131.528
12418 244900 127757.216
15773 275000 356377.135
AdaBoost Algorithm

AdaBoost stands for Adaptive Boosting. When we mention boosting we refer to aggregating a set of weak classifiers into a strong classifier. It is adaptive in the sense that subsequent classifiers are adjusted according to the instances that were wrongly classified by the previous classifiers. You could say that, by focusing on the training samples misclassified by the previous weak classifier, each weak classifier contributes its bit the best it can to improve the overall classification rate. AdaBoost calls the weak classifiers repeatedly, producing a series of \(t = 1,...,T\) classifiers. In each round, the weight of each incorrectly classified example increases (or, alternatively, the weight of each correctly classified example decreases). New classifiers are thereby forced to focus on the examples that were incorrectly classified by previous classifiers.

Disadvantages

  • It is sensitive to noisy data and information that doesn’t belong to the required set

Advantages

  • In some situations, this algorithm may be less susceptible to overfitting (memorizing the input set) than many other algorithms

Basic Idea

1- Take lots of (possibly) weak predictors

2- Weight them and add them up

3- Get a stronger predictor

First : Initialize the weight of each observation to \(W_i =\frac { 1}{N}\) For \(t\) in 1 to T do the following.

Second : Using the weights, learn model \(h_t(x_i) : X \rightarrow [0,1]\)

Third : Compute
\(\epsilon =\sum_{i=1}^{n} w_i^t | y_i −h_t (x_i )|\) as the error for \(h_t\)

Fourth : Let \(\beta_{t}\) = \(\frac {\epsilon_{t}}{1 - \epsilon_{t}}\) and update the weights of each of the observations as \(w_i ^{(t+1)} = w_i^{(t)}\beta_{t}^{1-|y_i -h_t(x_i)|}\) This scheme increases the weights of observations poorly predicted by \(h_t\)

Fifth : Normalize \(w^{t+1}\) so that they sum to one
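The five steps above translate into a short boosting loop. The sketch below is only illustrative: it uses a shallow DecisionTreeRegressor as the weak learner \(h_t\), rescales the absolute errors into \([0,1]\) so the weight update matches the formula, and combines the weak learners with a weighted average. It is a simplification of what sklearn's AdaBoostRegressor (used in the next cell) implements, namely the AdaBoost.R2 variant, which combines learners with a weighted median:

# Illustrative boosting loop following the five steps above (not sklearn's exact algorithm)
from sklearn.tree import DecisionTreeRegressor

def simple_adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # Step 1: uniform initial weights
    learners, betas = [], []
    for t in range(T):
        h = DecisionTreeRegressor(max_depth=3)
        h.fit(X, y, sample_weight=w)              # Step 2: learn h_t using the weights
        abs_err = np.abs(y - h.predict(X))
        loss = abs_err / max(abs_err.max(), 1e-12)  # rescale losses into [0, 1]
        eps = np.sum(w * loss)                    # Step 3: weighted error of h_t
        if eps >= 0.5:                            # weak learner is too weak, stop
            break
        beta = max(eps, 1e-12) / (1.0 - eps)      # Step 4
        w = w * beta ** (1.0 - loss)              # well-predicted points get lighter weights
        w = w / w.sum()                           # Step 5: renormalize
        learners.append(h)
        betas.append(beta)
    return learners, betas

def simple_adaboost_predict(learners, betas, X):
    # Weighted average: log(1 / beta) gives more accurate learners a bigger say
    votes = np.log(1.0 / np.array(betas))
    preds = np.array([h.predict(X) for h in learners])
    return np.average(preds, axis=0, weights=votes)

# Example usage: learners, betas = simple_adaboost_fit(living_train, np.asarray(price_train))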

In [15]:

# n_estimators: controls the number of weak learners.
# learning_rate: controls the contribution of the weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
# base_estimator: lets you specify a different ML algorithm; by default sklearn uses a decision tree
adaboost_regressor = AdaBoostRegressor(n_estimators=1500, learning_rate = 0.001, loss='exponential')
ada_model = adaboost_regressor.fit(living_train, price_train)
prediction_test_ada = ada_model.predict(living_test)
# Apply the model we created using the training data to the test data, and calculate the RSS.
print('RSS',((price_test - prediction_test_ada) **2).sum())
# Calculate the RMSE (Root Mean Squared Error)
print('RMSE', np.sqrt(metrics.mean_squared_error(price_test,prediction_test_ada)))
#Coefficient of determination R^2 of the prediction
print('The model\'s performance is %.2f\n' % ada_model.score(living_test, price_test))
# Plot
plt.scatter(living_test, price_test, color='black', alpha=0.25,label='Real Price')
plt.plot(living_test_sort, ada_model.predict(living_test_sort.reshape(-1,1)),'g--',linewidth=3, label='AdaBoost regressor')
plt.xlabel('Square_feet_living')
plt.ylabel('Price')
plt.legend()
plt.yscale('log')

plt.show()

RSS 262076146288792.1
RMSE 246218.7540170321
The model's performance is: 0.53

In [16]:

actual_predicted_data_ada = pd.DataFrame({'Actual': price_test, 'Predicted': np.round(prediction_test_ada,decimals=3)})
actual_predicted_data_ada.head()
Actual Predicted
4131 525000 442630.127
17459 1870000 1469039.993
2192 750000 777804.825
12418 244900 343675.471
15773 275000 347800.098
Let's train the model adding more features

If we have too many features and are not sure which ones might work best, we can carry out a dimensionality reduction / feature selection step through either PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), as sketched below.
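As a quick illustration of that idea (it is not used further in this notebook), a PCA step could look like the following sketch; the choice of features and of three components is arbitrary:

# Illustration only: project standardized numeric features onto 3 principal components
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'yr_built']
X_scaled = StandardScaler().fit_transform(houses_df[pca_features])
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)
print('explained variance ratio:', pca.explained_variance_ratio_)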


Polynomial Curve Fitting

Consider the general form for a polynomial of order \(n\):

\[\hat y(x) = a_0 + a_1x + a_2x^2 + \dots + a_nx^n = a_0 +\sum_{i=1}^n a_ix^i\]

Just as was the case for linear regression, we ask: how can we pick the coefficients that best fit the curve to the data? We use the same idea: the curve that gives the minimum error between the data and the fit \(\hat y(x)\) is 'the best'.


Error - Least squares approach

As we mentioned before, the error using the least squares approach is \(\epsilon= \sum_{i=1}^n (y_i - \hat y_i)^2\)

\[\epsilon= \sum_{i=1}^n (y_i - (a_0 + a_1x_i +a_2x_i^2 + a_3x_i^3 + \dots))^2\]

where \(i\) is the current point and \(n\) is the total number of points that we have

\[\epsilon=\sum_{i=1}^n \left(y_i - \left(a_0 +\sum_{j=1}^n a_jx_i^j \right)\right)^2\]

To minimize this expression we take the derivative with respect to each coefficient in order to find the best curve
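As an aside, numpy's polyfit solves exactly this least-squares problem for us; the sketch below (degree 2 is an arbitrary choice, purely for illustration) fits a curve of price against sqft_living:

# Illustration only: degree-2 polynomial fit of price vs sqft_living
x_poly = houses_df['sqft_living'].values
y_poly = houses_df['price'].values
coeffs = np.polyfit(x_poly, y_poly, deg=2)     # returns [a_2, a_1, a_0]
poly_model = np.poly1d(coeffs)                 # callable polynomial

x_grid = np.linspace(x_poly.min(), x_poly.max(), 200)
plt.scatter(x_poly, y_poly, s=1, alpha=0.25, label='data')
plt.plot(x_grid, poly_model(x_grid), 'r--', linewidth=2, label='degree-2 fit')
plt.xlabel('sqft_living')
plt.ylabel('price')
plt.legend()
plt.show()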

In [17]:

dataset_train, dataset_test = train_test_split(houses_df,test_size=0.2,random_state=3)
price_train = np.asarray(dataset_train.price).reshape(-1,1)
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors','yr_built','zipcode']
train_matrix = dataset_train[my_features].values
regr_with_more_features = linear_model.LinearRegression()
model_least_squares = regr_with_more_features.fit(train_matrix, price_train)


matrix_test = dataset_test[my_features].values

price_test_multiple_regression = np.asarray(dataset_test.price).reshape(-1,1)

prediction_test_least_squares = model_least_squares.predict(matrix_test)

print ('Least Squares Means')
# Coefficients
print('Coefficient:',model_least_squares.coef_[0])
# Apply the model we created using the training data to the test data, and calculate the RSS.
print('RSS',((price_test_multiple_regression - prediction_test_least_squares) **2).sum())
# Calculate the RMSE (Root Mean Squared Error)
print('RMSE', np.sqrt(metrics.mean_squared_error(price_test_multiple_regression,prediction_test_least_squares)))
# Coefficient of determination R^2 of the prediction
print('The model\'s performance is %.2f\n' % model_least_squares.score(matrix_test, price_test_multiple_regression))

Least Squares Means
Coefficient: [-6.66351683e+04 6.66137331e+04 3.04489220e+02 -2.82811904e-01 5.38124224e+04 -3.39351106e+03 6.28399203e+01]
RSS 248170780815915.44
RMSE 239597.73367059807
The model's performance is: 0.55

In [18]:

actual_predicted_data_least_squares = pd.DataFrame({'Actual': price_test_multiple_regression.ravel(), 'Predicted': np.round(prediction_test_least_squares,decimals=3).ravel()})
actual_predicted_data_least_squares.head()
Actual Predicted
4131 525000 348437.108
17459 1870000 1328478.319
2192 750000 791693.408
12418 244900 264810.433
15773 275000 347495.643
Putting it all together

So this is the part where you ask yourself: how do I choose the best model to represent my data? In this particular case we have to look at:

-\(RMSE\)

-\(R^2\)

I will go into a little more detail about the last one. \(R^2\) is the coefficient of determination. It explains how good your model is when compared to the baseline model. The formula is given by:

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

Where:

  • \(\bar{y}\) is the mean of the observed data:

  • \(y_i\) represents the observed values

    \[SS_{tot} =\sum_{i=1}^n (y_i - \bar{y})^2\]

    \(SS_{tot}\) quantifies how much the data points \(y_i\) vary from their mean \(\bar{y}\)

    \[SS_{res} = \sum_{i=1}^n (y_i - \hat{y})^2\]

    \(SS_{res}\) quantifies how much the data points \(y_i\) vary around the estimated regression \(\hat{y}\)

If this number is large, the regression is said to give a good fit. But when is it a large number? Well, \(R^2\) goes from 0 to 1 for linear regressions.

  • \(R^2\) = 1 indicates that the regression predictions perfectly fit the data.
  • \(R^2\) = 0 indicates that the estimated regression line is perfectly horizontal

So… How do we interpret this coefficient?

”\(R^2\) ×100 percent of the variation in y is accounted for by the variation in predictor x.”

If \(R^2\)=0.55, it means that 55% of the variation in house prices is accounted for by the variation in the predictors.
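As a quick sanity check, \(R^2\) can be computed directly from the two sums of squares and compared with what .score() reported earlier; this sketch reuses price_test and prediction_test_ols from the OLS cell above:

# R^2 from its definition, using the OLS test predictions from earlier
y_obs = np.asarray(price_test)
ss_res = np.sum((y_obs - prediction_test_ols) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
print('R^2 =', 1 - ss_res / ss_tot)    # should match ols_model.score(living_test, price_test), ~0.50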

If you want to read more about this, check the Statistics Program from Pennsylvania State University

As for RMSE, the lower the better, because it means that our predictions do not deviate much from the actual values. So, taking all this into account, which are our best models? According to \(R^2\) and \(RMSE\), our best pick is linear regression with more features, followed by the AdaBoost algorithm. As for linear regression, if we take all the features into consideration we will probably get an even better model. That's all for today, folks. I hope this is useful to someone, and don't hesitate to get in touch if you have any doubt or see an error. Drop a line!

References