Data Analysis with Python (Part 3)
Model Development
In Data Analytics, we often use Model Development to help us predict future observations from the data we have.
A model helps us understand the exact relationship between different variables and how these variables are used to predict the result.
1) Linear Regression and Multiple Linear Regression
Simple Linear Regression is a method to help us understand the relationship between two variables:
The predictor/independent variable (X)
The response/dependent variable (that we want to predict) (Y)
Y: Response Variable
X: Predictor Variable
Linear function: Yhat = a + b*X
- a refers to the intercept of the regression line; in other words, the value of Y when X is 0.
- b refers to the slope of the regression line; in other words, the amount by which Y changes when X increases by 1 unit.
When we fit the model, we should get a final linear model with the structure:
Yhat = a + b*X
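As a minimal sketch (assuming a pandas DataFrame df with the 'highway-mpg' and 'price' columns used in the full script at the end of this post), fitting this model with scikit-learn looks like:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(df[['highway-mpg']], df['price'])   # X must be 2-D, hence the double brackets
a, b = lm.intercept_, lm.coef_[0]          # intercept a and slope b of Yhat = a + b*X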
Multiple Linear Regression
If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.
Yhat = a + b1*X1 + b2*X2 + b3*X3 + b4*X4
Y: Response Variable
X1: Predictor Variable 1
X2: Predictor Variable 2
X3: Predictor Variable 3
X4: Predictor Variable 4
a: intercept
b1: coefficient of Variable 1
b2: coefficient of Variable 2
b3: coefficient of Variable 3
b4: coefficient of Variable 4
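A minimal sketch of the same idea with four predictors of car price (the same columns the full script below uses):

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
print(lm.intercept_)   # a
print(lm.coef_)        # [b1, b2, b3, b4]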
2) Model Evaluation using Visualization
Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.
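For example (assuming df is loaded as above), seaborn draws the scatter of data points, the fitted line, and a confidence band in one call:

import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
plt.show()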
Residual Plot
A good way to visualize the variance of the data is to use a residual plot.
The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
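Continuing the sketch above: if the residuals are randomly scattered around zero, a linear model is appropriate; curvature in the spread suggests a non-linear model may fit better.

sns.residplot(x=df["highway-mpg"], y=df["price"])
plt.show()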
Multiple Linear Regression
One way to look at the fit of the model is by looking at the distribution plot: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.
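A minimal sketch, reusing the multiple-regression fit lm and predictors Z from above, and using kdeplot (the replacement for seaborn's removed distplot) to overlay the two densities:

Y_hat = lm.predict(Z)
ax1 = sns.kdeplot(df['price'], color='r', label='Actual Values')
sns.kdeplot(Y_hat, color='b', label='Fitted Values', ax=ax1)
plt.legend()
plt.show()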
3) Polynomial Regression and Pipelines
Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.
We capture non-linear relationships by adding squared or higher-order terms of the predictor variables.
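For example, a cubic (3rd-order) fit of price on highway-mpg with NumPy (assuming numpy is imported as np, as in the full script below):

f = np.polyfit(df['highway-mpg'], df['price'], 3)   # cubic coefficients
p = np.poly1d(f)                                    # callable polynomial p(x)
print(p)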
Pipeline
Data pipelines simplify the steps of processing the data. We use scikit-learn's Pipeline class to chain the processing steps, with StandardScaler as one step in our pipeline.
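A minimal sketch of such a pipeline (scale, expand to polynomial features, then fit a linear model), reusing Z from the multiple-regression sketch above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
pipe = Pipeline([('scale', StandardScaler()),
                 ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
                 ('model', LinearRegression())])
pipe.fit(Z, df['price'])
print(pipe.predict(Z)[0:4])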
4) Measures for In-Sample Evaluation
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
R^2 / R-squared
R-squared, also known as the coefficient of determination, is a measure of how close the data is to the fitted regression line.
The value of R-squared is the percentage of variation in the response variable (y) that is explained by the linear model. The higher the R-squared, the more closely the fitted line tracks the data points.
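In scikit-learn, R-squared comes directly from the model's score method, for example:

X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
print('R-square:', lm.score(X, Y))   # fraction of variation explained; closer to 1 is better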
Mean Squared Error (MSE)
Mean Squared Error measures the average of the squares of the errors, that is, the squared differences between the actual value (y) and the estimated value (ŷ).
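A one-liner with scikit-learn, reusing the simple model fitted above:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df['price'], lm.predict(X))   # average squared residual
print('MSE:', mse)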
5) Prediction and Decision Making
When comparing models:
- The model with the higher R-squared value is a better fit for the data.
- The model with the smaller MSE value is a better fit for the data.
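A rough sketch of that comparison, reusing X, Y, and Z from the sketches above (the variable names here are hypothetical; fit returns the estimator itself, so score can be chained):

r2_slr = LinearRegression().fit(X, Y).score(X, Y)   # simple model: highway-mpg only
r2_mlr = LinearRegression().fit(Z, Y).score(Z, Y)   # multiple model: four predictors
print(r2_slr, r2_mlr)   # prefer the model with the higher R^2 (and the lower MSE)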
The full worked script below puts all of these steps together on the automobile dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#load data and store in dataframe df:
path="C:/Users/thakudev/PYTHON/Data/automobileEDA.csv"
df=pd.read_csv(path)
print(df.head())
#load the modules for linear regression
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
# Can highway-mpg help us predict car price?
# "highway-mpg" is the predictor variable and "price" is the response variable.
X=df[['highway-mpg']]
Y=df['price']
#Fit the linear model using highway-mpg
lm.fit(X,Y)
#We can output a prediction
Yhat=lm.predict(X)
print(Yhat[0:5])
#value of the intercept (a)
print(lm.intercept_)
#value of the Slope (b)
print(lm.coef_)
#Multiple Linear Regression
Z=df[['horsepower','curb-weight','engine-size','highway-mpg']]
lm.fit(Z,df['price'])
print("Z intercept",lm.intercept_)
print("Z coe",lm.coef_)
#Visualization
import seaborn as sns
# %matplotlib inline  (Jupyter magic: only needed to render plots inline in a notebook)
width=12
height=10
plt.figure(figsize=(width,height))
#sns.regplot(x="highway-mpg",y="price",data=df)
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
plt.show()
# The highway-mpg plot shows that price is negatively correlated with highway-mpg,
# since the regression slope is negative.
print(df[["peak-rpm", "highway-mpg", "price"]].corr())
# A good way to visualize the variance of the data is to use a residual plot.
width=13
height=11
plt.figure(figsize=(width,height))
sns.residplot(x=df["highway-mpg"], y=df["price"])  # recent seaborn requires keyword arguments
plt.show()
Y_hat=lm.predict(Z)
plt.figure(figsize=(width,height))
# seaborn's distplot is deprecated/removed in newer versions; kdeplot draws the same density curves
ax1 = sns.kdeplot(df["price"], color='r', label="Actual Values")
sns.kdeplot(Y_hat, color='b', label="Fitted Values", ax=ax1)
plt.title("Actual vs Fitted Values for Price")
plt.xlabel("Price (in dollars)")
plt.ylabel("Proportion of cars")
plt.legend()
plt.show()
plt.close()
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on a smooth grid for plotting
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)
    # Plot the raw data points and the fitted curve
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
x = df['highway-mpg']
y = df['price']
# Fit a cubic (3rd-order) polynomial and wrap it in a callable object
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
PlotPolly(p, x, y, "highway-mpg")
#Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
print(ypipe[0:4])
# Refit the simple model (highway-mpg) and evaluate it in-sample
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
Yhat = lm.predict(X)
print('The output of the first four predicted values is: ', Yhat[0:4])
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)