Data Analysis with Python (Part 3)
Model Development
In Data Analytics, we often use Model Development to help us predict future observations from the data we have.
A model helps us understand the exact relationship between different variables and how these variables are used to predict the result.
1) Linear Regression and Multiple Linear Regression
Simple Linear Regression is a method to help us understand the relationship between two variables:
The predictor/independent variable (X)
The response/dependent variable (that we want to predict) (Y)
Y: Response Variable
X: Predictor Variable
Linear function: Yhat = a + b*X
- a refers to the intercept of the regression line; in other words, the value of Y when X is 0.
- b refers to the slope of the regression line; in other words, the amount by which Y changes when X increases by 1 unit.
When we fit the model, we should get a final linear model with the structure:
Yhat = a + b*X
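As a minimal sketch (assuming a pandas DataFrame df with the 'highway-mpg' and 'price' columns used in the full script at the end of this post), fitting this model with scikit-learn looks like:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(df[['highway-mpg']], df['price'])   # X must be 2-D, hence the double brackets
a, b = lm.intercept_, lm.coef_[0]          # intercept a and slope b of Yhat = a + b*X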
Multiple Linear Regression
If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.
Yhat = a + b1*X1 + b2*X2 + b3*X3 + b4*X4
Y: Response Variable
X1: Predictor Variable 1
X2: Predictor Variable 2
X3: Predictor Variable 3
X4: Predictor Variable 4
a: intercept
b1: coefficient of Variable 1
b2: coefficient of Variable 2
b3: coefficient of Variable 3
b4: coefficient of Variable 4
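A minimal sketch of the same idea with four predictors of car price (the same columns the full script below uses):

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
print(lm.intercept_)   # a
print(lm.coef_)        # [b1, b2, b3, b4]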
2) Model Evaluation using Visualization
Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.
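For example (assuming df is loaded as above), seaborn draws the scatter of data points, the fitted line, and a confidence band in one call:

import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
plt.show()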
Residual Plot
A good way to visualize the variance of the data is to use a residual plot.
The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
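Continuing the sketch above: if the residuals are randomly scattered around zero, a linear model is appropriate; curvature in the spread suggests a non-linear model may fit better.

sns.residplot(x=df["highway-mpg"], y=df["price"])
plt.show()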
Multiple Linear Regression
One way to look at the fit of the model is by looking at the distribution plot: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.
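A minimal sketch, reusing the multiple-regression fit lm and predictors Z from above, and using kdeplot (the replacement for seaborn's removed distplot) to overlay the two densities:

Y_hat = lm.predict(Z)
ax1 = sns.kdeplot(df['price'], color='r', label='Actual Values')
sns.kdeplot(Y_hat, color='b', label='Fitted Values', ax=ax1)
plt.legend()
plt.show()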
3) Polynomial Regression and Pipelines
Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.
We capture non-linear relationships by adding squared or higher-order terms of the predictor variables.
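For example, a cubic (3rd-order) fit of price on highway-mpg with NumPy (assuming numpy is imported as np, as in the full script below):

f = np.polyfit(df['highway-mpg'], df['price'], 3)   # cubic coefficients
p = np.poly1d(f)                                    # callable polynomial p(x)
print(p)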
Pipeline
Data pipelines simplify the steps of processing the data. We use scikit-learn's Pipeline class to chain the processing steps, with StandardScaler as one step in our pipeline.
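A minimal sketch of such a pipeline (scale, expand to polynomial features, then fit a linear model), reusing Z from the multiple-regression sketch above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
pipe = Pipeline([('scale', StandardScaler()),
                 ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
                 ('model', LinearRegression())])
pipe.fit(Z, df['price'])
print(pipe.predict(Z)[0:4])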
4) Measures for In-Sample Evaluation
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
R^2 / R-squared
R-squared, also known as the coefficient of determination, is a measure of how close the data is to the fitted regression line.
The value of R-squared is the percentage of variation in the response variable (y) that is explained by the linear model. The higher the R-squared, the more closely the fitted line tracks the data points.
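In scikit-learn, R-squared comes directly from the model's score method, for example:

X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
print('R-square:', lm.score(X, Y))   # fraction of variation explained; closer to 1 is better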
Mean Squared Error (MSE)
Mean Squared Error measures the average of the squares of the errors, that is, the squared differences between the actual value (y) and the estimated value (ŷ).
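A one-liner with scikit-learn, reusing the simple model fitted above:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df['price'], lm.predict(X))   # average squared residual
print('MSE:', mse)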
5) Prediction and Decision Making
When comparing models:
- The model with the higher R-squared value is a better fit for the data.
- The model with the smaller MSE value is a better fit for the data.
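A rough sketch of that comparison, reusing X, Y, and Z from the sketches above (the variable names here are hypothetical; fit returns the estimator itself, so score can be chained):

r2_slr = LinearRegression().fit(X, Y).score(X, Y)   # simple model: highway-mpg only
r2_mlr = LinearRegression().fit(Z, Y).score(Z, Y)   # multiple model: four predictors
print(r2_slr, r2_mlr)   # prefer the model with the higher R^2 (and the lower MSE)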
The full worked script below puts all of these steps together on the automobile dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#load data and store in dataframe df:
path="C:/Users/thakudev/PYTHON/Data/automobileEDA.csv"
df=pd.read_csv(path)
print(df.head())
#load the modules for linear regression
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
# Can highway-mpg help us predict car price?
# "highway-mpg" is the predictor variable and "price" is the response variable.
X=df[['highway-mpg']]
Y=df['price']
#Fit the linear model using highway-mpg
lm.fit(X,Y)
#We can output a prediction
Yhat=lm.predict(X)
print(Yhat[0:5])
#value of the intercept (a)
print(lm.intercept_)
#value of the Slope (b)
print(lm.coef_)
#Multiple Linear Regression
Z=df[['horsepower','curb-weight','engine-size','highway-mpg']]
lm.fit(Z,df['price'])
print("Z intercept",lm.intercept_)
print("Z coe",lm.coef_)
#Visualization
import seaborn as sns
# %matplotlib inline  (Jupyter magic: only needed to render plots inline in a notebook)
width=12
height=10
plt.figure(figsize=(width,height))
#sns.regplot(x="highway-mpg",y="price",data=df)
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
plt.show()
# The highway-mpg plot shows that price is negatively correlated with highway-mpg,
# since the regression slope is negative.
print(df[["peak-rpm", "highway-mpg", "price"]].corr())
# A good way to visualize the variance of the data is to use a residual plot.
width=13
height=11
plt.figure(figsize=(width,height))
sns.residplot(x=df["highway-mpg"], y=df["price"])  # recent seaborn requires keyword arguments
plt.show()
Y_hat=lm.predict(Z)
plt.figure(figsize=(width,height))
# seaborn's distplot is deprecated/removed in newer versions; kdeplot draws the same density curves
ax1 = sns.kdeplot(df["price"], color='r', label="Actual Values")
sns.kdeplot(Y_hat, color='b', label="Fitted Values", ax=ax1)
plt.title("Actual vs Fitted Values for Price")
plt.xlabel("Price (in dollars)")
plt.ylabel("Proportion of cars")
plt.legend()
plt.show()
plt.close()
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on a smooth grid for plotting
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)
    # Plot the raw data points and the fitted curve
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
x = df['highway-mpg']
y = df['price']
# Fit a cubic (3rd-order) polynomial and wrap it in a callable object
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
PlotPolly(p, x, y, "highway-mpg")
#Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
print(ypipe[0:4])
# Refit the simple model (highway-mpg) and evaluate it in-sample
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
Yhat = lm.predict(X)
print('The output of the first four predicted values is: ', Yhat[0:4])
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)