Saturday, April 13, 2019

Data Analysis with Python (Part 2)


Exploratory Data Analysis


1. Exploratory Data Analysis

- Summarize the main characteristics of the data
- Gain a better understanding of the data set
- Uncover relationships between variables
- Extract important variables

2. Descriptive Statistics

Describe basic features of the data.
Give short summaries about the sample and the measures of the data.

df.describe()

Summarize categorical data by using the value_counts() method:

drive_wheels_counts=df["drive-wheels].vvalue_counts()
drive_wheels_counts.rename(columns={'drive-wheels':'value_counts' inplace=True)

drive_wheels_counts.index.name='drive-wheels'
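To check the result, print the frame (the same steps appear in the full script below):

print(drive_wheels_counts)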

3. GroupBy in Python

Use the Pandas dataframe.groupby() method:
- Can be applied to categorical variables
- Groups data into categories
- Single or multiple variables

df_test=df[['drive-wheels','body-style','price']]

df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
df_grp


We can transform this table into a pivot table by using the pivot() method.

df_pivot=df_grp.pivot(index='drive-wheels',columns='body-style')
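The pivot may contain missing cells for drive-wheels/body-style combinations with no data; as in the full script below, these can be filled with 0:

df_pivot = df_pivot.fillna(0)  # fill missing values with 0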

Heatmap

Plot the target variable over multiple variables:
plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()
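The heatmap above has unlabeled numeric ticks. A minimal sketch for labeling the axes with the body styles and drive wheels, assuming df_pivot is the pivot table built above:

import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(df_pivot, cmap='RdBu')

# label the columns (body styles) and the rows (drive wheels)
ax.set_xticks(np.arange(df_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(df_pivot.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(df_pivot.columns.levels[1], minor=False)
ax.set_yticklabels(df_pivot.index, minor=False)

fig.colorbar(im)
plt.show()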


4. Analysis of Variance (ANOVA)


- Statistical comparison of groups
Example: the average price of different vehicle makes

ANOVA stands for Analysis of Variance and is a statistical test.

ANOVA can be used to find the correlation between different groups of a categorical variable.

Obtained from ANOVA:
F-test score: the variation between the sample group means divided by the variation within the sample groups.

p-value: degree of confidence

A small F-value implies a poor correlation between the variable categories and the target variable.

A large F-value implies a strong correlation between the variable categories and the target variable.

ANOVA between Honda and Subaru

df_anova=df[["make","price"]]
grouped_anova=df_anova.groupby(["make"])


anova_results=stats.f_oneway(grouped_anova.get_group("honds")["price"],grouped_anova.get_group("subaru")["price"])
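f_oneway returns the F-test score and the p-value; printing them mirrors what the full script below does for the drive-wheels groups:

f_val, p_val = anova_results
print("ANOVA results: F=", f_val, ", P =", p_val)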



5. Correlation (Positive Linear Relationship)

Measures to what extent different variables are interdependent

Example: Rain -> Umbrella

Correlation between two features (engine-size and price):

sns.regplot(x="engine-size",y="prices",data=df)
plt.ylim(0,)

Negative linear relationship
Weak correlation
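A negative linear relationship can be visualized the same way, giving a downward-sloping fit line; the full script below uses highway-mpg against price as the example:

sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
print(df[['highway-mpg', 'price']].corr())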

6. Correlation Statistics


Pearson Correlation
Measures the strength of the correlation between two features.
- Correlation coefficient
- P-value

Correlation coefficient

Close to +1: large positive relationship
Close to -1: large negative relationship
Close to 0: no relationship

P-value

P-value < 0.001: strong certainty in the result
P-value < 0.05: moderate certainty in the result
P-value < 0.1: weak certainty in the result
P-value > 0.1: no certainty in the result


Strong correlation:
Correlation coefficient close to +1 or -1
P-value less than 0.001

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

Pearson correlation: 0.81
P-value: 9.35e-48
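As in the full script below, both values can be reported in one print statement:

print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)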


import pandas as pd
import numpy as np

path='C:/Users/thakudev/PYTHON/automobileEDA.csv'
df = pd.read_csv(path)
#print(df.head())
# pip install seaborn

import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline is a Jupyter magic; it is not needed when running this as a plain script

#print(df.dtypes)

#calculate the correlation between variables of type "int64" or "float64" using the method "corr":
#print(df.corr())
#print(df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr())

# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
#print(plt.show())


#correlation between 'engine-size' and 'price'
print(df[["engine-size", "price"]].corr())



print(sns.regplot(x="highway-mpg", y="price", data=df))
#print(plt.show())

print(df[['highway-mpg', 'price']].corr())


sns.boxplot(x="body-style", y="price", data=df)
#print(plt.show())


#Descriptive Statistical Analysis
#default setting of "describe" skips variables of type object
print(df.describe())
print(df.describe(include=['object']))

#convert the series to a Dataframe as follows
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
print(drive_wheels_counts)
drive_wheels_counts.index.name = 'drive-wheels'
print(drive_wheels_counts)

#Basics of Grouping
print(df["drive-wheels"].unique())
df_group_one = df[['drive-wheels','body-style','price']]
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
print(df_group_one)

df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
print(grouped_pivot)

grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
print(grouped_pivot)

#Pearson Correlation
print(df.corr())
from scipy import stats
pearson_coef,p_value=stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

#ANOVA
grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
print(grouped_test2.head(2))
print(df_gptest)

print(grouped_test2.get_group('4wd')['price'])
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price']) 

print( "ANOVA results: F=", f_val, ", P =", p_val)







