Saturday, April 13, 2019

Data Analysis with Python (Part 2)


Exploratory Data Analysis


1. Exploratory Data Analysis

- Summarize the main characteristics of the data
- Gain a better understanding of the data set
- Uncover relationships between variables
- Extract important variables

2. Descriptive Statistics

Describe basic features of the data.
Give short summaries about the sample and the measures of the data.

df.describe()

Summarize categorical data by using the value_counts() method:

drive_wheels_counts=df["drive-wheels].vvalue_counts()
drive_wheels_counts.rename(columns={'drive-wheels':'value_counts' inplace=True)

drive_wheels_counts.index.name='drive-wheels'
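To check the result, print the frame (the same steps appear in the full script below):

print(drive_wheels_counts)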

3. GroupBy in Python

Use the Pandas dataframe.groupby() method:
- Can be applied to categorical variables
- Groups data into categories
- Single or multiple variables

df_test=df[['drive-wheels','body-style','price']]

df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
df_grp


We can transform this table into a pivot table by using the pivot() method.

df_pivot=df_grp.pivot(index='drive-wheels',columns='body-style')
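The pivot may contain missing cells for drive-wheels/body-style combinations with no data; as in the full script below, these can be filled with 0:

df_pivot = df_pivot.fillna(0)  # fill missing values with 0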

Heatmap

Plot the target variable over multiple variables:
plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()
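The heatmap above has unlabeled numeric ticks. A minimal sketch for labeling the axes with the body styles and drive wheels, assuming df_pivot is the pivot table built above:

import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(df_pivot, cmap='RdBu')

# label the columns (body styles) and the rows (drive wheels)
ax.set_xticks(np.arange(df_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(df_pivot.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(df_pivot.columns.levels[1], minor=False)
ax.set_yticklabels(df_pivot.index, minor=False)

fig.colorbar(im)
plt.show()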


4. Analysis of Variance (ANOVA)


- Statistical comparison of groups
Example: the average price of different vehicle makes

ANOVA stands for Analysis of Variance and is a statistical test.

ANOVA can be used to find the correlation between different groups of a categorical variable.

Obtained from ANOVA:
F-test score: the variation between the sample group means divided by the variation within the sample groups.

p-value: degree of confidence

A small F-value implies a poor correlation between the variable categories and the target variable.

A large F-value implies a strong correlation between the variable categories and the target variable.

ANOVA between Honda and Subaru

df_anova=df[["make","price"]]
grouped_anova=df_anova.groupby(["make"])


anova_results=stats.f_oneway(grouped_anova.get_group("honds")["price"],grouped_anova.get_group("subaru")["price"])
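f_oneway returns the F-test score and the p-value; printing them mirrors what the full script below does for the drive-wheels groups:

f_val, p_val = anova_results
print("ANOVA results: F=", f_val, ", P =", p_val)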



5. Correlation (Positive Linear Relationship)

Measures to what extent different variables are interdependent

Example: Rain -> Umbrella

Correlation between two features (engine-size and price):

sns.regplot(x="engine-size",y="prices",data=df)
plt.ylim(0,)

Negative linear relationship
Weak correlation
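A negative linear relationship can be visualized the same way, giving a downward-sloping fit line; the full script below uses highway-mpg against price as the example:

sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
print(df[['highway-mpg', 'price']].corr())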

6. Correlation Statistics


Pearson Correlation
Measures the strength of the correlation between two features.
- Correlation coefficient
- P-value

Correlation coefficient

Close to +1: large positive relationship
Close to -1: large negative relationship
Close to 0: no relationship

P-value

P-value < 0.001: strong certainty in the result
P-value < 0.05: moderate certainty in the result
P-value < 0.1: weak certainty in the result
P-value > 0.1: no certainty in the result


Strong correlation:
Correlation coefficient close to +1 or -1
P-value less than 0.001

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

Pearson correlation: 0.81
P-value: 9.35e-48
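As in the full script below, both values can be reported in one print statement:

print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)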


import pandas as pd
import numpy as np

path='C:/Users/thakudev/PYTHON/automobileEDA.csv'
df = pd.read_csv(path)
#print(df.head())
# pip install seaborn

import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline is a Jupyter magic; it is not needed when running this as a plain script

#print(df.dtypes)

#calculate the correlation between variables of type "int64" or "float64" using the method "corr":
#print(df.corr())
#print(df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr())

# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
#print(plt.show())


#correlation between 'engine-size' and 'price'
print(df[["engine-size", "price"]].corr())



print(sns.regplot(x="highway-mpg", y="price", data=df))
#print(plt.show())

print(df[['highway-mpg', 'price']].corr())


sns.boxplot(x="body-style", y="price", data=df)
#print(plt.show())


#Descriptive Statistical Analysis
#default setting of "describe" skips variables of type object
print(df.describe())
print(df.describe(include=['object']))

#convert the series to a Dataframe as follows
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
print(drive_wheels_counts)
drive_wheels_counts.index.name = 'drive-wheels'
print(drive_wheels_counts)

#Basics of Grouping
print(df["drive-wheels"].unique())
df_group_one = df[['drive-wheels','body-style','price']]
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
print(df_group_one)

df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
print(grouped_pivot)

grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
print(grouped_pivot)

#Pearson Correlation
print(df.corr())
from scipy import stats
pearson_coef,p_value=stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

#ANOVA
grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
print(grouped_test2.head(2))
print(df_gptest)

print(grouped_test2.get_group('4wd')['price'])
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price']) 

print( "ANOVA results: F=", f_val, ", P =", p_val)







