Data Analysis with Python (Part 1)
Data Pre-processing
- Identify and handle missing values
- Data Formatting
- Data Normalization (centering/scaling)
- Data binning
- Turning categorical values into numeric variables
1. Identify and handle missing values
Missing values in a dataset appear as "?", "N/A", 0, or just a blank cell.
One option is to remove the data where the missing value occurs.
Dropping the missing values:
- drop the variable
- drop the data entry
Replacing the missing values:
- replace it with an average (of similar data points)
- replace it with the most frequent value
- replace it based on other functions
Replacing data is usually better, since no data is wasted.
One standard replacement technique is to replace missing values with the average value of the entire variable:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace(np.nan, mean, inplace=True)
The general pattern is dataframe.replace(missing_value, new_value).
With the dropna() method, you can choose to drop rows or columns that contain missing values (like NaN).
Specify axis=0 to drop rows or axis=1 to drop columns:
df.dropna()
df.dropna(subset=["price"], axis=0, inplace=True)
To modify the dataframe directly, use the inplace=True parameter.
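A minimal sketch of both strategies on a toy DataFrame (the values here are illustrative, not from the auto dataset):

import numpy as np
import pandas as pd

# toy data with missing entries (illustrative values)
df = pd.DataFrame({"price": [13495, np.nan, 16500],
                   "normalized-losses": [np.nan, 164.0, 164.0]})

# replace missing values with the column average
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# drop any row still missing a price, then reindex
df = df.dropna(subset=["price"], axis=0).reset_index(drop=True)
print(df)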
2. Data Formatting
Data is usually collected from different places, by different people, and may be stored in different formats, units, and conventions.
Data formatting means bringing data into a common standard of expression that allows users to make meaningful comparisons.
Non-formatted data:
- confusing
- hard to aggregate
- hard to compare
Formatted data:
- clearer
- easy to aggregate
- easy to compare
As part of data cleaning, data formatting ensures that data is consistent and easily understandable.
The method dataframe.astype() can be used to convert a column from one datatype to another:
df["price"] = df["price"].astype("int")
3. Data Normalization (centering/scaling)
We may want to normalize variables so that the range of their values is consistent. This normalization can make some statistical analyses easier down the road.
By making the ranges consistent between variables, normalization enables a fairer comparison between the different features. Making sure the features have a similar impact is also important for computational reasons.
There are several ways to normalize data.
i. The first method, called "simple feature scaling", just divides each value by the maximum value for that feature. This makes the new values range between 0 and 1.
x_new = x_old / x_max
ii. The second method, called "Min-Max", subtracts the minimum value of the feature from each value, x_old, then divides by the range of that feature. Again, the resulting new values range between 0 and 1.
x_new = (x_old - x_min) / (x_max - x_min)
iii. The third method is called "z-score" or "standard score". For each value, you subtract mu (the average of the feature) and then divide by the standard deviation (sigma). The resulting values hover around 0 and typically range between -3 and +3, but can be higher or lower.
x_new = (x_old - μ) / σ
Not normalized:
- age and income are in different ranges
- hard to compare
- income will influence the result more
Normalized:
- similar ranges
- similar intrinsic influence on the analytical model
Without normalization, the nature of the data would bias a linear regression model to weigh income more heavily than age.
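A minimal sketch of all three methods on a hypothetical age/income frame:

import pandas as pd

df = pd.DataFrame({"age": [25, 40, 60],
                   "income": [30000, 80000, 150000]})

# simple feature scaling: x / max(x)
df["income_sfs"] = df["income"] / df["income"].max()

# min-max: (x - min(x)) / (max(x) - min(x))
df["income_mm"] = (df["income"] - df["income"].min()) / \
                  (df["income"].max() - df["income"].min())

# z-score: (x - mean(x)) / std(x)
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)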
4. Data binning
Binning: grouping of values into bins
Converts numeric variables into categorical variables
Groups a set of numerical values into a set of bins
Example: price is a feature ranging from 5,000 to 45,000. Binning gives a better representation of price:
bins:
- Low (5000-12000)
- Mid (30000-31000)
- High (39000-45000)
We can use the NumPy function linspace() to return an array of bin edges containing equally spaced numbers over the specified interval of price:
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
5. Turning categorical values into numeric variables
Categorical -> Numeric
Most statistical models cannot take objects/strings as input; they only accept numeric variables.
Add a dummy variable for each unique category.
In pandas, get_dummies() is used to convert categorical variables to dummy variables (0 or 1):
pd.get_dummies(df['fuel'])
get_dummies() automatically generates one indicator column for each unique category of the variable.
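A minimal sketch, assuming a hypothetical fuel column with two categories:

import pandas as pd

df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# one indicator column per category ("diesel" and "gas");
# values are 0/1 (or True/False, depending on the pandas version)
dummies = pd.get_dummies(df["fuel"])
print(dummies)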
There are two methods to detect missing data:
.isnull()
.notnull()
For .isnull(), "True" marks a missing value and "False" marks a present value; .notnull() is the reverse.
Count missing values in each column
"normalized-losses": 41 missing data
"num-of-doors": 2 missing data
"bore": 4 missing data
"stroke" : 4 missing data
"horsepower": 2 missing data
"peak-rpm": 2 missing data
"price": 4 missing data
Replace by mean:
"normalized-losses": 41 missing data, replace them with mean
"stroke": 4 missing data, replace them with mean
"bore": 4 missing data, replace them with mean
"horsepower": 2 missing data, replace them with mean
"peak-rpm": 2 missing data, replace them with mean
Replace by frequency:
"num-of-doors": 2 missing data, replace them with "four".
Reason: 84% of sedans have four doors. Since four doors is the most frequent value, it is the most likely to occur.
Drop the whole row:
"price": 4 missing data, simply delete the whole row
Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row without price data is not useful to us.
The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
filename = "C:/Users/thakudev/PYTHON/auto1.csv"
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
df=pd.read_csv(filename,names=headers)
print(df.head())
# replace "?" with NaN
df.replace("?", np.nan, inplace=True)
print(df.head(5))
# detect missing data
missing_data = df.isnull()
print(missing_data.head(5))

# count missing values in each column
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
# deal with missing data: compute the column mean
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)

# replace NaN with the mean value in the "normalized-losses" column
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)
print(df.head(5))
# replace the missing 'num-of-doors' values with the most frequent value
print(df['num-of-doors'].value_counts())
print(df['num-of-doors'].value_counts().idxmax())
df["num-of-doors"].replace(np.nan, "four", inplace=True)
print(df.head())
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"],axis=0,inplace=True)
# reset index, because we dropped rows
df.reset_index(drop=True, inplace=True)
print(df.head())
# check the data types of all columns and correct them
print(df.dtypes)
df[["bore","stroke"]]=df[["bore","stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
print("----------")
print(df.dtypes)
# Normalization: transform the values of several variables into a similar range
# simple feature scaling: replace (original value) with (original value)/(maximum value)
df['length']=df['length']/df['length'].max()
df['width']=df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
print(df[["length","width","height"]].head())
# Binning data
# convert the data to the correct format first
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)
df["horsepower"] = df["horsepower"].astype(int, copy=True)
# plot a histogram of horsepower to see the distribution
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
print(bins)
group_names = ['Low', 'Medium', 'High']
# the function "cut" determines which bin each value of df['horsepower'] belongs to
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
print(df[['horsepower','horsepower-binned']].head(20))
print(df["horsepower-binned"].value_counts())
# bar chart of the bin counts (index by label so counts line up with group_names)
plt.bar(group_names, df["horsepower-binned"].value_counts()[group_names])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()
# indicator (dummy) variables for fuel type
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
print(dummy_variable_1.head())

# rename the dummy columns for clarity, then merge them into the dataframe
dummy_variable_1.rename(columns={'gas': 'fuel-type-gas', 'diesel': 'fuel-type-diesel'}, inplace=True)
print(dummy_variable_1.head())
df = pd.concat([df, dummy_variable_1], axis=1)

# drop the original categorical column and save the cleaned data
df.drop("fuel-type", axis=1, inplace=True)
print(df.head())
df.to_csv('clean_df.csv')