Data Analysis with Python (Part 1)
Data Pre-processing
- Identify and handle missing values
- Data Formatting
- Data Normalization (centering/scaling)
- Data binning
- Turning categorical values into numeric variables
1. Identify and handle missing values
Missing values in a dataset appear as "?", "N/A", 0, or just a blank cell.
One option is to remove the data where the missing value occurs.
Dropping the missing values:
- drop the variable
- drop the data entry
Replacing the missing values:
- replace it with an average (of similar data points)
- replace it with the most frequent value
- replace it based on other functions
Replacing data is usually better, since no data is wasted.
One standard replacement technique is to replace missing values with the average value of the entire variable:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace(np.nan, mean, inplace=True)
The general pattern is dataframe.replace(missing_value, new_value).
With the dropna() method, you can choose to drop rows or columns that contain missing values (like NaN).
Specify axis=0 to drop rows or axis=1 to drop columns:
df.dropna()
df.dropna(subset=["price"], axis=0, inplace=True)
To modify the dataframe directly, use the inplace=True parameter.
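A minimal sketch of both strategies on a toy DataFrame (the values here are illustrative, not from the auto dataset):

import numpy as np
import pandas as pd

# toy data with missing entries (illustrative values)
df = pd.DataFrame({"price": [13495, np.nan, 16500],
                   "normalized-losses": [np.nan, 164.0, 164.0]})

# replace missing values with the column average
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# drop any row still missing a price, then reindex
df = df.dropna(subset=["price"], axis=0).reset_index(drop=True)
print(df)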
2. Data Formatting
Data is usually collected from different places, by different people, and may be stored in different formats, units, and conventions.
Data formatting means bringing data into a common standard of expression that allows users to make meaningful comparisons.
Non-formatted data:
- confusing
- hard to aggregate
- hard to compare
Formatted data:
- clearer
- easy to aggregate
- easy to compare
As part of data cleaning, data formatting ensures that data is consistent and easily understandable.
The method dataframe.astype() can be used to convert a column from one datatype to another:
df["price"] = df["price"].astype("int")
3. Data Normalization (centering/scaling)
We may want to normalize variables so that the range of their values is consistent. This normalization can make some statistical analyses easier down the road.
By making the ranges consistent between variables, normalization enables a fairer comparison between the different features. Making sure the features have a similar impact is also important for computational reasons.
There are several ways to normalize data.
i. The first method, called "simple feature scaling", just divides each value by the maximum value for that feature. This makes the new values range between 0 and 1.
x_new = x_old / x_max
ii. The second method, called "Min-Max", subtracts the minimum value of the feature from each value, x_old, then divides by the range of that feature. Again, the resulting new values range between 0 and 1.
x_new = (x_old - x_min) / (x_max - x_min)
iii. The third method is called "z-score" or "standard score". For each value, you subtract mu (the average of the feature) and then divide by the standard deviation (sigma). The resulting values hover around 0 and typically range between -3 and +3, but can be higher or lower.
x_new = (x_old - μ) / σ
Not normalized:
- age and income are in different ranges
- hard to compare
- income will influence the result more
Normalized:
- similar ranges
- similar intrinsic influence on the analytical model
Without normalization, the nature of the data would bias a linear regression model to weigh income more heavily than age.
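A minimal sketch of all three methods on a hypothetical age/income frame:

import pandas as pd

df = pd.DataFrame({"age": [25, 40, 60],
                   "income": [30000, 80000, 150000]})

# simple feature scaling: x / max(x)
df["income_sfs"] = df["income"] / df["income"].max()

# min-max: (x - min(x)) / (max(x) - min(x))
df["income_mm"] = (df["income"] - df["income"].min()) / \
                  (df["income"].max() - df["income"].min())

# z-score: (x - mean(x)) / std(x)
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)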
4. Data binning
Binning: grouping of values into bins
Converts numeric variables into categorical variables
Groups a set of numerical values into a set of bins
Example: price is a feature ranging from 5,000 to 45,000. Binning gives a better representation of price:
bins:
- Low (5000-12000)
- Mid (30000-31000)
- High (39000-45000)
We can use the NumPy function linspace() to return an array of bin edges containing equally spaced numbers over the specified interval of price:
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
5. Turning categorical values into numeric variables
Categorical -> Numeric
Most statistical models cannot take objects/strings as input; they only accept numeric variables.
Add a dummy variable for each unique category.
In pandas, get_dummies() is used to convert categorical variables to dummy variables (0 or 1):
pd.get_dummies(df['fuel'])
get_dummies() automatically generates one indicator column for each unique category of the variable.
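A minimal sketch, assuming a hypothetical fuel column with two categories:

import pandas as pd

df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# one indicator column per category ("diesel" and "gas");
# values are 0/1 (or True/False, depending on the pandas version)
dummies = pd.get_dummies(df["fuel"])
print(dummies)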
There are two methods to detect missing data:
.isnull()
.notnull()
For .isnull(), "True" marks a missing value and "False" marks a present value; .notnull() is the reverse.
Count missing values in each column
"normalized-losses": 41 missing data
"num-of-doors": 2 missing data
"bore": 4 missing data
"stroke" : 4 missing data
"horsepower": 2 missing data
"peak-rpm": 2 missing data
"price": 4 missing data
Replace by mean:
"normalized-losses": 41 missing data, replace them with mean
"stroke": 4 missing data, replace them with mean
"bore": 4 missing data, replace them with mean
"horsepower": 2 missing data, replace them with mean
"peak-rpm": 2 missing data, replace them with mean
Replace by frequency:
"num-of-doors": 2 missing data, replace them with "four".
Reason: 84% of sedans have four doors. Since four doors is the most frequent value, it is the most likely to occur.
Drop the whole row:
"price": 4 missing data, simply delete the whole row
Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row without price data is not useful to us.
The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
filename = "C:/Users/thakudev/PYTHON/auto1.csv"
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
df=pd.read_csv(filename,names=headers)
print(df.head())
# replace "?" with NaN
df.replace("?", np.nan, inplace=True)
print(df.head(5))
# detect missing data
missing_data = df.isnull()
print(missing_data.head(5))

# count missing values in each column
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
# deal with missing data: compute the column mean
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)

# replace NaN with the mean value in the "normalized-losses" column
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)
print(df.head(5))
# replace the missing 'num-of-doors' values with the most frequent value
print(df['num-of-doors'].value_counts())
print(df['num-of-doors'].value_counts().idxmax())
df["num-of-doors"].replace(np.nan, "four", inplace=True)
print(df.head())
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"],axis=0,inplace=True)
# reset index, because we dropped rows
df.reset_index(drop=True, inplace=True)
print(df.head())
# check the data types of all columns and correct them
print(df.dtypes)
df[["bore","stroke"]]=df[["bore","stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
print("----------")
print(df.dtypes)
# Normalization: transform the values of several variables into a similar range
# simple feature scaling: replace (original value) with (original value)/(maximum value)
df['length']=df['length']/df['length'].max()
df['width']=df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
print(df[["length","width","height"]].head())
# Binning data
# convert the data to the correct format first
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)
df["horsepower"] = df["horsepower"].astype(int, copy=True)
# plot a histogram of horsepower to see the distribution
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
print(bins)
group_names = ['Low', 'Medium', 'High']
# the function "cut" determines which bin each value of df['horsepower'] belongs to
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
print(df[['horsepower','horsepower-binned']].head(20))
print(df["horsepower-binned"].value_counts())
# bar chart of the bin counts (index by label so counts line up with group_names)
plt.bar(group_names, df["horsepower-binned"].value_counts()[group_names])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()
# indicator (dummy) variables for fuel type
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
print(dummy_variable_1.head())

# rename the dummy columns for clarity, then merge them into the dataframe
dummy_variable_1.rename(columns={'gas': 'fuel-type-gas', 'diesel': 'fuel-type-diesel'}, inplace=True)
print(dummy_variable_1.head())
df = pd.concat([df, dummy_variable_1], axis=1)

# drop the original categorical column and save the cleaned data
df.drop("fuel-type", axis=1, inplace=True)
print(df.head())
df.to_csv('clean_df.csv')