Tuesday, June 25, 2019

SQL Queries for Interview

Q1 SQL: select the total number of interviews per location, split into Hired and Failed

Table 1                Table 2                   
|leadid|Location|      |leadid|leadstatus|       
|---------------|      |-----------------|
|1     |Japan   |      |1     | Hired    |
|2     |China   |      |2     | Failed   |
|3     |Korea   |      |3     | Hired    |
|4     |Japan   |      |4     | Hired    |
|5     |Japan   |      |5     | Hired    |
 
Expected output:

|Location|Interview|Hired|Failed|
|-------------------------------|
|Japan   | 3       |3    |0     |
|Korea   | 1       |1    |0     |
|China   | 1       |0    |1     |
 
SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM table1
LEFT JOIN table2 ON table1.leadid = table2.leadid
GROUP BY Location
ORDER BY Interview DESC;
 
The same query written with an INNER JOIN (here the tables are named loc and status; the status column is leadstatus):

SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM loc
INNER JOIN status ON loc.leadid = status.leadid
GROUP BY Location;
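To sanity-check the query without a database server, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names follow the example above; sqlite3 is only a convenient stand-in for the real database):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (leadid INT, Location TEXT);
CREATE TABLE table2 (leadid INT, leadstatus TEXT);
INSERT INTO table1 VALUES (1,'Japan'),(2,'China'),(3,'Korea'),(4,'Japan'),(5,'Japan');
INSERT INTO table2 VALUES (1,'Hired'),(2,'Failed'),(3,'Hired'),(4,'Hired'),(5,'Hired');
""")

rows = con.execute("""
SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM table1
LEFT JOIN table2 ON table1.leadid = table2.leadid
GROUP BY Location
ORDER BY Interview DESC
""").fetchall()

print(rows)  # Japan: 3 interviews, 3 hired; Korea: 1 interview, 1 hired; China: 1 interview, 1 failed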
 
Q2 Second maximum salary

Second maximum salary using a subquery and the NOT IN clause:

mysql> SELECT max(salary) FROM Employee WHERE salary NOT IN 
(SELECT max(salary) FROM Employee);


Second maximum salary using a subquery and the < operator:

mysql> SELECT max(salary) FROM Employee WHERE salary <
(SELECT max(salary) FROM Employee);


Second highest salary using the TOP keyword of the Sybase or SQL Server database:


SELECT TOP 1 salary FROM (SELECT TOP 2 salary FROM Employee ORDER BY salary DESC)
AS emp ORDER BY salary ASC;

Second maximum salary using the LIMIT keyword of the MySQL database:

SELECT salary  FROM (SELECT salary FROM Employee ORDER BY salary DESC LIMIT 2)
AS emp ORDER BY salary LIMIT 1;



Q 3 Write an SQL query to clone a new table from another table.

SELECT * INTO WorkerClone FROM Worker; (SQL Server: with data)
SELECT * INTO WorkerClone FROM Worker WHERE 1 = 0; (SQL Server: structure only, no data)
CREATE TABLE WorkerClone LIKE Worker; (MySQL: structure only, no data)
 
 
 
 
 
    

Thursday, May 2, 2019

Hadoop HDFS Commands

ls
Lists the files and subdirectories under the given HDFS path (here, the root directory).
hadoop fs -ls /

cd  C:\hadoop-2.8.0\data\namenode
hadoop fs -mkdir /test1
hadoop fs -ls /
hadoop fs -mkdir -p /test1/Hadoop
hadoop fs -mkdir  /test1/Hadoop/tweeter_data

copyFromLocal
Copies a file from the local filesystem to an HDFS location.
hadoop fs -copyFromLocal Sample1.txt /test1/Hadoop/tweeter_data

put
This Hadoop command uploads a single file or multiple source files from the local file system to the Hadoop distributed file system (HDFS).
hadoop fs -put Sample2.txt /test1/Hadoop/tweeter_data

moveFromLocal
This Hadoop command works like the put command, but the source file is deleted after copying.
hadoop fs -moveFromLocal Sample3.txt /test1/Hadoop/tweeter_data

du
Displays the disk usage for all the files available under a given directory.
hadoop fs -du /test1/Hadoop/

df
Displays the capacity, used, and available space of the Hadoop distributed file system.
hadoop fs -df

expunge
This HDFS command empties the trash, permanently deleting the files and directories it contains.
hadoop fs -expunge

cat
This is similar to the cat command in Unix and displays the contents of a file.
hadoop fs -cat /test1/Hadoop/tweeter_data/Sample3.txt

cp
Copies files from one HDFS location to another HDFS location.
hadoop fs -cp /test1/Hadoop/tweeter_data/ /test2/

mv
Moves files from one HDFS location to another HDFS location.
hadoop fs -mv /test2/ /test12/

rm
Removes the file or directory at the given HDFS location (-r removes directories recursively).
hadoop fs -rm -r /test12/

tail
This hadoop command will show the last kilobyte of the file to stdout.
hadoop fs -tail /test1/Hadoop/tweeter_data/Sample3.txt

touchz
This command can be used to create a file of zero bytes size in HDFS filesystem.
hadoop fs -touchz /test1/URI.txt

count
Counts the number of directories, files, and bytes under the given path.
hadoop fs -count /test1/Hadoop/
 

Tuesday, April 23, 2019

Data Analysis with Python (Part 4)
 
Model Evaluation

An important step in testing your model is to split your data into training and testing data.
Sometimes you do not have sufficient data to set aside for testing; in that case you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.

You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds; each fold is used in turn as test data to obtain predictions, while the remaining folds are used as training data.
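As a quick illustration, here is a minimal, self-contained sketch on synthetic data (not the car dataset used below): cross_val_score returns one R^2 per fold, while cross_val_predict returns an out-of-fold prediction for every sample.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 1))
y = 2 * X.ravel() + rng.normal(scale=0.3, size=40)

lr = LinearRegression()
print(cross_val_score(lr, X, y, cv=4))         # one R^2 score per fold
print(cross_val_predict(lr, X, y, cv=4)[0:5])  # out-of-fold predictions for the first 5 samples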


Overfitting, Underfitting and Model Selection

It turns out that the test data, sometimes referred to as out-of-sample data, is a much better measure of how well your model performs in the real world. One reason for this is overfitting.

Overfitting

Overfitting occurs when the model fits the noise rather than the underlying process. Therefore, when tested on the test set, the model does not perform well, because it has modelled the noise instead of the underlying process that generated the relationship.

The lower the R^2, the worse the model; a negative R^2 on the test set is a sign of overfitting.
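Here is a quick, self-contained sketch (synthetic data) of what this looks like: an over-fit high-degree polynomial scores close to 1 on the training data but can score negative on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = x.ravel()**2 + rng.normal(scale=3.0, size=30)   # noisy quadratic relationship

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.4, random_state=0)
pr = PolynomialFeatures(degree=15)
model = LinearRegression().fit(pr.fit_transform(x_tr), y_tr)

print("R^2 (train):", model.score(pr.transform(x_tr), y_tr))  # close to 1
print("R^2 (test):", model.score(pr.transform(x_te), y_te))   # typically negative here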

Ridge regression

In Ridge Regression we will see how the parameter alpha changes the model.
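Before the full example below, here is a minimal, self-contained sketch (synthetic data) of how the test R^2 of a Ridge model changes as alpha is varied:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
for alpha in [0.001, 0.1, 1, 10, 100, 1000]:
    score = Ridge(alpha=alpha).fit(X_tr, y_tr).score(X_te, y_te)
    print("alpha =", alpha, " test R^2 =", round(score, 3))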

Grid Search

The term alpha is a hyperparameter; sklearn has the class GridSearchCV to make the process of finding the best hyperparameter simpler.
 
 
 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline  (uncomment when running inside a Jupyter notebook)


#load data and store in dataframe df:

path="C:/Users/thakudev/PYTHON/Data/module_5_auto.csv"
df=pd.read_csv(path)
#print(df.head())


df.to_csv("module_5_auto.csv")
df=df._get_numeric_data()
print(df.head())

#Libraries for plotting
from IPython.display import display
from ipywidgets import widgets, interact, interactive, fixed, interact_manual

#Functions for plotting
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)
    ax2 = sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')

    plt.show()
    plt.close()
   
def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
   
   
    # xtrain, xtest, y_train, y_test: training and testing data
    # lr: trained linear regression object
    # poly_transform: fitted polynomial transformation object

    xmax=max([xtrain.values.max(), xtest.values.max()])

    xmin=min([xtrain.values.min(), xtest.values.min()])

    x=np.arange(xmin, xmax, 0.1)


    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
    plt.show()

#An important step in testing your model is to split your data into training and testing data.
y_data=df['price']
x_data=df.drop('price',axis=1)

#we randomly split our data into training and testing data using the function

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x_data,y_data,test_size=0.15, random_state=1)
print("Number of test samples:",x_test.shape[0])
print("Number of Training ",x_train.shape[0])

from sklearn.linear_model import LinearRegression
lre=LinearRegression()
print("1:",lre.fit(x_train[['horsepower']],y_train))
#cal R^2 on the test data
print("2:",lre.score(x_test[['horsepower']],y_test))
print("3:",lre.score(x_train[['horsepower']],y_train))

#cross_val_score.
from sklearn.model_selection import cross_val_score
Rcross=cross_val_score(lre, x_data[['horsepower']],y_data,cv=4)
print("Rcross",Rcross)
print("The mean of the folds are",Rcross.mean(),"and the standard deviation is",Rcross.std())

#scoring='neg_mean_squared_error' returns negative MSE, so multiply by -1 to get the MSE per fold
print(-1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error'))

from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)
print("cross_val_predict (first 5):",yhat[0:5])



#create Multiple linear regression objects and train the model

lr = LinearRegression()
lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)

#Prediction using training data:
yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
print("yhat_train",yhat_train[0:5])

#Prediction using test data:
yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
print("yhat_test",yhat_test[0:5])


Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)


Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

#Overfitting
from sklearn.preprocessing import PolynomialFeatures
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['horsepower']])
x_test_pr = pr.fit_transform(x_test[['horsepower']])
print(pr)

#create a linear regression model "poly" and train it
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat = poly.predict(x_test_pr)
print("Predicted values:", yhat[0:4])
print("True values:", y_test[0:4].values)

PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)

#R^2 of the training data:
print("R^2 (train):", poly.score(x_train_pr, y_train))

#R^2 of the test data (a negative value indicates overfitting):
print("R^2 (test):", poly.score(x_test_pr, y_test))


#Ridge regression
pr=PolynomialFeatures(degree=2)
x_train_pr=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
x_test_pr=pr.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])

from sklearn.linear_model import Ridge
#create a Ridge regression object, setting the regularization parameter to 0.1
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr, y_train)
print("Ridge R^2 (test):", RidgeModel.score(x_test_pr, y_test))
yhat = RidgeModel.predict(x_test_pr)
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)

#Grid Search
from sklearn.model_selection import GridSearchCV
#note: the 'normalize' option was removed from Ridge in newer scikit-learn releases; scale features separately if needed
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000, 100000],'normalize':[True,False]}]
print(parameters1)
RR=Ridge()

#Create a ridge grid search object
Grid1=GridSearchCV(RR,parameters1,cv=4)
#Fit the model
Grid1.fit(x_data[['horsepower','curb-weight','engine-size','highway-mpg']],y_data)
BestRR=Grid1.best_estimator_
print(BestRR)


# test our model on the test data
print(BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_test))