Tuesday, June 25, 2019

SQL Queries for Interview

Q1 SQL: select the total number of interviews per location, split into Hired and Failed

Table 1                Table 2                   
|leadid|Location|      |leadid|leadstatus|       
|---------------|      |-----------------|
|1     |Japan   |      |1     | Hired    |
|2     |China   |      |2     | Failed   |
|3     |Korea   |      |3     | Hired    |
|4     |Japan   |      |4     | Hired    |
|5     |Japan   |      |5     | Hired    |
 
Expected output:

|Location|Interview|Hired|Failed|
|-------------------------------|
|Japan   | 3       |3    |0     |
|Korea   | 1       |1    |0     |
|China   | 1       |0    |1     |
 
SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM table1
LEFT JOIN table2 ON table1.leadid = table2.leadid
GROUP BY Location
ORDER BY Interview DESC;
 
The same query written with an INNER JOIN (here the tables are named loc and status; the status column is leadstatus):

SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM loc
INNER JOIN status ON loc.leadid = status.leadid
GROUP BY Location;
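To sanity-check the query without a database server, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names follow the example above; sqlite3 is only a convenient stand-in for the real database):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (leadid INT, Location TEXT);
CREATE TABLE table2 (leadid INT, leadstatus TEXT);
INSERT INTO table1 VALUES (1,'Japan'),(2,'China'),(3,'Korea'),(4,'Japan'),(5,'Japan');
INSERT INTO table2 VALUES (1,'Hired'),(2,'Failed'),(3,'Hired'),(4,'Hired'),(5,'Hired');
""")

rows = con.execute("""
SELECT Location, COUNT(*) AS Interview,
       SUM(CASE WHEN leadstatus = 'Hired'  THEN 1 ELSE 0 END) AS Hired,
       SUM(CASE WHEN leadstatus = 'Failed' THEN 1 ELSE 0 END) AS Failed
FROM table1
LEFT JOIN table2 ON table1.leadid = table2.leadid
GROUP BY Location
ORDER BY Interview DESC
""").fetchall()

print(rows)  # Japan: 3 interviews, 3 hired; Korea: 1 interview, 1 hired; China: 1 interview, 1 failed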
 
Q2 Second maximum salary

Second maximum salary using a subquery and the NOT IN clause:

mysql> SELECT max(salary) FROM Employee WHERE salary NOT IN 
(SELECT max(salary) FROM Employee);


Second maximum salary using a subquery and the < operator:

mysql> SELECT max(salary) FROM Employee WHERE salary <
(SELECT max(salary) FROM Employee);


Second highest salary using the TOP keyword of the Sybase or SQL Server database:


SELECT TOP 1 salary FROM (SELECT TOP 2 salary FROM Employee ORDER BY salary DESC)
AS emp ORDER BY salary ASC;

Second maximum salary using the LIMIT keyword of the MySQL database:

SELECT salary  FROM (SELECT salary FROM Employee ORDER BY salary DESC LIMIT 2)
AS emp ORDER BY salary LIMIT 1;



Q 3 Write an SQL query to clone a new table from another table.

SELECT * INTO WorkerClone FROM Worker; (SQL Server: with data)
SELECT * INTO WorkerClone FROM Worker WHERE 1 = 0; (SQL Server: structure only, no data)
CREATE TABLE WorkerClone LIKE Worker; (MySQL: structure only, no data)
 
 
 
 
 
    

Thursday, May 2, 2019

Hadoop HDFS Commands

ls
Lists the files and subdirectories under the given HDFS path (here, the root directory).
hadoop fs -ls /

cd  C:\hadoop-2.8.0\data\namenode
hadoop fs -mkdir /test1
hadoop fs -ls /
hadoop fs -mkdir -p /test1/Hadoop
hadoop fs -mkdir  /test1/Hadoop/tweeter_data

copyFromLocal
Copies a file from the local filesystem to an HDFS location.
hadoop fs -copyFromLocal Sample1.txt /test1/Hadoop/tweeter_data

put
This Hadoop command uploads a single file or multiple source files from the local file system to the Hadoop distributed file system (HDFS).
hadoop fs -put Sample2.txt /test1/Hadoop/tweeter_data

moveFromLocal
This Hadoop command works like the put command, but the source file is deleted after copying.
hadoop fs -moveFromLocal Sample3.txt /test1/Hadoop/tweeter_data

du
Displays the disk usage for all the files available under a given directory.
hadoop fs -du /test1/Hadoop/

df
Displays the capacity, used, and available space of the Hadoop distributed file system.
hadoop fs -df

expunge
This HDFS command empties the trash, permanently deleting the files and directories it contains.
hadoop fs -expunge

cat
This is similar to the cat command in Unix and displays the contents of a file.
hadoop fs -cat /test1/Hadoop/tweeter_data/Sample3.txt

cp
Copies files from one HDFS location to another HDFS location.
hadoop fs -cp /test1/Hadoop/tweeter_data/ /test2/

mv
Moves files from one HDFS location to another HDFS location.
hadoop fs -mv /test2/ /test12/

rm
Removes the file or directory at the given HDFS location (-r removes directories recursively).
hadoop fs -rm -r /test12/

tail
This hadoop command will show the last kilobyte of the file to stdout.
hadoop fs -tail /test1/Hadoop/tweeter_data/Sample3.txt

touchz
This command can be used to create a file of zero bytes size in HDFS filesystem.
hadoop fs -touchz /test1/URI.txt

count
Counts the number of directories, files, and bytes under the given path.
hadoop fs -count /test1/Hadoop/
 

Tuesday, April 23, 2019

Data Analysis with Python (Part 4)
 
Model Evaluation

An important step in testing your model is to split your data into training and testing data.
Sometimes you do not have sufficient data to set aside for testing; in that case you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.

You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds; each fold is used in turn as test data to obtain predictions, while the remaining folds are used as training data.
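As a quick illustration, here is a minimal, self-contained sketch on synthetic data (not the car dataset used below): cross_val_score returns one R^2 per fold, while cross_val_predict returns an out-of-fold prediction for every sample.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 1))
y = 2 * X.ravel() + rng.normal(scale=0.3, size=40)

lr = LinearRegression()
print(cross_val_score(lr, X, y, cv=4))         # one R^2 score per fold
print(cross_val_predict(lr, X, y, cv=4)[0:5])  # out-of-fold predictions for the first 5 samples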


Overfitting, Underfitting and Model Selection

It turns out that the test data, sometimes referred to as out-of-sample data, is a much better measure of how well your model performs in the real world. One reason for this is overfitting.

Overfitting

Overfitting occurs when the model fits the noise rather than the underlying process. Therefore, when tested on the test set, the model does not perform well, because it has modelled the noise instead of the underlying process that generated the relationship.

The lower the R^2, the worse the model; a negative R^2 on the test set is a sign of overfitting.
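Here is a quick, self-contained sketch (synthetic data) of what this looks like: an over-fit high-degree polynomial scores close to 1 on the training data but can score negative on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = x.ravel()**2 + rng.normal(scale=3.0, size=30)   # noisy quadratic relationship

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.4, random_state=0)
pr = PolynomialFeatures(degree=15)
model = LinearRegression().fit(pr.fit_transform(x_tr), y_tr)

print("R^2 (train):", model.score(pr.transform(x_tr), y_tr))  # close to 1
print("R^2 (test):", model.score(pr.transform(x_te), y_te))   # typically negative here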

Ridge regression

In Ridge Regression we will see how the parameter alpha changes the model.
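Before the full example below, here is a minimal, self-contained sketch (synthetic data) of how the test R^2 of a Ridge model changes as alpha is varied:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
for alpha in [0.001, 0.1, 1, 10, 100, 1000]:
    score = Ridge(alpha=alpha).fit(X_tr, y_tr).score(X_te, y_te)
    print("alpha =", alpha, " test R^2 =", round(score, 3))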

Grid Search

The term alpha is a hyperparameter; sklearn has the class GridSearchCV to make the process of finding the best hyperparameter simpler.
 
 
 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline  (uncomment when running inside a Jupyter notebook)


#load data and store in dataframe df:

path="C:/Users/thakudev/PYTHON/Data/module_5_auto.csv"
df=pd.read_csv(path)
#print(df.head())


df.to_csv("module_5_auto.csv")
df=df._get_numeric_data()
print(df.head())

#Libraries for plotting
from IPython.display import display
from ipywidgets import widgets, interact, interactive, fixed, interact_manual

#Functions for plotting
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)
    ax2 = sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')

    plt.show()
    plt.close()
   
def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
   
   
    # xtrain, xtest, y_train, y_test: training and testing data
    # lr: trained linear regression object
    # poly_transform: fitted polynomial transformation object

    xmax=max([xtrain.values.max(), xtest.values.max()])

    xmin=min([xtrain.values.min(), xtest.values.min()])

    x=np.arange(xmin, xmax, 0.1)


    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
    plt.show()

#An important step in testing your model is to split your data into training and testing data.
y_data=df['price']
x_data=df.drop('price',axis=1)

#we randomly split our data into training and testing data using the function

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x_data,y_data,test_size=0.15, random_state=1)
print("Number of test samples:",x_test.shape[0])
print("Number of Training ",x_train.shape[0])

from sklearn.linear_model import LinearRegression
lre=LinearRegression()
print("1:",lre.fit(x_train[['horsepower']],y_train))
#cal R^2 on the test data
print("2:",lre.score(x_test[['horsepower']],y_test))
print("3:",lre.score(x_train[['horsepower']],y_train))

#cross_val_score.
from sklearn.model_selection import cross_val_score
Rcross=cross_val_score(lre, x_data[['horsepower']],y_data,cv=4)
print("Rcross",Rcross)
print("The mean of the folds are",Rcross.mean(),"and the standard deviation is",Rcross.std())

#scoring='neg_mean_squared_error' returns negative MSE, so multiply by -1 to get the MSE per fold
print(-1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error'))

from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)
print("cross_val_predict (first 5):",yhat[0:5])



#create Multiple linear regression objects and train the model

lr = LinearRegression()
lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)

#Prediction using training data:
yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
print("yhat_train",yhat_train[0:5])

#Prediction using test data:
yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
print("yhat_test",yhat_test[0:5])


Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)


Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

#Overfitting
from sklearn.preprocessing import PolynomialFeatures
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['horsepower']])
x_test_pr = pr.fit_transform(x_test[['horsepower']])
print(pr)

#create a linear regression model "poly" and train it
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat = poly.predict(x_test_pr)
print("Predicted values:", yhat[0:4])
print("True values:", y_test[0:4].values)

PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)

#R^2 of the training data:
print("R^2 (train):", poly.score(x_train_pr, y_train))

#R^2 of the test data (a negative value indicates overfitting):
print("R^2 (test):", poly.score(x_test_pr, y_test))


#Ridge regression
pr=PolynomialFeatures(degree=2)
x_train_pr=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
x_test_pr=pr.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])

from sklearn.linear_model import Ridge
#create a Ridge regression object, setting the regularization parameter to 0.1
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr, y_train)
print("Ridge R^2 (test):", RidgeModel.score(x_test_pr, y_test))
yhat = RidgeModel.predict(x_test_pr)
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)

#Grid Search
from sklearn.model_selection import GridSearchCV
#note: the 'normalize' option was removed from Ridge in newer scikit-learn releases; scale features separately if needed
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000, 100000],'normalize':[True,False]}]
print(parameters1)
RR=Ridge()

#Create a ridge grid search object
Grid1=GridSearchCV(RR,parameters1,cv=4)
#Fit the model
Grid1.fit(x_data[['horsepower','curb-weight','engine-size','highway-mpg']],y_data)
BestRR=Grid1.best_estimator_
print(BestRR)


# test our model on the test data
print(BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_test))