Friday 15 October 2021

Supervised Learning - Classification (SLC) Hands-On Quiz

An Exploratory Data Analysis on Lower Back Pain

Questions and Answers
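
All of the code snippets below assume a common set of imports and that the data has already been loaded into a pandas DataFrame named dataset. The import list is inferred from the snippets themselves; the read_csv path is only a placeholder because the file name is not given in the post.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from sklearn import metrics

import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# dataset = pd.read_csv(...)  # placeholder: the source file name is not shown in the post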

Question1:

Load the dataset and identify the variables that have a correlation greater than or equal to 0.7 with the ‘pelvic_incidence’ variable.

  • pelvic tilt, pelvic_radius
  • lumbar_lordosis_angle, sacral_slope
  • Direct_tilt, sacrum_angle
  • thoracic_slope, thoracic_slope

Ans: lumbar_lordosis_angle, sacral_slope


# Show only the pairs with correlation >= 0.7
plt.figure(figsize=(10,5))
sns.heatmap(dataset.corr()[dataset.corr() >= 0.7], annot=True, vmax=1, vmin=-1, cmap='Spectral');

Question2:

Encode Status variable: Abnormal class to 1 and Normal to 0.

Split the data into a 70:30 ratio. What is the proportion of the 0 and 1 classes in the test data (y_test)?

  • 1: In a range of 0.1 to 0.2 / 0: In a range of 0.2 to 0.3
  • 1: In a range of 0.5 to 0.6 / 0: In a range of 0.3 to 0.6
  • 1: In a range of 0.6 to 0.7 / 0: In a range of 0.3 to 0.4
  • 1: In a range of 0.7 to 0.8 / 0: In a range of 0.2 to 0.3


Ans: 

1: In a range of 0.7 to 0.8

0: In a range of  0.2 to 0.3

# Encode the target: Abnormal -> 1, Normal -> 0
dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x=='Abnormal' else 0)

X = dataset.drop(['Status'], axis=1)

Y = dataset['Status']

# Splitting data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

# Class proportions in the test set
y_test.value_counts(normalize=True)


1    0.709677
0    0.290323
Name: Status, dtype: float64




Question3:

Which is the most appropriate metric to evaluate the model according to the problem statement?

Accuracy, Recall, Precision, F1 score


Ans: Recall

Predicting that a person's spine is normal when it is actually abnormal means a person who needs treatment will be missed. Hence, reducing such false negatives is what matters here, which makes recall the appropriate metric.
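
As a minimal sketch of why recall is the right lens here (assuming a fitted model and a hypothetical prediction array y_pred on the test set), recall is the share of truly abnormal cases the model actually catches:

from sklearn.metrics import confusion_matrix, recall_score

# recall = TP / (TP + FN): how many of the truly abnormal spines are flagged
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp / (tp + fn))
print(recall_score(y_test, y_pred))  # same value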

Question4:

Check for multicollinearity in the data and choose the variables which show high multicollinearity (VIF value greater than 5).

  • sacrum_angle, pelvic tilt, sacral_slope
  • pelvic_slope, cervical_tilt, sacrum_angle
  • pelvic_incidence, pelvic tilt, sacral_slope
  • pelvic_incidence, pelvic tilt, lumbar_lordosis_angle

Ans: pelvic_incidence, pelvic tilt, sacral_slope


# DataFrame with numerical columns only
num_feature_set = X_train.copy()
num_feature_set = add_constant(num_feature_set)
num_feature_set = num_feature_set.astype(float)

# Calculating VIF for every column (including the constant)
vif_series = pd.Series([variance_inflation_factor(num_feature_set.values, i) for i in range(num_feature_set.shape[1])],
                       index=num_feature_set.columns, dtype=float)
print('Series before feature selection: \n\n{}\n'.format(vif_series))

Question5:

What is the minimum number of attributes we need to drop to remove multicollinearity (i.e., get all VIF values below 5) from the data?

  • 1
  • 2
  • 3
  • 4


Ans: 1
# Dropping pelvic_incidence, one of the variables with high VIF
num_feature_set1 = num_feature_set.drop(['pelvic_incidence'], axis=1)

# Checking VIF values after the drop
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set1.values, i) for i in range(num_feature_set1.shape[1])],
                        index=num_feature_set1.columns, dtype=float)
print('Series after dropping pelvic_incidence: \n\n{}\n'.format(vif_series1))

# Alternatively, checking the effect of dropping pelvic tilt instead
num_feature_set2 = num_feature_set.drop(['pelvic tilt'], axis=1)

# Checking VIF values after the drop
vif_series2 = pd.Series([variance_inflation_factor(num_feature_set2.values, i) for i in range(num_feature_set2.shape[1])],
                        index=num_feature_set2.columns, dtype=float)
print('Series after dropping pelvic tilt: \n\n{}\n'.format(vif_series2))


Question6:

Drop the sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only the significant variables (p-value < 0.05).

How many significant variables are left in the final model excluding the constant?

  • 1
  • 2
  • 3
  • 4

Ans: 2

# Dropping sacral_slope (assumed definition; the original snippet uses num_feature_set3 without showing how it was built)
num_feature_set3 = add_constant(X.drop(['sacral_slope'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(num_feature_set3, Y, test_size=0.30, random_state=1)

# Iteratively dropping the variable with the highest p-value and refitting
X_train2 = X_train.drop(['pelvic_slope'], axis=1)
X_test2 = X_test.drop(['pelvic_slope'], axis=1)
logit = sm.Logit(y_train, X_train2.astype(float))
lg = logit.fit()
print(lg.summary())

X_train3 = X_train2.drop(['scoliosis_slope'], axis=1)
X_test3 = X_test2.drop(['scoliosis_slope'], axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg = logit.fit()
print(lg.summary())

X_train4 = X_train3.drop(['cervical_tilt'], axis=1)
X_test4 = X_test3.drop(['cervical_tilt'], axis=1)
logit = sm.Logit(y_train, X_train4.astype(float))
lg = logit.fit()
print(lg.summary())

X_train5 = X_train4.drop(['Direct_tilt'], axis=1)
X_test5 = X_test4.drop(['Direct_tilt'], axis=1)
logit = sm.Logit(y_train, X_train5.astype(float))
lg = logit.fit()
print(lg.summary())

X_train6 = X_train5.drop(['lumbar_lordosis_angle'], axis=1)
X_test6 = X_test5.drop(['lumbar_lordosis_angle'], axis=1)
logit = sm.Logit(y_train, X_train6.astype(float))
lg = logit.fit()
print(lg.summary())

X_train7 = X_train6.drop(['sacrum_angle'], axis=1)
X_test7 = X_test6.drop(['sacrum_angle'], axis=1)
logit = sm.Logit(y_train, X_train7.astype(float))
lg = logit.fit()
print(lg.summary())

X_train8 = X_train7.drop(['thoracic_slope'], axis=1)
X_test8 = X_test7.drop(['thoracic_slope'], axis=1)
logit = sm.Logit(y_train, X_train8.astype(float))
lg = logit.fit()
print(lg.summary())


Question7:



Select the correct option for the following:

Train a decision tree model with default parameters, vary the depth from 1 to 8 (both values included), and compare the model performance at each value of depth.

  • At depth = 1, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 2, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 5, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 8, the decision tree gives the highest recall among all the models on the training set.

Ans: At depth = 8, the decision tree gives the highest recall (1.0) on the training set.


# Training decision trees of depth 1 to 8 and recording the recall on the training set
score_DT = []
for i in range(1, 9):
    dTree = DecisionTreeClassifier(max_depth=i, criterion='gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth': i, 'Recall': recall_score(y_train, pred)}
    score_DT.append(case)

print(score_DT)

[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]


Question8:

Plot the feature importance of the variables given by the model which gives the maximum value of recall on the training set in Q7. Which are the 2 most important variables respectively?

  • lumbar_lordosis_angle, sacrum_angle
  • degree_spondylolisthesis, pelvic tilt
  • scoliosis_slope, cervical_tilt

Ans: degree_spondylolisthesis, pelvic tilt
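
The post does not show the code for this step. A minimal sketch, assuming the depth-8 tree from Q7 (the model with the highest training recall) and reusing the imports listed at the top; dTree8 is a hypothetical variable name:

# Refit the depth-8 tree identified in Q7
dTree8 = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
dTree8.fit(X_train, y_train)

# Rank and plot the feature importances
importances = pd.Series(dTree8.feature_importances_, index=X_train.columns).sort_values()
importances.plot(kind='barh', figsize=(10, 5))
plt.xlabel('Importance')
plt.show()

The two bars at the top of such a plot correspond to the two most important variables.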



Question9:

Perform hyperparameter tuning for the Decision tree using GridSearchCV.

Use the following list of hyperparameters and their values:

Maximum depth: [5, 10, 15, None], criterion: ['gini', 'entropy'], splitter: ['best', 'random'].

Set cv = 3 and scoring = 'recall' in the grid search.

Which of the following statements is/are True?

A) GridSearchCV selects the max_depth as 10

B) GridSearchCV selects the criterion as 'gini'

C) GridSearchCV selects the splitter as 'random'

D) GridSearchCV selects the splitter as 'best'

E) GridSearchCV selects the max_depth as 5

F) GridSearchCV selects the criterion as 'entropy'

  • A, B, and C
  • B, C, and E
  • A, C, and F
  • D, E, and F

Ans: A, C, and F


# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': [5,10,15,None], 
 'criterion' : ['gini','entropy'],
 'splitter' : ['best','random']
 }

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring='recall',cv=3)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
estimator.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')
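
As a quick usage check against the output above, the selected combination can also be read directly from the fitted grid search object:

# Winning hyperparameter combination found by GridSearchCV
print(grid_obj.best_params_)
# {'criterion': 'entropy', 'max_depth': 10, 'splitter': 'random'}  (matches the estimator shown above)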

Question10:

Compare the model performance of a Decision Tree with default parameters and the tuned Decision tree built in Q9 on the test set.

Which of the following statements is/are True?

  • A) Recall Score of tuned model > Recall Score of decision tree with default parameters
  • B) Recall Score of tuned model < Recall Score of decision tree with default parameters
  • C) F1 Score of tuned model > F1 Score of decision tree with default parameters
  • D) F1 Score of tuned model < F1 Score of decision tree with default parameters

  • A and B
  • B and C
  • C and D
  • A and D


Ans: A and D


# Training decision tree with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred_test1 = model.predict(X_test)

# Tuned model from Q9
estimator.fit(X_train, y_train)
y_pred_test2 = estimator.predict(X_test)

# Checking model performance of Decision Tree with default parameters
print(recall_score(y_test, y_pred_test1))
print(metrics.f1_score(y_test, y_pred_test1))

# Checking model performance of tuned Decision Tree
print(recall_score(y_test, y_pred_test2))
print(metrics.f1_score(y_test, y_pred_test2))



