Friday 15 October 2021

Supervised Learning - Classification (SLC) Hands-On Quiz

An Exploratory Data Analysis on Lower Back Pain

Questions and Answers
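
All of the code snippets below assume a common set of imports and that the data has already been loaded into a pandas DataFrame named dataset. The import list is inferred from the snippets themselves; the read_csv path is only a placeholder because the file name is not given in the post.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from sklearn import metrics

import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# dataset = pd.read_csv(...)  # placeholder: the source file name is not shown in the post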

Question1:

Load the dataset and identify the variables that have a correlation greater than or equal to 0.7 with the ‘pelvic_incidence’ variable.

  • pelvic tilt, pelvic_radius
  • lumbar_lordosis_angle, sacral_slope
  • Direct_tilt, sacrum_angle
  • thoracic_slope, thoracic_slope

Ans: lumbar_lordosis_angle, sacral_slope


# Show only the pairs with correlation >= 0.7
plt.figure(figsize=(10,5))
sns.heatmap(dataset.corr()[dataset.corr() >= 0.7], annot=True, vmax=1, vmin=-1, cmap='Spectral');

Question2:

Encode Status variable: Abnormal class to 1 and Normal to 0.

Split the data into a 70:30 ratio. What is the proportion of the 0 and 1 classes in the test data (y_test)?

  • 1: In a range of 0.1 to 0.2 / 0: In a range of 0.2 to 0.3
  • 1: In a range of 0.5 to 0.6 / 0: In a range of 0.3 to 0.6
  • 1: In a range of 0.6 to 0.7 / 0: In a range of 0.3 to 0.4
  • 1: In a range of 0.7 to 0.8 / 0: In a range of 0.2 to 0.3


Ans: 

1: In a range of 0.7 to 0.8

0: In a range of  0.2 to 0.3

# Encode the target: Abnormal -> 1, Normal -> 0
dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x=='Abnormal' else 0)

X = dataset.drop(['Status'], axis=1)

Y = dataset['Status']

# Splitting data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

# Class proportions in the test set
y_test.value_counts(normalize=True)


1    0.709677
0    0.290323
Name: Status, dtype: float64




Question3:

Which is the most appropriate metric to evaluate the model according to the problem statement?

Accuracy, Recall, Precision, F1 score


Ans: Recall

Predicting that a person's spine is normal when it is actually abnormal means a person who needs treatment will be missed. Hence, reducing such false negatives is what matters here, which makes recall the appropriate metric.
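
As a minimal sketch of why recall is the right lens here (assuming a fitted model and a hypothetical prediction array y_pred on the test set), recall is the share of truly abnormal cases the model actually catches:

from sklearn.metrics import confusion_matrix, recall_score

# recall = TP / (TP + FN): how many of the truly abnormal spines are flagged
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp / (tp + fn))
print(recall_score(y_test, y_pred))  # same value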

Question4:

Check for multicollinearity in the data and choose the variables which show high multicollinearity (VIF value greater than 5).

  • sacrum_angle, pelvic tilt, sacral_slope
  • pelvic_slope, cervical_tilt, sacrum_angle
  • pelvic_incidence, pelvic tilt, sacral_slope
  • pelvic_incidence, pelvic tilt, lumbar_lordosis_angle

Ans: pelvic_incidence, pelvic tilt, sacral_slope


# DataFrame with numerical columns only
num_feature_set = X_train.copy()
num_feature_set = add_constant(num_feature_set)
num_feature_set = num_feature_set.astype(float)

# Calculating VIF for every column (including the constant)
vif_series = pd.Series([variance_inflation_factor(num_feature_set.values, i) for i in range(num_feature_set.shape[1])],
                       index=num_feature_set.columns, dtype=float)
print('Series before feature selection: \n\n{}\n'.format(vif_series))

Question5:

What is the minimum number of attributes we need to drop to remove multicollinearity (i.e., get all VIF values below 5) from the data?

  • 1
  • 2
  • 3
  • 4


Ans: 1
# Dropping pelvic_incidence, one of the variables with high VIF
num_feature_set1 = num_feature_set.drop(['pelvic_incidence'], axis=1)

# Checking VIF values after the drop
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set1.values, i) for i in range(num_feature_set1.shape[1])],
                        index=num_feature_set1.columns, dtype=float)
print('Series after dropping pelvic_incidence: \n\n{}\n'.format(vif_series1))

# Alternatively, checking the effect of dropping pelvic tilt instead
num_feature_set2 = num_feature_set.drop(['pelvic tilt'], axis=1)

# Checking VIF values after the drop
vif_series2 = pd.Series([variance_inflation_factor(num_feature_set2.values, i) for i in range(num_feature_set2.shape[1])],
                        index=num_feature_set2.columns, dtype=float)
print('Series after dropping pelvic tilt: \n\n{}\n'.format(vif_series2))


Question6:

Drop the sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only the significant variables (p-value < 0.05).

How many significant variables are left in the final model excluding the constant?

  • 1
  • 2
  • 3
  • 4

Ans: 2

# Dropping sacral_slope (assumed definition; the original snippet uses num_feature_set3 without showing how it was built)
num_feature_set3 = add_constant(X.drop(['sacral_slope'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(num_feature_set3, Y, test_size=0.30, random_state=1)

# Iteratively dropping the variable with the highest p-value and refitting
X_train2 = X_train.drop(['pelvic_slope'], axis=1)
X_test2 = X_test.drop(['pelvic_slope'], axis=1)
logit = sm.Logit(y_train, X_train2.astype(float))
lg = logit.fit()
print(lg.summary())

X_train3 = X_train2.drop(['scoliosis_slope'], axis=1)
X_test3 = X_test2.drop(['scoliosis_slope'], axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg = logit.fit()
print(lg.summary())

X_train4 = X_train3.drop(['cervical_tilt'], axis=1)
X_test4 = X_test3.drop(['cervical_tilt'], axis=1)
logit = sm.Logit(y_train, X_train4.astype(float))
lg = logit.fit()
print(lg.summary())

X_train5 = X_train4.drop(['Direct_tilt'], axis=1)
X_test5 = X_test4.drop(['Direct_tilt'], axis=1)
logit = sm.Logit(y_train, X_train5.astype(float))
lg = logit.fit()
print(lg.summary())

X_train6 = X_train5.drop(['lumbar_lordosis_angle'], axis=1)
X_test6 = X_test5.drop(['lumbar_lordosis_angle'], axis=1)
logit = sm.Logit(y_train, X_train6.astype(float))
lg = logit.fit()
print(lg.summary())

X_train7 = X_train6.drop(['sacrum_angle'], axis=1)
X_test7 = X_test6.drop(['sacrum_angle'], axis=1)
logit = sm.Logit(y_train, X_train7.astype(float))
lg = logit.fit()
print(lg.summary())

X_train8 = X_train7.drop(['thoracic_slope'], axis=1)
X_test8 = X_test7.drop(['thoracic_slope'], axis=1)
logit = sm.Logit(y_train, X_train8.astype(float))
lg = logit.fit()
print(lg.summary())


Question7:



Select the correct option for the following:

Train a decision tree model with default parameters, vary the depth from 1 to 8 (both values included), and compare the model performance at each value of depth.

  • At depth = 1, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 2, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 5, the decision tree gives the highest recall among all the models on the training set.
  • At depth = 8, the decision tree gives the highest recall among all the models on the training set.

Ans: At depth = 8, the decision tree gives the highest recall (1.0) on the training set.


# Training decision trees of depth 1 to 8 and recording the recall on the training set
score_DT = []
for i in range(1, 9):
    dTree = DecisionTreeClassifier(max_depth=i, criterion='gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth': i, 'Recall': recall_score(y_train, pred)}
    score_DT.append(case)

print(score_DT)

[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]


Question8:

Plot the feature importance of the variables given by the model which gives the maximum value of recall on the training set in Q7. Which are the 2 most important variables respectively?

  • lumbar_lordosis_angle, sacrum_angle
  • degree_spondylolisthesis, pelvic tilt
  • scoliosis_slope, cervical_tilt

Ans: degree_spondylolisthesis, pelvic tilt
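
The post does not show the code for this step. A minimal sketch, assuming the depth-8 tree from Q7 (the model with the highest training recall) and reusing the imports listed at the top; dTree8 is a hypothetical variable name:

# Refit the depth-8 tree identified in Q7
dTree8 = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
dTree8.fit(X_train, y_train)

# Rank and plot the feature importances
importances = pd.Series(dTree8.feature_importances_, index=X_train.columns).sort_values()
importances.plot(kind='barh', figsize=(10, 5))
plt.xlabel('Importance')
plt.show()

The two bars at the top of such a plot correspond to the two most important variables.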



Question9:

Perform hyperparameter tuning for the Decision tree using GridSearchCV.

Use the following list of hyperparameters and their values:

Maximum depth: [5, 10, 15, None], criterion: ['gini', 'entropy'], splitter: ['best', 'random'].

Set cv = 3 and scoring = 'recall' in the grid search.

Which of the following statements is/are True?

A) GridSearchCV selects the max_depth as 10

B) GridSearchCV selects the criterion as 'gini'

C) GridSearchCV selects the splitter as 'random'

D) GridSearchCV selects the splitter as 'best'

E) GridSearchCV selects the max_depth as 5

F) GridSearchCV selects the criterion as 'entropy'

  • A, B, and C
  • B, C, and E
  • A, C, and F
  • D, E, and F

Ans: A, C, and F


# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': [5,10,15,None], 
 'criterion' : ['gini','entropy'],
 'splitter' : ['best','random']
 }

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring='recall',cv=3)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
estimator.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')
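
As a quick usage check against the output above, the selected combination can also be read directly from the fitted grid search object:

# Winning hyperparameter combination found by GridSearchCV
print(grid_obj.best_params_)
# {'criterion': 'entropy', 'max_depth': 10, 'splitter': 'random'}  (matches the estimator shown above)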

Question10:

Compare the model performance of a Decision Tree with default parameters and the tuned Decision tree built in Q9 on the test set.

Which of the following statements is/are True?

  • A) Recall Score of tuned model > Recall Score of decision tree with default parameters
  • B) Recall Score of tuned model < Recall Score of decision tree with default parameters
  • C) F1 Score of tuned model > F1 Score of decision tree with default parameters
  • D) F1 Score of tuned model < F1 Score of decision tree with default parameters

  • A and B
  • B and C
  • C and D
  • A and D


Ans: A and D


# Training decision tree with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred_test1 = model.predict(X_test)

# Tuned model from Q9
estimator.fit(X_train, y_train)
y_pred_test2 = estimator.predict(X_test)

# Checking model performance of Decision Tree with default parameters
print(recall_score(y_test, y_pred_test1))
print(metrics.f1_score(y_test, y_pred_test1))

# Checking model performance of tuned Decision Tree
print(recall_score(y_test, y_pred_test2))
print(metrics.f1_score(y_test, y_pred_test2))



