An Exploratory Data Analysis on Lower Back Pain
Questions and Answers
Question 1:
Load the dataset and identify the variables that have a correlation of 0.7 or greater with the 'pelvic_incidence' variable.
- pelvic tilt, pelvic_radius
- lumbar_lordosis_angle, sacral_slope
- Direct_tilt, sacrum_angle
- thoracic_slope, thoracic_slope
Ans: lumbar_lordosis_angle, sacral_slope
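A minimal sketch of this check with pandas (the file name Dataset_spine.csv is an assumption; adjust it to your copy of the data):
import pandas as pd

# Load the data (file name is an assumption)
dataset = pd.read_csv('Dataset_spine.csv')
# Correlation of every numeric variable with pelvic_incidence
corr_with_pi = dataset.corr(numeric_only=True)['pelvic_incidence']
# Variables with correlation >= 0.7, excluding pelvic_incidence itself
print(corr_with_pi[corr_with_pi >= 0.7].drop('pelvic_incidence'))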
Question 2:
Encode the Status variable: map the Abnormal class to 1 and Normal to 0.
Split the data in a 70:30 ratio. What is the proportion of the 0 and 1 classes in the test data (y_test)?
- 1: In a range of 0.1 to 0.2 / 0: In a range of 0.2 to 0.3
- 1: In a range of 0.5 to 0.6 / 0: In a range of 0.3 to 0.6
- 1: In a range of 0.6 to 0.7 / 0: In a range of 0.3 to 0.4
- 1: In a range of 0.7 to 0.8 / 0: In a range of 0.2 to 0.3
Ans:
1: In a range of 0.7 to 0.8
0: In a range of 0.2 to 0.3
from sklearn.model_selection import train_test_split

# Encode Status: Abnormal -> 1, Normal -> 0
dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x == 'Abnormal' else 0)
X = dataset.drop(['Status'], axis=1)
Y = dataset['Status']
# Split the data into train and test sets (70:30)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
print(y_test.value_counts(normalize=True))
1    0.709677
0    0.290323
Name: Status, dtype: float64
Question 3:
Which metric is the most appropriate for evaluating the model, given the problem statement?
- Accuracy
- Recall
- Precision
- F1 score
Ans: Recall
Predicting that a person does not have an abnormal spine when they actually do means a person who needs treatment will be missed. Hence, reducing such false negatives is important, and recall is the metric that captures this.
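As a quick illustration, recall = TP / (TP + FN), the share of truly abnormal cases the model catches. A sketch with scikit-learn (assuming a fitted classifier named model):
from sklearn.metrics import confusion_matrix, recall_score

# Recall = TP / (TP + FN): the fraction of abnormal spines correctly flagged
pred_test = model.predict(X_test)
print(confusion_matrix(y_test, pred_test))
print(recall_score(y_test, pred_test))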
Question 4:
Check for multicollinearity in the data and identify the variables that show high multicollinearity (VIF value greater than 5).
- sacrum_angle, pelvic tilt, sacral_slope
- pelvic_slope, cervical_tilt, sacrum_angle
- pelvic_incidence, pelvic tilt, sacral_slope
- pelvic_incidence, pelvic tilt, lumbar_lordosis_angle
Ans: pelvic_incidence, pelvic tilt, sacral_slope
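A minimal sketch of the VIF check with statsmodels (assuming X is the predictor DataFrame from Q2):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Compute VIF for each predictor against an intercept-included design
X_const = add_constant(X)
vif = pd.Series([variance_inflation_factor(X_const.values, i)
                 for i in range(X_const.shape[1])], index=X_const.columns)
print(vif[vif > 5])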
Question 5:
What is the minimum number of attributes we need to drop to remove multicollinearity (i.e., to get all VIF values below 5) from the data? One way to check this empirically is sketched after the options below.
- 1
- 2
- 3
- 4
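A sketch of that check, dropping the highest-VIF column one at a time until every VIF is below 5 (reusing the imports and X from the Q4 sketch):
# Repeatedly drop the highest-VIF column until all VIFs are below 5
X_reduced = X.copy()
dropped = []
while True:
    X_const = add_constant(X_reduced)
    vif = pd.Series([variance_inflation_factor(X_const.values, i)
                     for i in range(X_const.shape[1])], index=X_const.columns)
    vif = vif.drop('const')  # ignore the intercept term
    if vif.max() < 5:
        break
    worst = vif.idxmax()
    dropped.append(worst)
    X_reduced = X_reduced.drop(columns=[worst])
print(dropped, len(dropped))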
Question 6:
Drop the sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only the significant ones (p-value < 0.05); a sketch of this workflow follows the options below.
How many significant variables are left in the final model, excluding the constant?
- 1
- 2
- 3
- 4
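A sketch of the backward-elimination workflow with statsmodels (assuming X_train and y_train from Q2):
import statsmodels.api as sm

# Start from all predictors except sacral_slope
X_lr = X_train.drop(columns=['sacral_slope'])
while True:
    logit = sm.Logit(y_train, sm.add_constant(X_lr)).fit(disp=0)
    pvals = logit.pvalues.drop('const')  # ignore the intercept
    if pvals.max() < 0.05:
        break
    # Drop the least significant variable and refit
    X_lr = X_lr.drop(columns=[pvals.idxmax()])
print(logit.summary())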
Question 7:
Select the correct option for the following:
Train a decision tree model with default parameters, vary the depth from 1 to 8 (both values included), and compare the model's performance at each value of depth.
- At depth = 1, the decision tree gives the highest recall among all the models on the training set.
- At depth = 2, the decision tree gives the highest recall among all the models on the training set.
- At depth = 5, the decision tree gives the highest recall among all the models on the training set.
- At depth = 8, the decision tree gives the highest recall among all the models on the training set.
Ans: At depth = 8, the decision tree gives the highest recall among all the models on the training set (recall = 1.0, as shown in the output below).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Train a decision tree at each depth from 1 to 8 and record the training recall
score_DT = []
for i in range(1, 9):
    dTree = DecisionTreeClassifier(max_depth=i, criterion='gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth': i, 'Recall': recall_score(y_train, pred)}
    score_DT.append(case)
print(score_DT)
[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]
Question 8:
Plot the feature importances of the variables given by the model that achieves the maximum recall on the training set in Q7. Which are the two most important variables, respectively?
- lumbar_lordosis_angle, sacrum_angle
- degree_spondylolisthesis, pelvic tilt
- scoliosis_slope, cervical_tilt
Ans: degree_spondylolisthesis, pelvic tilt
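A sketch of the feature-importance plot, refitting the depth-8 tree that gave the highest training recall in Q7:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Refit the best model from Q7 (depth = 8) and plot its feature importances
best_tree = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
best_tree.fit(X_train, y_train)
importances = pd.Series(best_tree.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh')
plt.xlabel('Importance')
plt.show()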
Question 9:
Perform hyperparameter tuning for the decision tree using GridSearchCV.
Use the following list of hyperparameters and their values:
Maximum depth: [5, 10, 15, None]
Criterion: ['gini', 'entropy']
Splitter: ['best', 'random']
Set cv = 3 in the grid search.
Set scoring = 'recall' in the grid search.
Which of the following statements is/are true?
A) GridSearchCV selects the max_depth as 10
B) GridSearchCV selects the criterion as 'gini'
C) GridSearchCV selects the splitter as 'random'
D) GridSearchCV selects the splitter as 'best'
E) GridSearchCV selects the max_depth as 5
F) GridSearchCV selects the criterion as 'entropy'
- A, B, and C
- B, C, and E
- A, C, and F
- D, E, and F
Ans: A, C, and F
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Grid of parameters to choose from
parameters = {'max_depth': [5, 10, 15, None],
              'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random']}
# Base estimator for the grid search
estimator = DecisionTreeClassifier(random_state=1)
# Run the grid search with 3-fold cross-validation, scoring by recall
grid_obj = GridSearchCV(estimator, parameters, scoring='recall', cv=3)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the estimator to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')
Question 10:
Compare the performance of a decision tree with default parameters and the tuned decision tree built in Q9 on the test set.
Which of the following statements is/are true?
- A) Recall Score of tuned model > Recall Score of decision tree with default parameters
- B) Recall Score of tuned model < Recall Score of decision tree with default parameters
- C) F1 Score of tuned model > F1 Score of decision tree with default parameters
- D) F1 Score of tuned model < F1 Score of decision tree with default parameters
- A and B
- B and C
- C and D
- A and D
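A sketch of the comparison on the test set (estimator is the tuned model from Q9; the default-parameter tree is refit here):
from sklearn.metrics import f1_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Decision tree with default parameters
dTree_default = DecisionTreeClassifier(random_state=1)
dTree_default.fit(X_train, y_train)

# Compare recall and F1 on the test set
for name, model in [('default', dTree_default), ('tuned', estimator)]:
    pred = model.predict(X_test)
    print(name, 'recall:', recall_score(y_test, pred), 'F1:', f1_score(y_test, pred))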