
Saturday 16 October 2021

Supervised Learning - Classification/ Quiz - Decision Tree









Q No: 1

What is the final objective of Decision Tree?

  1. Maximise the Gini Index of the leaf nodes
  2. Minimise the homogeneity of the leaf nodes
  3. Maximise the heterogeneity of the leaf nodes
  4. Minimise the impurity of the leaf nodes

Ans: Minimise the impurity of the leaf nodes

In a decision tree, after every split we hope to have less 'impurity' in the resulting nodes, so that we eventually end up with leaf nodes that have the least impurity/entropy.
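To make this concrete, here is a minimal sketch (the class counts below are made up for illustration) showing how Gini impurity drops after a good split:

# Hedged illustration: Gini impurity of a parent node vs. its children after a split
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [50, 50]                 # 50/50 class mix -> maximally impure
left, right = [45, 5], [5, 45]    # a good split largely separates the classes

print(gini(parent))               # 0.5
print(gini(left), gini(right))    # 0.18 each -> impurity has decreased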


Q No: 2

Decision Trees can be used to predict

  1. Continuous Target Variables
  2. Categorical Target Variables
  3. Random Variables
  4. Both Continuous and Categorical Target Variables

Ans: Both Continuous and Categorical Target Variables


Q No: 3

When we create a Decision Tree, how is the best split determined at each node?

  1. We split the data using the first independent variable and so on.
  2. The first split is determined randomly and from then on we start choosing the best split.
  3. We make at most 5 splits on the data using only one independent variable and choose the split that gives the highest Gini gain.
  4. We make all possible splits on the data using the independent variables and choose the split that gives the highest Gini gain.

Ans: We make all possible splits on the data using the independent variables and choose the split that gives the highest Gini gain.
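As a rough sketch of what "all possible splits" means for one numeric feature (the gini helper and the x, y arrays below are illustrative, not from the quiz), the tree enumerates every candidate threshold and keeps the one with the highest Gini gain:

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    best_gain, best_thr = -1.0, None
    parent = gini(y)
    for thr in np.unique(x):                     # every candidate threshold
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gain = parent - child                    # Gini gain of this split
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1, 2, 3, 10, 11, 12])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # (3, 0.5): splitting at 3 separates the classes perfectly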


Q No: 4

Which of the following is not true about Decision Trees

  1. Decision Trees tend to overfit the test data
  2. Decision Trees can be pruned to reduce overfitting
  3. Decision Trees would grow to maximum possible depth to achieve 100% purity in the leaf nodes, this generally leads to overfitting.
  4. Decision Trees can capture complex patterns in the data.

Ans: Decision Trees tend to overfit the test data


Q No: 5

If we increase the value of the hyperparameter min_samples_leaf from the default value, we would end up getting a ______________ tree than the tree with the default value.

  1. smaller
  2. bigger

Ans: smaller

min_samples_leaf = the minimum number of samples required at a leaf node

As the number of observations required in the leaf node increases, the size of the tree would decrease 
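A quick way to see this on a synthetic dataset (so the exact numbers are only illustrative):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=1)

for leaf in (1, 20):   # 1 is sklearn's default for min_samples_leaf
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=1).fit(X_toy, y_toy)
    print(leaf, tree.get_depth(), tree.get_n_leaves())
# The tree with min_samples_leaf=20 ends up shallower, with fewer leaves.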


Q No: 6

Which of the following is a perfectly impure node?








  1. Node - 0
  2. Node - 1
  3. Node - 2
  4. None of these

Ans: Node - 1

Gini = 0.5 at Node 1

Gini = 0 -> perfectly pure

Gini = 0.5 -> perfectly impure


Q No: 7

In a classification setting, if we do not limit the size of the decision tree it will only stop when all the leaves are:

  1. All leaves are at the same depth
  2. of the same size
  3. homogenous
  4. heterogenous

Ans: homogenous

The tree will stop splitting after the impurity in every leaf is zero
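This can be verified directly on a fitted, unrestricted tree (the iris data below is just a stand-in; any dataset without conflicting duplicate rows behaves the same way):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=1).fit(X_iris, y_iris)   # no size limits

is_leaf = full_tree.tree_.children_left == -1
print(full_tree.tree_.impurity[is_leaf])   # all zeros -> every leaf is homogenous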


Q No: 8

Which of the following explains pre-pruning?

  1. Before pruning a decision tree, we need to create the tree. This process of creating the tree before pruning is known as pre-pruning.
  2. Starting with a full-grown tree and creating trees that are sequentially smaller is known as pre-pruning
  3. We stop the decision tree from growing to its full length by bounding the hyper parameters, this is known as pre-pruning.
  4. Building a decision tree on default hyperparameter values is known as pre-pruning.

Ans: We stop the decision tree from growing to its full length by bounding the hyper parameters, this is known as pre-pruning.
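For example, a pre-pruned tree is simply one whose growth is bounded up front through hyperparameters (the specific values below are illustrative):

from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    max_depth=4,            # stop growing after 4 levels
    min_samples_split=20,   # need at least 20 samples to attempt a split
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    random_state=1,
)
# pre_pruned.fit(X_train, y_train) would now build a deliberately limited tree.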


Q No: 9

Which of the following is the same across Classification and Regression Decision Trees?

  1. Type of predicted variable
  2. Impurity Measure/ Splitting Criteria
  3. max_depth parameter

Ans: max_depth parameter
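A small sketch of why: both tree flavours expose the same max_depth parameter, while the type of predicted variable and the default splitting criterion differ:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(max_depth=3)   # categorical target, default criterion 'gini'
reg = DecisionTreeRegressor(max_depth=3)    # continuous target, default criterion 'squared_error' in recent scikit-learn
print(clf.max_depth == reg.max_depth)       # True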


Q No: 10

Select the correct order in which a decision tree is built:

  1. Calculate the Gini impurity after each split
  2. Decide the best split based on the lowest Gini impurity
  3. Repeat the complete process until the stopping criterion is reached or the tree has achieved homogeneity in leaves.
  4. Select an attribute of data and make all possible splits in data
  5. Repeat the steps for every attribute present in the data

  • 4,1,3,2,5
  • 4,1,5,2,3
  • 4,1,3,2,5
  • 4,1,5,3,2

Ans: 4,1,5,2,3

Friday 15 October 2021

EDA : Lower Back Pain

 Exploratory Data Analysis on Lower Back Pain

















Lower Back Pain

Lower back pain, also called lumbago, is not a disorder. It’s a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:

  • ligaments
  • muscles
  • nerves
  • the bony structures that make up the spine, called vertebral bodies or vertebrae

It can also be due to a problem with nearby organs, such as the kidneys.

According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you’re experiencing back pain.

In this Exploratory Data Analysis (EDA) I am going to use the Lower Back Pain Symptoms Dataset and try to find interesting insights in this dataset.


If importing xgboost throws a ModuleNotFoundError traceback, install it first:

#pip install xgboost


# Imports used throughout this notebook
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Change the working directory to the folder that holds the CSV file
os.getcwd()
os.chdir('C:\\Users\\kt.rinith\\Google Drive\\Training\\PGP-DSBA\\Jupiter Files')

dataset = pd.read_csv("backpain.csv")
dataset.head()  # this will return the top 5 rows






# This command will remove the last column from our dataset.
#del dataset["Unnamed: 13"]
dataset.describe()





dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pelvic_incidence          310 non-null    float64
 1   pelvic tilt               310 non-null    float64
 2   lumbar_lordosis_angle     310 non-null    float64
 3   sacral_slope              310 non-null    float64
 4   pelvic_radius             310 non-null    float64
 5   degree_spondylolisthesis  310 non-null    float64
 6   pelvic_slope              310 non-null    float64
 7   Direct_tilt               310 non-null    float64
 8   thoracic_slope            310 non-null    float64
 9   cervical_tilt             310 non-null    float64
 10  sacrum_angle              310 non-null    float64
 11  scoliosis_slope           310 non-null    float64
 12  Status                    310 non-null    object 
dtypes: float64(12), object(1)
memory usage: 31.6+ KB

dataset["Status"].value_counts().sort_index().plot.bar()


dataset.corr()




plt.subplots(figsize=(12,8))
sns.heatmap(dataset.corr())




sns.pairplot(dataset, hue="Status")



Visualize Features with Histogram: A Histogram is the most commonly used graph to show frequency distributions.

dataset.hist(figsize=(15,12),bins = 20, color="#007959AA")
plt.title("Features Distribution")
plt.show()




Detecting and Removing Outliers

plt.subplots(figsize=(15,6))
dataset.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)


Remove Outliers:

# We use the Tukey method to remove outliers:
# whiskers are set at 1.5 times the Interquartile Range (IQR).
def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR   # the acceptable minimum value
    maximum = third_q + IQR   # the acceptable maximum value

    mean = X[feature].mean()
    # Any value beyond the acceptable range is considered an outlier;
    # we replace the outliers with the mean value of that feature.
    X.loc[X[feature] < minimum, feature] = mean
    X.loc[X[feature] > maximum, feature] = mean

# Take all the columns except the last one (the last column is the label)
X = dataset.iloc[:, :-1]
for i in range(len(X.columns)):
    remove_outlier(X.columns[i])


Feature Scaling:

Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, units and range. Since many machine learning algorithms use the Euclidean distance between data points in their computations, this creates a problem. To avoid this effect, we need to bring all features to the same level of magnitude; here this is done with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)
scaled_df = pd.DataFrame(data=scaled_data, columns=X.columns)
scaled_df.head()

Label Encoding:

Certain algorithms like XGBoost can only have numerical values as their predictor variables. Hence we need to encode our categorical values. LabelEncoder from sklearn.preprocessing package encodes labels with values between 0 and n_classes-1.

label = dataset["class"]

encoder = LabelEncoder()

label = encoder.fit_transform(label)

Model Training and Evaluation:


from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

X = scaled_df
y = label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

clf_gnb = GaussianNB()
pred_gnb = clf_gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_gnb, y_test)
# Out []: 0.8085106382978723

clf_svc = SVC(kernel="linear")
pred_svc = clf_svc.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_svc, y_test)
# Out []: 0.7872340425531915

clf_xgb = XGBClassifier()
pred_xgb = clf_xgb.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_xgb, y_test)
# Out []: 0.8297872340425532

Feature Importance:

from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(12, 6))
plot_importance(clf_xgb, ax=ax)




















Marginal plot

A marginal plot allows us to study the relationship between 2 numeric variables. The central chart displays their correlation.

Let's visualize the relationship between degree_spondylolisthesis and the class label:

sns.set(style="white", color_codes=True)

sns.jointplot(x=X["degree_spondylolisthesis"], y=label, kind='kde', color="skyblue")













































Supervised Learning - Classification/ SLC Hands-On Quiz

An Exploratory Data Analysis on Lower Back Pain

Question Answers

Question1:

Load the dataset and identify the variables that have a correlation greater than or equal to 0.7 with the ‘pelvic_incidence’ variable?

  • pelvic tilt, pelvic_radius
  • lumbar_lordosis_angle, sacral_slope
  • Direct_tilt, sacrum_angle
  • thoracic_slope, thoracic_slope

Ans: lumbar_lordosis_angle, sacral_slope


plt.figure(figsize=(10,5))
sns.heatmap(dataset.corr()[dataset.corr()>=0.7],annot=True,vmax=1,vmin=-1,cmap='Spectral');












Question2:

Encode Status variable: Abnormal class to 1 and Normal to 0.

Split the data into a 70:30 ratio. What is the percentage of 0 and 1 classes in the test data (y_test)?

  • 1: In a range of 0.1 to 0.2 / 0: In a range of 0.2 to 0.3
  • 1: In a range of 0.5 to 0.6 / 0: In a range of 0.3 to 0.6
  • 1: In a range of 0.6 to 0.7 / 0: In a range of 0.3 to 0.4
  • 1: In a range of 0.7 to 0.8 / 0: In a range of 0.2 to 0.3


Ans: 

1: In a range of 0.7 to 0.8

0: In a range of  0.2 to 0.3

dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x=='Abnormal' else 0)

X = dataset.drop(['Status'], axis=1)

Y = dataset['Status']

#Splitting data in train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state = 1)

y_test.value_counts(normalize=True)


1    0.709677
0    0.290323
Name: Status, dtype: float64




Question3:

Which metric is the most appropriate to evaluate the model according to the problem statement?

Accuracy, Recall, Precision, F1 score


Ans: Recall

If the model predicts that a person does not have an abnormal spine when they actually do, a person who needs treatment will be missed. Hence, reducing such false negatives is important, and recall is the metric that measures this.
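A hedged toy example (the labels below are made up) of how recall captures exactly those missed abnormal cases:

from sklearn.metrics import confusion_matrix, recall_score

y_true_toy = [1, 1, 1, 1, 0, 0]   # 1 = Abnormal, 0 = Normal
y_pred_toy = [1, 1, 0, 0, 0, 0]   # two abnormal cases predicted as normal

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(fn, recall_score(y_true_toy, y_pred_toy))   # 2 false negatives -> recall = 2/4 = 0.5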

Question4:

Check for multicollinearity in data and choose the variables which show high multicollinearity? (VIF value greater than 5)

  • sacrum_angle, pelvic tilt, sacral_slope
  • pelvic_slope, cervical_tilt, sacrum_angle
  • pelvic_incidence, pelvic tilt, sacral_slope
  • pelvic_incidence, pelvic tilt, lumbar_lordosis_angle

Ans: pelvic_incidence, pelvic tilt, sacral_slope


from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# DataFrame with the numerical columns only
num_feature_set = X_train.copy()
num_feature_set = add_constant(num_feature_set)
num_feature_set = num_feature_set.astype(float)

# Calculating VIF
vif_series = pd.Series([variance_inflation_factor(num_feature_set.values, i) for i in range(num_feature_set.shape[1])],
                       index=num_feature_set.columns, dtype=float)
print('Series before feature selection: \n\n{}\n'.format(vif_series))

Question5:

How many minimum numbers of attributes will we need to drop to remove multicollinearity (or get a VIF value less than 5) from the data?

  • 1
  • 2
  • 3
  • 4


Ans: 1
# Dropping the first variable with a high VIF
num_feature_set1 = num_feature_set.drop(['pelvic_incidence'], axis=1)

# Checking VIF values again
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set1.values, i) for i in range(num_feature_set1.shape[1])],
                        index=num_feature_set1.columns, dtype=float)
print('Series after dropping pelvic_incidence: \n\n{}\n'.format(vif_series1))

# Dropping the second variable with a high VIF
num_feature_set2 = num_feature_set.drop(['pelvic tilt'], axis=1)

# Checking VIF values again
vif_series2 = pd.Series([variance_inflation_factor(num_feature_set2.values, i) for i in range(num_feature_set2.shape[1])],
                        index=num_feature_set2.columns, dtype=float)
print('Series after dropping pelvic tilt: \n\n{}\n'.format(vif_series2))


Question6:

Drop sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only significant variables (p-value < 0.05).

How many significant variables are left in the final model excluding the constant?

  • 1
  • 2
  • 3
  • 4

Ans: 2

# Dropping sacral slope
X_train, X_test, y_train, y_test = train_test_split(num_feature_set3, Y, test_size=0.30, random_state = 1) # Iteratively dropping variables with a high p-value X_train2 = X_train.drop(['pelvic_slope'],axis=1) X_test2 = X_test.drop(['pelvic_slope'],axis=1) logit = sm.Logit(y_train, X_train2.astype(float)) lg = logit.fit() print(lg.summary()) X_train3 = X_train2.drop(['scoliosis_slope'],axis=1) X_test3 = X_test2.drop(['scoliosis_slope'],axis=1) logit = sm.Logit(y_train, X_train3.astype(float)) lg = logit.fit() print(lg.summary()) X_train4 = X_train3.drop(['cervical_tilt'],axis=1) X_test4 = X_test3.drop(['cervical_tilt'],axis=1) logit = sm.Logit(y_train, X_train4.astype(float)) lg = logit.fit() print(lg.summary()) X_train5 = X_train4.drop(['Direct_tilt'],axis=1) X_test5 = X_test4.drop(['Direct_tilt'],axis=1) logit = sm.Logit(y_train, X_train5.astype(float)) lg = logit.fit() print(lg.summary()) X_train6 = X_train5.drop(['lumbar_lordosis_angle'],axis=1) X_test6 = X_test5.drop(['lumbar_lordosis_angle'],axis=1) logit = sm.Logit(y_train, X_train6.astype(float)) lg = logit.fit() print(lg.summary()) X_train7 = X_train6.drop(['sacrum_angle'],axis=1) X_test7 = X_test6.drop(['sacrum_angle'],axis=1) logit = sm.Logit(y_train, X_train7.astype(float)) lg = logit.fit() print(lg.summary()) X_train8 = X_train7.drop(['thoracic_slope'],axis=1) X_test8 = X_test7.drop(['thoracic_slope'],axis=1) logit = sm.Logit(y_train, X_train8.astype(float)) lg = logit.fit() print(lg.summary())


Question7:


Marks: 2/2

Select the correct option for the following:

Train a decision tree model with default parameters and vary the depth from 1 to 8 (both values included) and compare the model performance at each value of depth

At depth = 1, the decision tree gives the highest recall among all the models on the training set.

At depth = 2, the decision tree gives the highest recall among all the models on the training set.

At depth = 5, the decision tree gives the highest recall among all the models on the training set.

At depth = 8, the decision tree gives the highest recall among all the models on the training set.

Ans: At depth = 8, the decision tree gives the highest recall among all the models on the training set (training recall = 1.0).


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

score_DT = []
for i in range(1, 9):
    dTree = DecisionTreeClassifier(max_depth=i, criterion='gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth': i, 'Recall': recall_score(y_train, pred)}
    score_DT.append(case)

print(score_DT)

[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]


Question8:

Plot the feature importance of the variables given by the model which gives the maximum value of recall on the training set in Q7. Which are the 2 most important variables respectively?

  • lumbar_lordosis_angle, sacrum_angle
  • degree_spondylolisthesis, pelvic tilt
  • scoliosis_slope, cervical_tilt
  • scoliosis_slope, cervical_tilt

Ans: degree_spondylolisthesis, pelvic tilt
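The post does not show the code for this step; a minimal sketch (dTree8 is a hypothetical name, assuming the depth-8 tree from Q7 is refit) could look like this:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Refit the model that gave the maximum training recall in Q7 (depth = 8)
dTree8 = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
dTree8.fit(X_train, y_train)

# Plot the feature importances to read off the two most important variables
importances = pd.Series(dTree8.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(figsize=(10, 6))
plt.title('Feature importances, depth-8 decision tree')
plt.show()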



Question9:

Perform hyperparameter tuning for the Decision Tree using GridSearchCV.

Use the following list of hyperparameters and their values:

Maximum depth: [5, 10, 15, None], criterion: ['gini', 'entropy'], splitter: ['best', 'random']

Set cv = 3 and scoring = 'recall' in the grid search.

Which of the following statements is/are True?

A) GridSearchCV selects the max_depth as 10

B) GridSearchCV selects the criterion as 'gini'

C) GridSearchCV selects the splitter as 'random'

D) GridSearchCV selects the splitter as 'best'

E) GridSearchCV selects the max_depth as 5

F) GridSearchCV selects the criterion as 'entropy'

  • A, B, and C
  • B, C, and E
  • A, C, and F
  • D, E, and F

Ans: A, C, and F


from sklearn.model_selection import GridSearchCV

# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': [5,10,15,None], 
 'criterion' : ['gini','entropy'],
 'splitter' : ['best','random']
 }

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring='recall',cv=3)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
estimator.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')

Question10:

Compare the model performance of a Decision Tree with default parameters and the tuned Decision tree built in Q9 on the test set.

Which of the following statements is/are True?

  • A) Recall Score of tuned model > Recall Score of decision tree with default parameters
  • B) Recall Score of tuned model < Recall Score of decision tree with default parameters
  • C) F1 Score of tuned model > F1 Score of decision tree with default parameters
  • D) F1 Score of tuned model < F1 Score of decision tree with default parameters

A and B

B and C

C and D

A and D


Ans: A and D


from sklearn import metrics

# Training a decision tree with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred_test1 = model.predict(X_test)   # predictions on the test set (step implied by the prints below)

# Tuned model from Q9
estimator.fit(X_train, y_train)
y_pred_test2 = estimator.predict(X_test)

# Checking model performance of the Decision Tree with default parameters
print(recall_score(y_test, y_pred_test1))
print(metrics.f1_score(y_test, y_pred_test1))

# Checking model performance of the tuned Decision Tree
print(recall_score(y_test, y_pred_test2))
print(metrics.f1_score(y_test, y_pred_test2))