EDA : Lower Back Pain

 Exploratory Data Analysis on Lower Back Pain

Lower Back Pain

Lower back pain, also called lumbago, is not a disorder. It’s a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:

  • ligaments
  • muscles
  • nerves
  • the bony structures that make up the spine, called vertebral bodies or vertebrae

It can also be due to a problem with nearby organs, such as the kidneys.

According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you’re experiencing back pain.

In this Exploratory Data Analysis (EDA) I am going to use the Lower Back Pain Symptoms Dataset and try to find interesting insights in it.

If importing xgboost throws a ModuleNotFoundError, install it first:

#pip install xgboost

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# change working directory
os.chdir('C:\\Users\\kt.rinith\\Google Drive\\Training\\PGP-DSBA\\Jupiter Files')

dataset = pd.read_csv("backpain.csv")
dataset.head() # this will return top 5 rows

# This command would remove a trailing unnamed column, if the CSV has one.
#del dataset["Unnamed: 13"]

dataset.info() # column names, non-null counts and dtypes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pelvic_incidence          310 non-null    float64
 1   pelvic tilt               310 non-null    float64
 2   lumbar_lordosis_angle     310 non-null    float64
 3   sacral_slope              310 non-null    float64
 4   pelvic_radius             310 non-null    float64
 5   degree_spondylolisthesis  310 non-null    float64
 6   pelvic_slope              310 non-null    float64
 7   Direct_tilt               310 non-null    float64
 8   thoracic_slope            310 non-null    float64
 9   cervical_tilt             310 non-null    float64
 10  sacrum_angle              310 non-null    float64
 11  scoliosis_slope           310 non-null    float64
 12  Status                    310 non-null    object 
dtypes: float64(12), object(1)
memory usage: 31.6+ KB




sns.pairplot(dataset, hue="Status")

Visualize Features with Histogram:

A histogram is the most commonly used graph to show frequency distributions.

dataset.hist(figsize=(15,12), bins=20, color="#007959AA")
plt.suptitle("Features Distribution") # one title for the whole grid of histograms

Detecting and Removing Outliers

plt.subplots(figsize=(15,6))
dataset.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)

(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]),
 [Text(1, 0, 'pelvic_incidence'),
  Text(2, 0, 'pelvic tilt'),
  Text(3, 0, 'lumbar_lordosis_angle'),
  Text(4, 0, 'sacral_slope'),
  Text(5, 0, 'pelvic_radius'),
  Text(6, 0, 'degree_spondylolisthesis'),
  Text(7, 0, 'pelvic_slope'),
  Text(8, 0, 'Direct_tilt'),
  Text(9, 0, 'thoracic_slope'),
  Text(10, 0, 'cervical_tilt'),
  Text(11, 0, 'sacrum_angle'),
  Text(12, 0, 'scoliosis_slope')])

Remove Outliers:

# We use the Tukey method to handle outliers.
# Whiskers are set at 1.5 times the interquartile range (IQR).
def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR # the acceptable minimum value
    maximum = third_q + IQR # the acceptable maximum value

    mean = X[feature].mean()
    # Any value beyond the acceptable range is considered an outlier.
    # We replace the outliers with the mean value of that feature.
    X.loc[X[feature] < minimum, feature] = mean
    X.loc[X[feature] > maximum, feature] = mean

# taking all the columns except the last one
# last column is the label
X = dataset.iloc[:, :-1].copy()
for i in range(len(X.columns)):
    remove_outlier(X.columns[i])
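To make the Tukey rule concrete, here is a small self-contained sketch on made-up numbers. The sample values and the helper name `replace_outliers_with_mean` are illustrative assumptions, not part of the dataset or the notebook. It computes the lower and upper fences and replaces anything outside them with the column mean, but returns a new array instead of modifying `X` in place.

```python
import numpy as np

def replace_outliers_with_mean(values, k=1.5):
    """Return a copy of `values` with Tukey outliers replaced by the mean."""
    values = np.asarray(values, dtype=float)
    first_q, third_q = np.percentile(values, [25, 75])
    iqr = third_q - first_q
    lower = first_q - k * iqr   # the acceptable minimum value
    upper = third_q + k * iqr   # the acceptable maximum value
    mean = values.mean()        # mean of the original data, outliers included
    cleaned = values.copy()
    cleaned[(values < lower) | (values > upper)] = mean
    return cleaned, lower, upper

# Hypothetical measurements with one obvious outlier (95)
sample = [10, 12, 11, 13, 12, 11, 95]
cleaned, lower, upper = replace_outliers_with_mean(sample)
print(lower, upper)
print(cleaned)
```

Note that the mean used as the replacement is itself pulled upward by the outlier; replacing with the median would be one possible alternative.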

Feature Scaling:

Feature scaling through standardization (Z-score normalization) or min-max normalization can be an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, unit and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, features with larger magnitudes would dominate the result. To avoid this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling; here we use MinMaxScaler, which rescales each feature to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)
scaled_df = pd.DataFrame(data = scaled_data, columns = X.columns)
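As a rough illustration of what min-max scaling does to a single column, the sketch below applies the formula (x - min) / (max - min) by hand. The helper `min_max_scale` and the sample angles are assumptions made for this example; it also assumes the column is not constant, since that would divide by zero.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D column to the range 0..1 (assumes max > min)."""
    column = np.asarray(column, dtype=float)
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

# Hypothetical angle measurements
angles = [30.0, 45.0, 60.0, 90.0]
scaled = min_max_scale(angles)
print(scaled)  # smallest value maps to 0, largest to 1
```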

Label Encoding:

Certain algorithms like XGBoost can only have numerical values as their predictor variables. Hence we need to encode our categorical values. LabelEncoder from sklearn.preprocessing package encodes labels with values between 0 and n_classes-1.

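from sklearn.preprocessing import LabelEncoder

label = dataset["Status"] # the label column is named "Status" in this dataset

encoder = LabelEncoder()

label = encoder.fit_transform(label)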
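LabelEncoder assigns integer codes in sorted order of the class names. The sketch below reproduces that mapping with `np.unique` on a few made-up labels (the sample array is an assumption for illustration), which makes it easy to see which class ends up as 0 and which as 1.

```python
import numpy as np

# Hypothetical label column
status = np.array(["Abnormal", "Normal", "Abnormal", "Normal", "Normal"])

# Sorted unique class names, and the integer code of each original entry
classes, codes = np.unique(status, return_inverse=True)
print(classes)
print(codes)
```

With this ordering "Abnormal" is encoded as 0 and "Normal" as 1, which is worth keeping in mind when reading metrics such as recall that depend on which class counts as positive.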

Model Training and Evaluation:

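from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = scaled_df
y = label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

clf_gnb = GaussianNB()
pred_gnb = clf_gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_gnb, y_test)
# Out []: 0.8085106382978723

clf_svc = SVC(kernel="linear")
pred_svc = clf_svc.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_svc, y_test)
# Out []: 0.7872340425531915

clf_xgb = XGBClassifier()
pred_xgb = clf_xgb.fit(X_train, y_train).predict(X_test)
accuracy_score(pred_xgb, y_test)
# Out []: 0.8297872340425532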

Feature Importance:

from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(12, 6))
plot_importance(clf_xgb, ax=ax)

Marginal plot

A marginal plot allows us to study the relationship between 2 numeric variables. The central chart displays their correlation.

Let's visualize the relationship between degree_spondylolisthesis and the class label:

sns.set(style="white", color_codes=True)

sns.jointplot(x=X["degree_spondylolisthesis"], y=label, kind='kde', color="skyblue")

Supervised Learning - Classification/ SLC Hands-On Quiz

An Exploratory Data Analysis on Lower Back Pain

Question Answers


Load the dataset and identify the variables that have a correlation greater than or equal to 0.7 with the ‘pelvic_incidence’ variable?

  • pelvic tilt, pelvic_radius
  • lumbar_lordosis_angle, sacral_slope
  • Direct_tilt, sacrum_angle
  • thoracic_slope, thoracic_slope

Ans: lumbar_lordosis_angle, sacral_slope
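One way to answer this kind of question is to take a single column of the correlation matrix and filter it by the threshold. The sketch below does that on synthetic data; the helper `correlated_with` and the demo columns are assumptions for illustration, not the quiz dataset. On the real data the call would be `correlated_with(dataset, "pelvic_incidence")`.

```python
import numpy as np
import pandas as pd

def correlated_with(df, target, threshold=0.7):
    """Numeric columns whose Pearson correlation with `target` is >= threshold."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr[corr >= threshold].sort_values(ascending=False)

# Synthetic demo data: "strong" tracks "target" closely, "noise" does not
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "target": base,
    "strong": base + rng.normal(scale=0.3, size=200),
    "noise": rng.normal(size=200),
})

result = correlated_with(demo, "target")
print(result)
```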



Encode Status variable: Abnormal class to 1 and Normal to 0.

Split the data into a 70:30 ratio. What is the percentage of 0 and 1 classes in the test data (y_test)?

1: In a range of 0.1 to 0.2/ 0: In a range of 0.2 to 0.3

1: In a range of 0.5 to 0.6/ 0: In a range of 0.3 to 0.6

1: In a range of 0.6 to 0.7/ 0: In a range of 0.3 to 0.4

1: In a range of 0.7 to 0.8/ 0: In a range of 0.2 to 0.3


Ans: 1: In a range of 0.7 to 0.8/ 0: In a range of 0.2 to 0.3

from sklearn.model_selection import train_test_split

dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x=='Abnormal' else 0)

X = dataset.drop(['Status'], axis=1)

Y = dataset['Status']

#Splitting data in train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state = 1)

# Share of each class in the test labels
y_test.value_counts(normalize=True)


1    0.709677
0    0.290323
Name: Status, dtype: float64




Which metric is the most appropriate metric to evaluate the model according to the problem statement? 

Accuracy, Recall, Precision, F1 score

Ans: Recall

If the model predicts that a person does not have an abnormal spine when they actually do, a person who needs treatment will be missed. Reducing such false negatives is therefore important, and recall is the metric that penalizes them.
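Recall is the share of truly abnormal cases that the model catches: true positives divided by true positives plus false negatives. A minimal sketch with made-up labels follows; the helper `recall_from_labels` and the two lists are assumptions for illustration only.

```python
def recall_from_labels(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)

# Hypothetical labels: 1 = Abnormal, 0 = Normal
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Three of the four abnormal cases are caught; one is missed
print(recall_from_labels(y_true, y_pred))  # 0.75
```

The false positive in the last position does not affect recall at all, which is why recall is usually read alongside precision or the F1 score.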


Check for multicollinearity in data and choose the variables which show high multicollinearity? (VIF value greater than 5)

  • sacrum_angle, pelvic tilt, sacral_slope
  • pelvic_slope, cervical_tilt, sacrum_angle
  • pelvic_incidence, pelvic tilt, sacral_slope
  • pelvic_incidence, pelvic tilt, lumbar_lordosis_angle

Ans: pelvic_incidence, pelvic tilt, sacral_slope

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

#dataframe with numerical column only 
num_feature_set = X_train.copy() 
num_feature_set = add_constant(num_feature_set) 
num_feature_set = num_feature_set.astype(float)

# Calculating VIF
vif_series = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns, dtype = float)
print('Series before feature selection: \n\n{}\n'.format(vif_series))
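For intuition, the VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other columns. The sketch below computes it with plain NumPy on synthetic data in which one column is almost the sum of the other two. The helper `vif` and the generated columns are assumptions for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def vif(features, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    features = np.asarray(features, dtype=float)
    y = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.1, size=300)  # nearly a linear combination of a and b

data = np.column_stack([a, b, c])
vifs = [vif(data, j) for j in range(3)]
print([round(v, 1) for v in vifs])           # all three are inflated

# Without the combined column, a and b are close to independent
print(round(vif(np.column_stack([a, b]), 0), 2))
```

Removing any one of the three related columns breaks the linear dependence, so the remaining two drop back to a VIF near 1.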


How many minimum numbers of attributes will we need to drop to remove multicollinearity (or get a VIF value less than 5) from the data?

  • 1
  • 2
  • 3
  • 4

Ans: 1
# Dropping first variable with high VIF 
num_feature_set1 = num_feature_set.drop(['pelvic_incidence'],axis=1)

# Checking VIF value 
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set1.values,i) for i in range(num_feature_set1.shape[1])],index=num_feature_set1.columns, dtype = float)
print('Series after dropping pelvic_incidence: \n\n{}\n'.format(vif_series1))

# Dropping second variable with high VIF (from the original feature set)
num_feature_set2 = num_feature_set.drop(['pelvic tilt'],axis=1) 

# Checking VIF value 
vif_series2 = pd.Series([variance_inflation_factor(num_feature_set2.values,i) for i in range(num_feature_set2.shape[1])],index=num_feature_set2.columns, dtype = float)
print('Series after dropping pelvic tilt: \n\n{}\n'.format(vif_series2))


Drop sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only significant variables (p-value < 0.05).

How many significant variables are left in the final model excluding the constant?

  • 1
  • 2
  • 3
  • 4

Ans: 2

import statsmodels.api as sm

# Dropping sacral_slope
# (the definition of num_feature_set3 is not shown here; the next line is an assumed reconstruction)
num_feature_set3 = add_constant(X).drop(['sacral_slope'],axis=1).astype(float)
X_train, X_test, y_train, y_test = train_test_split(num_feature_set3, Y, test_size=0.30, random_state = 1)

# Iteratively dropping variables with a high p-value
X_train2 = X_train.drop(['pelvic_slope'],axis=1)
X_test2 = X_test.drop(['pelvic_slope'],axis=1)
logit = sm.Logit(y_train, X_train2.astype(float))
lg = logit.fit()
print(lg.summary())

X_train3 = X_train2.drop(['scoliosis_slope'],axis=1)
X_test3 = X_test2.drop(['scoliosis_slope'],axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg = logit.fit()
print(lg.summary())

X_train4 = X_train3.drop(['cervical_tilt'],axis=1)
X_test4 = X_test3.drop(['cervical_tilt'],axis=1)
logit = sm.Logit(y_train, X_train4.astype(float))
lg = logit.fit()
print(lg.summary())

X_train5 = X_train4.drop(['Direct_tilt'],axis=1)
X_test5 = X_test4.drop(['Direct_tilt'],axis=1)
logit = sm.Logit(y_train, X_train5.astype(float))
lg = logit.fit()
print(lg.summary())

X_train6 = X_train5.drop(['lumbar_lordosis_angle'],axis=1)
X_test6 = X_test5.drop(['lumbar_lordosis_angle'],axis=1)
logit = sm.Logit(y_train, X_train6.astype(float))
lg = logit.fit()
print(lg.summary())

X_train7 = X_train6.drop(['sacrum_angle'],axis=1)
X_test7 = X_test6.drop(['sacrum_angle'],axis=1)
logit = sm.Logit(y_train, X_train7.astype(float))
lg = logit.fit()
print(lg.summary())

X_train8 = X_train7.drop(['thoracic_slope'],axis=1)
X_test8 = X_test7.drop(['thoracic_slope'],axis=1)
logit = sm.Logit(y_train, X_train8.astype(float))
lg = logit.fit()
print(lg.summary())



Select the correct option for the following:

Train a decision tree model with default parameters and vary the depth from 1 to 8 (both values included) and compare the model performance at each value of depth

At depth = 1, the decision tree gives the highest recall among all the models on the training set.

At depth = 2, the decision tree gives the highest recall among all the models on the training set.

At depth = 5, the decision tree gives the highest recall among all the models on the training set.

At depth = 8, the decision tree gives the highest recall among all the models on the training set.

Ans: At depth = 8, the decision tree gives the highest recall among all the models on the training set.

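from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

score_DT = []
for i in range(1,9):
    dTree = DecisionTreeClassifier(max_depth=i,criterion = 'gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth':i,'Recall':recall_score(y_train,pred)}
    score_DT.append(case)

print(score_DT)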


[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]
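Rather than scanning the list by eye, the depth with the highest training recall can be picked programmatically. The sketch below copies the scores printed above into a literal so it runs on its own; the variable name `best` is an assumption for this example.

```python
# Training-set recall per depth, copied from the output above
score_DT = [
    {'Depth': 1, 'Recall': 0.6875},
    {'Depth': 2, 'Recall': 0.8888888888888888},
    {'Depth': 3, 'Recall': 0.8888888888888888},
    {'Depth': 4, 'Recall': 0.9583333333333334},
    {'Depth': 5, 'Recall': 0.9652777777777778},
    {'Depth': 6, 'Recall': 0.9930555555555556},
    {'Depth': 7, 'Recall': 0.9861111111111112},
    {'Depth': 8, 'Recall': 1.0},
]

# Entry with the largest recall
best = max(score_DT, key=lambda case: case['Recall'])
print(best)  # {'Depth': 8, 'Recall': 1.0}
```

A recall of 1.0 on the training set often points to overfitting, so the same comparison on the test set is the more informative one.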


Plot the feature importance of the variables given by the model which gives the maximum value of recall on the training set in Q7. Which are the 2 most important variables respectively?

  • lumbar_lordosis_angle, sacrum_angle
  • degree_spondylolisthesis, pelvic tilt
  • scoliosis_slope, cervical_tilt
  • scoliosis_slope, cervical_tilt

Ans: degree_spondylolisthesis, pelvic tilt


Perform hyperparameter tuning for the decision tree using GridSearchCV.

Use the following list of hyperparameters and their values:

  • Maximum depth: [5, 10, 15, None]
  • criterion: ['gini', 'entropy']
  • splitter: ['best', 'random']

Set cv = 3 and scoring = 'recall' in the grid search.

Which of the following statements is/are True?

A) GridSearchCV selects the max_depth as 10

B) GridSearchCV selects the criterion as 'gini'

C) GridSearchCV selects the splitter as 'random'

D) GridSearchCV selects the splitter as 'best'

E) GridSearchCV selects the max_depth as 5

F) GridSearchCV selects the criterion as 'entropy'

  • A, B, and C
  • B, C, and E
  • A, C, and F
  • D, E, and F

Ans: A, C, and F

from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': [5,10,15,None], 
              'criterion' : ['gini','entropy'],
              'splitter' : ['best','random']}

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring='recall',cv=3)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')
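To get a feel for how much work this grid search does, the sketch below enumerates the parameter combinations with `itertools.product`. It only counts candidates and does not train anything; the names `combinations` and `cv` are assumptions for this example.

```python
from itertools import product

parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
}
cv = 3

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]

print(len(combinations))       # 4 * 2 * 2 = 16 candidate settings
print(len(combinations) * cv)  # 48 model fits across the 3 folds
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the whole training set, which is why `best_estimator_` is ready to use straight away.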


Compare the model performance of a Decision Tree with default parameters and the tuned Decision tree built in Q9 on the test set.

Which of the following statements is/are True?

  • A) Recall Score of tuned model > Recall Score of decision tree with default parameters
  • B) Recall Score of tuned model < Recall Score of decision tree with default parameters
  • C) F1 Score of tuned model > F1 Score of decision tree with default parameters
  • D) F1 Score of tuned model < F1 Score of decision tree with default parameters

A and B

B and C

C and D

A and D

Ans: A and D

from sklearn import metrics

# Training decision tree with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,y_train)
y_pred_test1 = model.predict(X_test)

# Tuned model
estimator.fit(X_train, y_train)
y_pred_test2 = estimator.predict(X_test)

# Checking model performance of Decision Tree with default parameters
print(recall_score(y_test,y_pred_test1))
print(metrics.f1_score(y_test,y_pred_test1))

# Checking model performance of tuned Decision Tree
print(recall_score(y_test,y_pred_test2))
print(metrics.f1_score(y_test,y_pred_test2))
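It may look odd that a model can have a higher recall and a lower F1 score at the same time. The sketch below shows how that happens, using purely hypothetical confusion counts. They are not the results of the two trees above. The "tuned" counts catch more positives but also raise more false alarms, which lowers precision and with it the F1 score.

```python
def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, written in terms of counts
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical confusion counts, NOT the results of the models above
default_model = {"tp": 55, "fp": 5, "fn": 11}
tuned_model = {"tp": 60, "fp": 20, "fn": 6}

for name, m in [("default", default_model), ("tuned", tuned_model)]:
    print(
        name,
        round(recall(m["tp"], m["fn"]), 3),
        round(precision(m["tp"], m["fp"]), 3),
        round(f1(**m), 3),
    )
```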

Monday, 24 May 2021

Some of the worst cable management hell and why cable management is important


Cables here, cables there, cables everywhere! 

Before I discuss solutions to help you get more organized, let’s look at some examples of horrible cable management. Be warned: some of these examples may just make you cry.

Can you find the hidden equipment in this mess?

One of the leading data centres I visited had cable management this bad, and we had to wait another two weeks to decommission a Riverbed WAN accelerator appliance! Guess what: to pull out the customer appliance, they had to plan for production downtime.

If you dread walking into your server room to troubleshoot a network issue because of bad cable management or worse, dread having to give higher-ups a tour of your facilities, then it’s about time to straighten up your cable management system.

Some glimpses from the internet of the worst cable hell and wiring ever seen.


Here are some things you can do now to avoid joining the hall of fame of terrible cable management shown above.

Proper cable management will not only support existing infrastructure, but will also accommodate future growth.

Consider these tips for your next project:

  • Before purchasing or installing cable products, determine the amount of cabling and connections required. Be sure to allow room for access and growth.
  • Be sure to follow industry standards, such as ANSI/TIA and ISO/IEC, as well as any federal, state or local regulations. This will help ensure a safe, failure-free installation that will minimize system downtime.
  • Plan for change by organizing cable properly and labeling cable that may need to be quickly and easily identified. Also, try to avoid blocking access to equipment inside and outside the racks.
  • Be sure to use sweeping 90-degree bends when transitioning from the pathway support to the racks.
  • Density is very important in data center cabinets and racks, so keep in mind how many rack spaces are being utilized with horizontal wire managers.
  • Select a vertical cable manager that can accommodate all of the cable feeding from the horizontal managers. Use waterfalls and spools to help manage multiple cables and to help with maintaining proper bend radius on copper and fiber cables.
  • Use a 50% cable fill when selecting vertical and horizontal cable management. This allows sufficient space for maintaining cable bend radius for patch cords.
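As a rough back-of-the-envelope illustration of the 50% fill tip, the sketch below estimates how many cables fit in a cable manager when only half of its cross-section is to be occupied. The function name, the manager dimensions and the cable diameter are all hypothetical assumptions; manufacturers publish their own fill tables, which should take precedence.

```python
import math

def max_cables(manager_width_mm, manager_depth_mm, cable_diameter_mm, fill_ratio=0.5):
    """Estimate the cable count for a given fill ratio (area-based approximation)."""
    manager_area = manager_width_mm * manager_depth_mm
    cable_area = math.pi * (cable_diameter_mm / 2) ** 2
    return math.floor(fill_ratio * manager_area / cable_area)

# Hypothetical 150 mm x 100 mm vertical manager and 6 mm patch cords
print(max_cables(150, 100, 6.0))  # 265
```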


Making installations more efficient is one of the most beneficial things to plan for. It not only saves time but can also reduce issues down the line; this is the upside of proper cable management. Cable management is the organization of electrical or optical cables in a cabinet or an installation, and the term reflects the goal of planning ahead. Cable installations vary from job to job, but in most of them you can see how difficult it is to situate each cable so that it is easy to work with. With too many cables bundled around each other, problems can arise later, such as cables being unplugged by accident or difficulty identifying which cable is causing a fault. This is why cable management is crucial to a smooth workplace and installation.


Proper cable management improves safety in the workplace. Fire is a concern after cable installation: loose cables can become tangled with each other and possibly create a spark. That spark can turn into a fire that damages your network, data center and building, and of course causes financial loss! There is also the chance of someone passing by where the cables are installed and tripping over or catching on them, resulting in an injury. You never know what might happen, so it's best to keep a clean and organized setup.

Air Flow

An important factor in a cable's longevity is good air flow around the installation. The more air flow around connected and running cable, the better, and it improves energy efficiency as well. Keeping temperatures low and consistent benefits a cable's structure and performance, while increased temperatures can damage the cable's jacket and harm its inner workings. Keeping your cables tied together and out of the way opens up airways and helps prevent the temperature around the cables from rising.


Correct cable management makes life easier when you go back to troubleshoot a problem with your cabling. Color-coding your network cables can help you troubleshoot problems down the line and manage future additions. Plus, you'll get major props from others for a well-managed setup.

OneDrive Files Accidental Deletion and Recovery within 93 Days!


If you have accidentally deleted your OneDrive files, there is no need to worry. You can recover them from your OneDrive recycle bin. You might also receive a warning email from SharePoint Online (Microsoft support team), similar to the one below:

Files are permanently removed from the online recycle bin 93 days after they're deleted

Hi Rinith KT,

We noticed that you recently deleted a large number of files from your OneDrive.

When files are deleted, they're stored in your recycle bin and can be restored within 93 days. After 93 days, deleted files are gone forever.

If you want to restore these files, go to the recycle bin. Select what you want to restore, and click the Restore button.

Ignore this mail if you meant to get rid of these files.

Learn more about deleting and restoring files.