Exploratory Data Analysis on Lower Back Pain
Lower Back Pain
Lower back pain, also called lumbago, is not a disorder. It’s a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:
- ligaments
- muscles
- nerves
- the bony structures that make up the spine, called vertebral bodies or vertebrae
It can also be due to a problem with nearby organs, such as the kidneys.
According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you’re experiencing back pain.
In this exploratory data analysis (EDA), I use the Lower Back Pain Symptoms Dataset and look for interesting insights in the data.
# If importing xgboost raises a ModuleNotFoundError, install it first:
#pip install xgboost
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
os.getcwd()
# change working directory
os.chdir('C:\\Users\\kt.rinith\\Google Drive\\Training\\PGP-DSBA\\Jupiter Files')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   pelvic_incidence          310 non-null    float64
 1   pelvic tilt               310 non-null    float64
 2   lumbar_lordosis_angle     310 non-null    float64
 3   sacral_slope              310 non-null    float64
 4   pelvic_radius             310 non-null    float64
 5   degree_spondylolisthesis  310 non-null    float64
 6   pelvic_slope              310 non-null    float64
 7   Direct_tilt               310 non-null    float64
 8   thoracic_slope            310 non-null    float64
 9   cervical_tilt             310 non-null    float64
 10  sacrum_angle              310 non-null    float64
 11  scoliosis_slope           310 non-null    float64
 12  Status                    310 non-null    object
dtypes: float64(12), object(1)
memory usage: 31.6+ KB
dataset["Status"].value_counts().sort_index().plot.bar()
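# "Status" is a text column; numeric_only=True (pandas 1.5+) restricts the
# correlation to the numeric features, which pandas 2.x does not do by default.
dataset.corr(numeric_only=True)
plt.subplots(figsize=(12,8))
sns.heatmap(dataset.corr(numeric_only=True))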
sns.pairplot(dataset, hue="Status")
Visualize Features with Histogram:
A histogram is one of the most commonly used graphs for showing frequency distributions.
dataset.hist(figsize=(15,12),bins = 20, color="#007959AA")
# suptitle titles the whole figure; plt.title would label only the last subplot
plt.suptitle("Features Distribution")
plt.show()
Detecting and Removing Outliers
plt.subplots(figsize=(15,6))
dataset.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)
Remove Outliers:
# We use Tukey's method to detect outliers:
# the whiskers are set at 1.5 times the interquartile range (IQR).
def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR  # the acceptable minimum value
    maximum = third_q + IQR  # the acceptable maximum value
    mean = X[feature].mean()
    # Any value outside the acceptable range is considered an outlier.
    # We replace the outliers with the mean value of that feature.
    X.loc[X[feature] < minimum, feature] = mean
    X.loc[X[feature] > maximum, feature] = mean

# taking all the columns except the last one
# last column is the label
# .copy() makes X an independent DataFrame, so dataset itself is not modified
X = dataset.iloc[:, :-1].copy()
for i in range(len(X.columns)):
    remove_outlier(X.columns[i])

After removing Outliers:
features distribution after removing outliers
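To make the fences concrete, here is a minimal sketch of Tukey's rule on a small made-up sample. The array `sample` and the helper name `tukey_fences` are hypothetical and not part of the dataset or the notebook; the fence computation mirrors the one in `remove_outlier`.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a 1-D array-like."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy sample: nine ordinary values plus one extreme value.
sample = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 16.0, 95.0])

lower, upper = tukey_fences(sample)               # fences: 7.875 and 18.875
is_outlier = (sample < lower) | (sample > upper)  # only 95.0 is flagged

# Replacement with the mean, as in remove_outlier: the mean is computed
# before replacement, so it still includes the extreme value.
mean_filled = np.where(is_outlier, sample.mean(), sample)

# One possible alternative (an assumption, not the notebook's method):
# replace with the median, which the extreme value barely affects.
median_filled = np.where(is_outlier, np.median(sample), sample)
```

In this toy case the mean is 21.1, which is itself above the upper fence of 18.875, so the replaced value would still count as an outlier on a second pass. The median (13.0) does not have that problem. Whether this matters for the back pain features depends on how extreme their outliers are.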
Feature Scaling:
Feature scaling through standardization (Z-score normalization) or min-max normalization can be an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, units, and range. Because many machine learning algorithms (k-nearest neighbors and k-means, for example) rely on the Euclidean distance between data points, features with larger magnitudes would dominate the distance computation. To avoid this effect, we need to bring all features to the same scale. This can be achieved with scikit-learn's MinMaxScaler, which rescales each feature to the range [0, 1]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)
scaled_df = pd.DataFrame(data = scaled_data, columns = X.columns)
scaled_df.head()
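For intuition, the following sketch applies the min-max formula by hand with NumPy and compares it with Z-score standardization. The matrix `toy` is a made-up example, not taken from the dataset. With its default `feature_range` of (0, 1), `MinMaxScaler` applies the same per-column formula, (x - min) / (max - min).

```python
import numpy as np

# Hypothetical toy matrix: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [4.0, 500.0]])

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# Z-score standardization for comparison: (x - mean) / std, per column.
zscore = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(manual)  # every column now spans exactly [0, 1]
print(zscore)  # every column now has mean 0 and standard deviation 1
```

Min-max scaling bounds every feature to the same interval but is sensitive to extreme values, which is one reason to handle outliers before scaling. Standardization does not bound the values but centers each feature at zero with unit variance.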