Lower back pain, also called lumbago, is not a disorder. It’s a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:
ligaments
muscles
nerves
the bony structures that make up the spine, called vertebral bodies or vertebrae
It can also be due to a problem with nearby organs, such as the kidneys.
According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you’re experiencing back pain.
In this Exploratory Data Analysis (EDA) I am going to use the Lower Back Pain Symptoms Dataset and try to find interesting insights in it.
# we use the Tukey method to remove outliers.
# whiskers are set at 1.5 times the interquartile range (IQR).
def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR  # the acceptable minimum value
    maximum = third_q + IQR  # the acceptable maximum value
    mean = X[feature].mean()
    # any value beyond the acceptable range is considered an outlier;
    # we replace the outliers with the mean value of that feature.
    X.loc[X[feature] < minimum, feature] = mean
    X.loc[X[feature] > maximum, feature] = mean

# take all the columns except the last one (the last column is the label)
X = dataset.iloc[:, :-1]
for i in range(len(X.columns)):
    remove_outlier(X.columns[i])
After removing outliers:
[Figure: feature distributions after removing outliers]
Feature Scaling:
Feature scaling through standardization (or Z-score normalization) is an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, units and range. Many machine learning algorithms rely on distances (such as the Euclidean distance) between data points in their computations, so features with large numerical ranges would dominate those with small ones. To avoid this effect, we need to bring all features to the same level of magnitude. This can be achieved by standardizing each feature: subtracting its mean and dividing by its standard deviation, so that every feature ends up with zero mean and unit variance.
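A minimal sketch of this step, assuming X is the outlier-cleaned feature DataFrame from above; the scaled result is stored as scaled_df, the name used in the training code further down:

from sklearn.preprocessing import StandardScaler
import pandas as pd

# fit the scaler on the features and keep the standardized result as a DataFrame
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)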
Certain algorithms, such as XGBoost, can only work with numerical values, so we need to encode our categorical target: the class column, which labels each record as Abnormal or Normal. LabelEncoder from the sklearn.preprocessing package encodes labels with integer values between 0 and n_classes - 1.
from sklearn.preprocessing import LabelEncoder

label = dataset["class"]
encoder = LabelEncoder()
label = encoder.fit_transform(label)
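To check which integer was assigned to which class, the fitted encoder exposes the original labels in encoder.classes_; the position of each label in that array is its integer code:

# classes_ is sorted alphabetically; the index of each label is its encoded value
print(list(encoder.classes_))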
Model Training and Evaluation:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = scaled_df
y = label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
pred_gnb = clf_gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_gnb)
# Out []: 0.8085106382978723
from sklearn.svm import SVC

clf_svc = SVC(kernel="linear")
pred_svc = clf_svc.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_svc)
# Out []: 0.7872340425531915
from xgboost import XGBClassifier

clf_xgb = XGBClassifier()
pred_xgb = clf_xgb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_xgb)
# Out []: 0.8297872340425532
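Accuracy alone hides which class the mistakes fall on. A quick way to see per-class precision and recall for the best-scoring model (a sketch, reusing the XGBoost predictions and the label encoder from above):

from sklearn.metrics import classification_report

# per-class precision, recall and F1 on the held-out test set
print(classification_report(y_test, pred_xgb, target_names=encoder.classes_))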