NEWSLETTER OF ZENITH GAVELS CLUB | DAZZLE 2022 | VOL 1 ISSUE 1
Prepared, Edited and Published by Venus Rinith - Newsletter Editor and VPPR 2021-2022
QR Code to read this news letter
We migrated all our employees to Exchange Online and decommissioned the on-prem Exchange servers that hosted the user databases. We retained one on-prem Exchange Server 2013 CU22 solely for SMTP application relay, and created a new on-prem database with 2 user mailboxes. However, we were not able to log in to the on-prem ECP with those user accounts.
There are no issues logging in to the online Exchange Admin Center.
What errors do you see?
:-( Something went wrong We can't get that information right now. Please try again later
Exchange Server 2013 CU22 on Windows Server 2012 R2.
Our emails (...domain) have been migrated to Exchange Online.
We use the on-prem Exchange server so that application servers hosted on Azure can relay mail through it. We noticed that the on-prem ECP wasn't accessible. The on-prem Exchange server has a single database and two mailbox accounts.
What have you tried to troubleshoot this?
Verified that the on-prem database and the 2 user accounts exist on the on-prem Exchange server, but we cannot log in via https://localhost/ecp.
ECP cannot be accessed: requests to localhost/ecp are redirected to Office 365.
We checked the HTTP Redirect settings on the Default Frontend and found none.
We checked the HTTP Redirect settings on ECP and found none.
We found an HTTP redirect configured for OWA that pointed to the Office 365 portal.
We unchecked that setting and were able to access ECP successfully.
The Windows security baseline recommends configuring a threshold of 10 invalid sign-in attempts.
Account lockout threshold (Windows 10) - Windows security | Microsoft Docs
There is a built-in tool called “Resultant Set of Policy” (RSoP) that simulates the policy settings applied to computers and users using Group Policy. It acts as a query engine that polls existing policies based on site, domain, domain controller, and organizational unit, and then reports the results of those queries.
To launch Resultant Set of Policy, press Win + R to fire up the Run dialog box, type rsop.msc, and press Enter.
The tool starts, scans the active policies, and displays them. You will still need to go through the folders to find each active policy applied to the account and computer.
GPResult
Alternatively, there is a command-line tool called GPResult that you can use to collect active Group Policy settings. Simply open a Command Prompt and run the following command.
gpresult /scope user /v
This searches for and shows all the active policies applied to the current user. To find all policies applied to the PC, run the following instead in an elevated Command Prompt window.
gpresult /scope computer /v
You can also use GPResult to gather the Group Policy information applied to a specific user account on a remote computer (the remote system is given with the /s switch):
gpresult /s computername /u username /p password /user targetusername /scope user /r
Or, all Group Policies applied to a remote computer:
gpresult /s computername /u username /p password /scope computer /r
Note that the /r switch displays RSoP summary data, while /v displays verbose policy information.
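For scripted audits, the switches described above can be assembled programmatically. The following is a minimal sketch, assuming Python is available on the admin workstation; `build_gpresult_args` is a hypothetical helper (not part of any Windows tooling), and it follows Microsoft's documented gpresult syntax in which the remote system is passed with /s. It only builds the argument list; actually running it requires Windows.

```python
import subprocess
import sys


def build_gpresult_args(scope, verbose=False, computer=None,
                        username=None, password=None, target_user=None):
    """Return the gpresult argument list for the given options (hypothetical helper)."""
    if scope not in ("user", "computer"):
        raise ValueError("scope must be 'user' or 'computer'")
    args = ["gpresult"]
    if computer:
        args += ["/s", computer]            # remote system to query
        if username:
            args += ["/u", username]        # account used to connect
            if password:
                args += ["/p", password]
    if target_user:
        args += ["/user", target_user]      # user whose RSoP data is shown
    args += ["/scope", scope]
    args.append("/v" if verbose else "/r")  # verbose output vs. summary
    return args


if __name__ == "__main__" and sys.platform == "win32":
    # Only meaningful on Windows, where gpresult exists.
    result = subprocess.run(build_gpresult_args("user", verbose=True),
                            capture_output=True, text=True)
    print(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with user names or passwords that contain spaces.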
Q No: 1
What is the final objective of Decision Tree?
Ans: Minimise the impurity of the leaf nodes
In a decision tree, after every split we hope to have less 'impurity' in the resulting nodes, so that eventually we end up with leaf nodes that have the least 'impurity' (entropy).
Q No: 2
Decision Trees can be used to predict
Ans: Both Continuous and Categorical Target Variables
Q No: 3
When we create a Decision Tree, how is the best split determined at each node?
Ans: We make all possible splits on the data using the independent variables and choose the split that gives the highest Gini gain.
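To make the split search concrete, here is a minimal single-feature sketch of choosing a threshold by Gini gain. The helper names `gini` and `best_threshold` and the toy data are made up for illustration; a real tree implementation repeats this search over every feature and is considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try every midpoint between sorted unique values; return (threshold, gain)."""
    parent = gini(labels)
    n = len(labels)
    uniq = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Impurity of the children, weighted by their share of the samples.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data (made up): one feature, two classes.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
threshold, gain = best_threshold(x, y)
print(threshold, gain)  # 6.5 0.5
```

The parent node has Gini 0.5 and the split at 6.5 produces two pure children, so the gain equals the full parent impurity.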
Q No: 4
Which of the following is not true about Decision Trees
Ans: Decision Trees tend to overfit the test data
Q No: 5
If we increase the value of the hyperparameter min_samples_leaf from the default value, we would end up getting a ______________ tree than the tree with the default value.
Ans: smaller
min_samples_leaf = the minimum number of samples required at a leaf node
As the number of observations required in the leaf node increases, the size of the tree would decrease
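A quick way to see this effect is to fit two trees and compare their node counts. This is a sketch on synthetic data from `make_classification` (not the quiz dataset); the value 20 for min_samples_leaf is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=1)

default_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
pruned_tree = DecisionTreeClassifier(min_samples_leaf=20,
                                     random_state=1).fit(X_demo, y_demo)

# With at least 20 samples per leaf, 300 samples allow at most 15 leaves.
print("default (min_samples_leaf=1):", default_tree.tree_.node_count, "nodes")
print("min_samples_leaf=20         :", pruned_tree.tree_.node_count, "nodes")
```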
Q No: 6
Which of the following is a perfectly impure node?
Ans: Node - 1
Gini = 0.5 at Node 1
gini = 0 -> Perfectly Pure
gini = 0.5 -> Perfectly Impure (for a two-class problem)
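The two extremes can be checked with the two-class Gini formula, 1 - (p^2 + (1 - p)^2), where p is the share of one class in the node. The function name `gini_binary` is made up for this sketch.

```python
def gini_binary(p):
    """Gini impurity of a two-class node where p is the share of class 1."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f} -> gini = {gini_binary(p):.3f}")
```

The value peaks at 0.5 when the classes are split evenly. With k classes the maximum is 1 - 1/k, so 0.5 is the ceiling only in the binary case.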
Q No: 7
In a classification setting, if we do not limit the size of the decision tree it will only stop when all the leaves are:
Ans: homogeneous
The tree will stop splitting after the impurity in every leaf is zero
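This can be verified on a fitted tree by inspecting the impurity of its leaves. A minimal sketch on synthetic data (assumed here, not the quiz data), using scikit-learn's `tree_` attributes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)

# No size limits: default max_depth=None, min_samples_leaf=1.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

is_leaf = full_tree.tree_.children_left == -1   # leaves have no children
leaf_impurity = full_tree.tree_.impurity[is_leaf]
print("leaves:", is_leaf.sum(), "max leaf impurity:", leaf_impurity.max())
```

Because every leaf is pure, the tree also classifies its own training data perfectly, which is why an unrestricted tree tends to overfit.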
Q No: 8
Which of the following explains pre-pruning?
Ans: We stop the decision tree from growing to its full depth by bounding the hyperparameters; this is known as pre-pruning.
Q No: 9
Which of the following is the same across Classification and Regression Decision Trees?
Ans: max_depth parameter
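As an illustration, scikit-learn's classifier and regressor trees both accept max_depth, while their default split criteria differ. The data below is synthetic and the depth of 3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic data, made up for illustration.
Xc, yc = make_classification(n_samples=200, n_features=5,
                             n_informative=3, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=5,
                         noise=10.0, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xc, yc)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)

# Same depth limit, different impurity measure.
print("classifier:", clf.get_depth(), clf.criterion)
print("regressor :", reg.get_depth(), reg.criterion)
```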
Q No: 10
Select the correct order in which a decision tree is built:
Ans: 4,1,5,2,3
Exploratory Data Analysis on Lower Back Pain
Lower back pain, also called lumbago, is not a disorder. It’s a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:
It can also be due to a problem with nearby organs, such as the kidneys.
According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you’re experiencing back pain.
In this Exploratory Data Analysis (EDA), I use the Lower Back Pain Symptoms Dataset and try to find interesting insights in it.
#pip install xgboost
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   pelvic_incidence          310 non-null    float64
 1   pelvic tilt               310 non-null    float64
 2   lumbar_lordosis_angle     310 non-null    float64
 3   sacral_slope              310 non-null    float64
 4   pelvic_radius             310 non-null    float64
 5   degree_spondylolisthesis  310 non-null    float64
 6   pelvic_slope              310 non-null    float64
 7   Direct_tilt               310 non-null    float64
 8   thoracic_slope            310 non-null    float64
 9   cervical_tilt             310 non-null    float64
 10  sacrum_angle              310 non-null    float64
 11  scoliosis_slope           310 non-null    float64
 12  Status                    310 non-null    object
dtypes: float64(12), object(1)
memory usage: 31.6+ KB
dataset["Status"].value_counts().sort_index().plot.bar()
dataset.corr()
plt.subplots(figsize=(12,8))
sns.heatmap(dataset.corr())
sns.pairplot(dataset, hue="Status")
Visualize Features with Histogram: A Histogram is the most commonly used graph to show frequency distributions.
dataset.hist(figsize=(15,12),bins = 20, color="#007959AA")
plt.title("Features Distribution")
plt.show()
Detecting and Removing Outliers
plt.subplots(figsize=(15,6))
dataset.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)
Remove Outliers:
import numpy as np

# We use the Tukey method to detect outliers.
# Whiskers are set at 1.5 times the interquartile range (IQR).
def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR  # the acceptable minimum value
    maximum = third_q + IQR  # the acceptable maximum value
    mean = X[feature].mean()
    # Any value beyond the acceptable range is considered an outlier.
    # We replace the outliers with the mean value of that feature.
    X.loc[X[feature] < minimum, feature] = mean
    X.loc[X[feature] > maximum, feature] = mean

# Take all the columns except the last one (the last column is the label).
# .copy() keeps the replacement from writing into a view of dataset.
X = dataset.iloc[:, :-1].copy()
for column in X.columns:
    remove_outlier(column)
After removing Outliers:
features distribution after removing outliers
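To see the Tukey rule at work on numbers that can be checked by hand, here is a small self-contained sketch. The values and the helper name `tukey_bounds` are made up for illustration; the replacement rule (outliers set to the feature mean) mirrors the one described above.

```python
import numpy as np
import pandas as pd


def tukey_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values with one obvious outlier.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
lower, upper = tukey_bounds(s)
outliers = (s < lower) | (s > upper)

cleaned = s.copy()
cleaned[outliers] = s.mean()  # replace outliers with the feature mean
print("fences:", lower, upper)
print(cleaned.tolist())
```

Here the quartiles are 2.25 and 4.75, so the fences are -1.5 and 8.5 and only 100 is flagged. Note that the mean used as the replacement (about 19.17) was computed with the outlier included and still lies above the upper fence; replacing with the median or clipping to the fences are possible alternatives if that matters.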
Feature Scaling:
Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, units and range. Because many machine learning algorithms use the Euclidean distance between two data points in their computations, features with larger magnitudes would dominate. To avoid this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling; the code below uses min-max scaling (MinMaxScaler) rather than standardization, mapping each feature to the range [0, 1].
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)
scaled_df = pd.DataFrame(data = scaled_data, columns = X.columns)
scaled_df.head()
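The difference between min-max scaling and Z-score standardization is easy to see on a tiny array. This is a sketch with made-up numbers, not the back pain data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales.
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0],
                 [4.0, 900.0]])

minmax = MinMaxScaler().fit_transform(data)    # each column mapped to [0, 1]
zscore = StandardScaler().fit_transform(data)  # each column: mean 0, std 1

print(minmax)
print(zscore.mean(axis=0), zscore.std(axis=0))
```

Min-max scaling preserves the shape of each distribution but is sensitive to extreme values, since the minimum and maximum define the range. Standardization does not bound the values but is less affected by a single extreme point.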
Load the dataset. Which variables have a correlation greater than or equal to 0.7 with the ‘pelvic_incidence’ variable?
Encode Status variable: Abnormal class to 1 and Normal to 0.
Split the data into a 70:30 ratio. What is the percentage of 0 and 1 classes in the test data (y_test)?
1: In a range of 0.1 to 0.2/ 0: In a range of 0.2 to 0.3
1: In a range of 0.5 to 0.6/ 0: In a range of 0.3 to 0.6
1: In a range of 0.6 to 0.7/ 0: In a range of 0.3 to 0.4
1: In a range of 0.7 to 0.8/ 0: In a range of 0.2 to 0.3
Ans:
1: In a range of 0.7 to 0.8
0: In a range of 0.2 to 0.3
dataset['Status'] = dataset['Status'].apply(lambda x: 1 if x=='Abnormal' else 0)
X = dataset.drop(['Status'], axis=1)
Y = dataset['Status']
#Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state = 1)
y_test.value_counts(normalize=True)
1    0.709677
0    0.290323
Name: Status, dtype: float64
Question3:
Which metric is the most appropriate metric to evaluate the model according to the problem statement?
Accuracy, Recall, Precision, F1 score
Ans: Recall
If the model predicts that a person does not have an abnormal spine when the person actually does, a person who needs treatment will be missed. Hence, reducing such false negatives is important, and recall is the metric that measures this.
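Recall is the share of truly abnormal cases that the model catches, TP / (TP + FN). A short sketch with made-up labels (1 = Abnormal, 0 = Normal) shows how false negatives pull it down:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = Abnormal, 0 = Normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)  # share of abnormal cases the model catches
print("false negatives:", fn)                        # 2
print("recall:", recall, recall_score(y_true, y_pred))  # 0.6 0.6
```

Two of the five abnormal cases are predicted as normal, so recall is 3 / 5 = 0.6 even though most predictions are correct.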
Question4:
Check for multicollinearity in the data and choose the variables that show high multicollinearity (VIF value greater than 5).
What is the minimum number of attributes we need to drop to remove multicollinearity (that is, to get every VIF value below 5)?
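One way to compute the VIF values is from the diagonal of the inverse correlation matrix, which equals 1 / (1 - R^2) for each variable regressed on the others. The sketch below uses synthetic columns (made up for illustration) and a hypothetical helper `vif_table`; statsmodels' `variance_inflation_factor` is a common alternative if that library is installed.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns, name="VIF")


# Synthetic data (made up): c is almost a + b, d is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + rng.normal(scale=0.1, size=500),
    "d": rng.normal(size=500),
})

vif = vif_table(demo)
print(vif.round(2))
print("VIF > 5:", list(vif[vif > 5].index))
```

In this toy case all three of a, b and c are flagged, yet dropping any one of them is enough to break the linear dependence. After each drop the VIF values should be recomputed, since they change for the remaining variables.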
Drop sacral_slope attribute and proceed to build a logistic regression model. Drop all the insignificant variables and keep only significant variables (p-value < 0.05).
How many significant variables are left in the final model excluding the constant?
Select the correct option for the following:
Train a decision tree model with default parameters and vary the depth from 1 to 8 (both values included) and compare the model performance at each value of depth
At depth = 1, the decision tree gives the highest recall among all the models on the training set.
At depth = 2, the decision tree gives the highest recall among all the models on the training set.
At depth = 5, the decision tree gives the highest recall among all the models on the training set.
At depth = 8, the decision tree gives the highest recall among all the models on the training set.
score_DT = []
for i in range(1,9):
    dTree = DecisionTreeClassifier(max_depth=i,criterion = 'gini', random_state=1)
    dTree.fit(X_train, y_train)
    pred = dTree.predict(X_train)
    case = {'Depth':i,'Recall':recall_score(y_train,pred)}
    score_DT.append(case)
print(score_DT)
[{'Depth': 1, 'Recall': 0.6875}, {'Depth': 2, 'Recall': 0.8888888888888888}, {'Depth': 3, 'Recall': 0.8888888888888888}, {'Depth': 4, 'Recall': 0.9583333333333334}, {'Depth': 5, 'Recall': 0.9652777777777778}, {'Depth': 6, 'Recall': 0.9930555555555556}, {'Depth': 7, 'Recall': 0.9861111111111112}, {'Depth': 8, 'Recall': 1.0}]
Plot the feature importance of the variables given by the model which gives the maximum value of recall on the training set in Q7. Which are the 2 most important variables respectively?
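A fitted scikit-learn tree exposes its importances through the `feature_importances_` attribute. The sketch below uses synthetic stand-in data with hypothetical column names f0..f5, so the ranking it prints says nothing about the back pain dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = DecisionTreeClassifier(max_depth=8, criterion="gini",
                               random_state=1).fit(X_demo, y_demo)

# Importances are normalized, so they sum to 1 across the features.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
top_two = importances.sort_values(ascending=False).head(2)
print(top_two)
# importances.sort_values().plot.barh() would draw the bar chart.
```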
Perform hyperparameter tuning for the Decision Tree using GridSearchCV.
Use the following list of hyperparameters and their values:
Maximum depth: [5,10,15, None], criterion: ['gini','entropy'], splitter: ['best','random']
Set cv = 3 in grid search
Set scoring = 'recall' in grid search
Which of the following statements is/are True?
A) GridSearchCV selects the max_depth as 10
B) GridSearchCV selects the criterion as 'gini'
C) GridSearchCV selects the splitter as 'random'
D) GridSearchCV selects the splitter as 'best'
E) GridSearchCV selects the max_depth as 5
F) GridSearchCV selects the criterion as 'entropy'
# Base classifier (random_state=1 is inferred from the fitted model's output below)
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': [5,10,15,None],
              'criterion' : ['gini','entropy'],
              'splitter' : ['best','random']
             }
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring='recall',cv=3)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1, splitter='random')
Question10:
Compare the model performance of a Decision Tree with default parameters and the tuned Decision tree built in Q9 on the test set.
Which of the following statements is/are True?
A and B
B and C
C and D
A and D