features. During the training process, grid search with cross-validation was used to tune the model hyper-parameters. Each machine learning algorithm has its own set of hyper-parameters, which control the training process and affect model performance. Grid search with cross-validation identifies the best hyper-parameters by evaluating every combination of the provided hyper-parameter values. Cross-validation randomly splits the training dataset into K groups (five, by default). In each iteration, one group serves as the validation dataset, while the remaining groups are used for training. A classifier with one combination of hyper-parameters is fitted to the training groups and evaluated on the validation group to obtain an evaluation score. After K iterations, the average evaluation score quantifies the performance of that hyper-parameter combination. By comparing the average evaluation scores across all hyper-parameter combinations, the best hyper-parameters can be selected, and the classifier with the best performance is retained as the trained classifier.
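As an illustration of this tuning workflow, the following sketch uses scikit-learn's GridSearchCV with a hypothetical TF-IDF plus logistic-regression pipeline and toy narratives; the actual features, classifier, and parameter grid used in this study may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy narratives standing in for the training dataset.
train_narratives = [
    "roof rock fell and struck miner", "piece of roof fell near the face",
    "roof fall occurred in the entry", "draw rock fell from the roof",
    "roof skin fell between bolts", "rib rolled out and pinned miner",
    "rib sloughed onto the walkway", "coal rib burst near the pillar",
    "rib rock fell on the miner's leg", "loose rib fell during bolting",
]
train_labels = ["roof fall"] * 5 + ["rib fall"] * 5

# Illustrative pipeline: TF-IDF features feeding a logistic regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical hyper-parameter grid; each combination is evaluated with
# 5-fold cross-validation (the scikit-learn default number of folds).
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_weighted")
search.fit(train_narratives, train_labels)

print(search.best_params_)            # best hyper-parameter combination
best_model = search.best_estimator_   # refit on the full training dataset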
PERFORMANCE EVALUATION
The performance of the trained model is assessed with the
testing data using evaluation metrics. Table 3 summarizes
multiple methods that can be used to evaluate the perfor-
mance of machine learning models, such as the confusion
matrix, precision, recall, and F1-score. A confusion matrix
for binary classification provides a summary of the pre-
dicted and the true labels of the narratives, showing the
counts of true positives, true negatives, false positives,
and false negatives. The F1-score combines precision and
recall into a single value, providing a balanced measure of
a model’s performance. The F1-score is particularly useful
when dealing with imbalanced datasets.
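As a minimal sketch (the label arrays are hypothetical), these metrics can be computed with scikit-learn's metrics module; for a multi-category problem such as this one, a weighted average combines the per-class scores:

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical integer-encoded true and predicted labels for five categories.
y_true = [4, 4, 4, 3, 2, 1, 0, 4, 3, 0]
y_pred = [4, 4, 3, 3, 2, 1, 4, 4, 3, 0]

print(confusion_matrix(y_true, y_pred))   # rows: true labels, columns: predictions
# average="weighted" weights each class's score by its support,
# which is appropriate for an imbalanced dataset.
print(precision_score(y_true, y_pred, average="weighted"))
print(recall_score(y_true, y_pred, average="weighted"))
print(f1_score(y_true, y_pred, average="weighted"))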
RESULTS AND DISCUSSION
The training dataset was used to train various machine
learning models and the testing dataset was used to evaluate
the performance of these models. Figure 5 shows the results
of the F1-score for some of the machine learning models
used in this study to classify ground-fall narratives into five
categories. The model performance and confusion matrix for only three models (Multinomial Naïve Bayes, Random Forest, and Logistic Regression) are compared in Table 4. The LabelEncoder within the scikit-learn library was used to encode the five ground-fall categories in a numerical format (Pedregosa et al. 2011). After encoding, 'roof fall' is represented by '4', 'rib fall' by '3', 'outburst' by '2', 'highwall' by '1', and 'face' by '0'. The leading diagonal of the confusion matrix contains the correct predictions for the ground-fall narratives. In general, model performance improves as the counts on the leading diagonal increase and the off-diagonal counts decrease.
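A minimal sketch of this encoding step (assuming the category strings match those used in the study):

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
categories = ["roof fall", "rib fall", "outburst", "highwall", "face"]
encoder.fit(categories)

# LabelEncoder assigns integers in alphabetical order of the class names:
# 'face' -> 0, 'highwall' -> 1, 'outburst' -> 2, 'rib fall' -> 3, 'roof fall' -> 4
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

The alphabetical ordering explains why 'face' receives 0 and 'roof fall' receives 4.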
Table 4 shows that the logistic regression model has
the highest F1-scores for each category and overall. In
general, the higher the F1-score, the better the model. As
shown in Table 4, the logistic regression model successfully
classified 98% of the roof-fall incidents, 85% of the rib-
fall incidents, 73% of the rock-outburst incidents, 78%
of the highwall-failure incidents, and 72% of the face-fall
incidents. According to Figure 4, about 88% of the inci-
dents were manually classified as “roof fall.” Hence, if no
machine learning model was used and the null hypothesis
was assumed such that all ground-fall incidents were classi-
fied as “roof fall,” the success rate would be 100% for roof
Table 3. Performance evaluation for a machine learning model

Confusion matrix    Counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
Precision           TP / (TP + FP)
Recall              TP / (TP + FN)
F1-score            2 × (Precision × Recall) / (Precision + Recall)
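As a quick worked illustration with hypothetical values, a precision of 0.80 and a recall of 0.90 give F1 = 2 × (0.80 × 0.90) / (0.80 + 0.90) = 1.44 / 1.70 ≈ 0.85.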