time-consuming, especially if thousands of these narratives
must be processed. Machine learning models are heavily
used in many engineering disciplines and applications. In
this study, the authors explored the capability of machine
learning to handle large datasets and to perform supervised
classification of ground-fall narratives.
FEATURE EXTRACTION
During feature extraction, the ground-fall narratives are
transformed into a numerical format that machine learn-
ing algorithms can process. This step is commonly known
as text vectorization. A common method for vectorizing the
text (ground-fall narratives) is Term Frequency-Inverse
Document Frequency (TF-IDF) weighting. A higher TF-IDF score
indicates that a term is more discriminative for a particular
document. The TF measures how often a term appears in a
document relative to the total number of words in that
document. The IDF measures the uniqueness of a term across
a collection of documents by penalizing terms that occur
frequently in the entire document collection and assigning
a higher weight to terms that appear in a smaller number of
documents. Terms that appear in many documents have a
lower IDF score, while terms that appear in few documents
have a higher IDF score.
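The TF and IDF definitions above can be illustrated with a small worked example. This is a minimal sketch that assumes the textbook form tf(t, d) = count(t, d) / len(d) and idf(t) = ln(N / df(t)); the three narratives are hypothetical and are not taken from the dataset. Library implementations, including the sklearn TfidfVectorizer, apply smoothed and normalized variants, so their values will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy narratives, invented for illustration only.
docs = [
    "roof fall struck miner",
    "rib fall struck miner",
    "roof fall in entry",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf(term, tokens):
    """Term count divided by the total number of words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Unsmoothed IDF; assumes the term occurs in at least one document."""
    return math.log(n_docs / doc_freq[term])


def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)


# "fall" occurs in every document, so its IDF is ln(3/3) = 0.
print(tfidf("fall", tokenized[0]))
# "rib" occurs in one document only, so it receives the largest weight.
print(tfidf("rib", tokenized[1]))
print(tfidf("roof", tokenized[0]))
```

In this toy collection the term shared by all narratives carries no weight, while the term unique to one narrative is weighted most heavily, which is the behavior described above.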
MODEL TRAINING DATA
The authors manually classified 8,013 ground-fall incidents
into five categories (Rashed et al. 2022). Figure 4 shows
the number of ground-fall incidents associated with each
ground-fall category. These classified incidents were
used to train and test the machine learning models. The
8,013 ground-fall events were split into a training dataset
with 70% of the data (5,609 cases) and a testing dataset with
the remaining 30% (2,404 cases). The TfidfVectorizer
in the sklearn library was used to vectorize the ground-fall
narratives in the training and testing datasets separately. The
n-gram range passed to the TfidfVectorizer was (1, 2), which
means that both single words (unigrams) and two-word sequences
(bigrams) were extracted from the narratives. The maximum
number of features used in this study was 2,000. The
maximum-features parameter controls the number of columns in the
TF-IDF matrix and directly affects the memory and
computational requirements of the algorithm.
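The split and vectorization settings described above can be sketched as follows. The narratives and labels are hypothetical placeholders, and the use of train_test_split with a fixed random_state is an assumption, since the paper does not state how the split was drawn. The sketch also assumes that the vectorizer is fitted on the training narratives and then applied to the testing narratives so that both matrices share one vocabulary; the paper states only that the two datasets were vectorized separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical placeholder narratives and labels (not from the dataset).
narratives = [
    "roof fall struck miner in entry",
    "rock fell from roof onto shuttle car",
    "roof fall between crosscuts",
    "rib rolled out and struck miner",
    "piece of rib fell on foot",
    "material fell from highwall onto truck",
    "face coal burst struck operator",
    "roof rock fell while bolting",
    "rib sloughed and hit leg",
    "draw rock fell from roof",
]
labels = [
    "roof fall", "roof fall", "roof fall", "rib fall", "rib fall",
    "highwall", "face", "roof fall", "rib fall", "roof fall",
]

# 70/30 split; random_state is an assumption made for reproducibility.
train_text, test_text, y_train, y_test = train_test_split(
    narratives, labels, test_size=0.30, random_state=0
)

# Unigrams and bigrams, capped at 2,000 features, as reported in the study.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=2000)

# Assumption: fit on the training narratives only, then reuse the fitted
# vocabulary and IDF weights to transform the testing narratives.
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

print(X_train.shape, X_test.shape)
```

Both matrices have one row per narrative and one column per retained unigram or bigram, so the max_features cap bounds the matrix width regardless of how large the vocabulary of the narratives is.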
Figure 4 shows that the dataset is imbalanced. An
imbalanced dataset is one in which the distribution
of the target classes is not uniform, so that one or
more classes have significantly fewer samples than others.
The roof fall cases are the dominant category in the dataset.
Imbalanced data can degrade the performance
of the machine learning model: the model tends to
be biased toward the dominant category because it is more
common in the dataset. Consequently, the machine learning
model would perform poorly on minority classes due
to their limited representation in the dataset. Since there
are very few cases in the “face,” “highwall,”
“outburst,” and “rib fall” categories in the dataset, these
minority classes were oversampled to balance the class distribution.
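The idea behind random oversampling can be sketched in plain Python. This is an illustrative stand-in and not the implementation in the Imbalanced-Learn library: the function random_oversample, the class counts, and the seed are all hypothetical. It duplicates randomly chosen minority-class samples, with replacement, until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random (with replacement)
    until every class has as many samples as the largest class.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical class counts; the real distribution is given in Figure 4.
labels = (["roof fall"] * 6 + ["rib fall"] * 2
          + ["face", "highwall", "outburst"])
samples = [f"narrative {i}" for i in range(len(labels))]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies of existing narratives, oversampling changes the class proportions seen by the classifier without adding new information about the minority classes.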
The RandomOverSampler within the Imbalanced-Learn
library was used for oversampling the minority classes in
this study (Lemaître et al., 2017). The next step is to train
a classifier using the labeled training data and the extracted
Figure 4. Ground-fall classifications associated in the 2010-2019 dataset