TEXT CLASSIFICATION USING MACHINE
LEARNING
Text classification is a task in natural language processing
(NLP) that involves automatically assigning predefined
categories or labels to text documents. In this study, the
text documents are the ground-fall narratives while the
predefined categories are the five ground-fall categories
(roof fall, rib fall, face fall, highwall failure, and outburst).
A supervised machine learning workflow for text
classification typically involves the five main steps shown in
Figure 3. Several Python packages were used to implement
the machine learning model for text classification, including
Pandas, NumPy, and scikit-learn (Pedregosa et al. 2011).
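The workflow in Figure 3 can be sketched with scikit-learn roughly as follows. This is a minimal illustration and not the implementation used in this study: the TF-IDF features, the logistic regression classifier, and the toy narratives are all assumptions made for the example, since the text above names only the packages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy narratives invented for illustration; these are not MSHA records.
narratives = [
    "roof fell between bolts and struck the miner",
    "piece of roof rock fell on the operator",
    "draw rock fell from the roof onto the bolter",
    "roof fall occurred in the crosscut",
    "rib rolled out and struck the employee on the leg",
    "coal rib sloughed off onto the miner",
    "rib fell while the miner was hanging cable",
    "section of rib rolled against the scoop",
]
labels = ["roof fall"] * 4 + ["rib fall"] * 4

# Hold out a stratified share of the labeled data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    narratives, labels, test_size=0.25, stratify=labels, random_state=0
)

# Feature extraction and model training chained in one pipeline.
# TF-IDF and logistic regression are assumed choices for this sketch.
model = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Model evaluation on the held-out narratives.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# Deployment: the fitted pipeline labels a previously unseen narrative.
new_label = model.predict(["rock fell from the roof"])[0]
```

Wrapping the vectorizer and classifier in a single `Pipeline` keeps the feature extraction fitted on the training split only, so the held-out narratives do not leak into the vocabulary.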
PREPROCESSING THE MSHA DATA
To improve the performance of the machine learning
models and to make the models more interpretable and
robust, it is important to preprocess the raw data before
feeding it to the machine learning models. Preprocessing
the data includes performing multiple tasks, such as data
cleaning, data reduction, and data transformation. Before
implementing the machine learning models on the MSHA
data, several preprocessing steps were carried out. First, the
MSHA dataset was filtered to retain only ground-fall
incidents (see Figure 1), and all other reported incidents
were excluded from the analysis. Second, the variable names
in the MSHA dataset were shortened to facilitate working
with the data in the Pandas library. Third, null values were
handled; for example, the few narratives that were empty
were excluded from the analysis. The data types of all
variables were checked to avoid unexpected errors when
conducting mathematical or statistical operations. The
data types for some variables were transformed to facilitate
working with the machine learning models. For example,
the data type of the manually assigned narrative categories
was transformed from a categorical data type to a numerical
data type, because the machine learning models operate on
numerical inputs and do not accept categorical values
directly. As shown in Figure 2, the length of the ground-fall
narratives varies, with a mean of 40 words and a maximum
of 92 words. Hence,
manually processing the MSHA narratives to extract
some information or to conduct a classification would be
[Figure 3 flowchart: Raw data -> Preprocessing the data -> Feature extraction -> Model training -> Model evaluation -> Model deployment]
Figure 3. Steps associated with supervised learning for text classification
Figure 2. Ground-fall narrative length (number of words) in MSHA dataset between 2010 and 2019
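The preprocessing steps described in this section (filtering to ground-fall incidents, shortening variable names, dropping empty narratives, checking data types, encoding the categories numerically, and measuring narrative length) might look as follows in Pandas. The column names, the short names, and the toy rows are hypothetical and serve only to illustrate the steps; they are not taken from the MSHA dataset.

```python
import pandas as pd

# Toy stand-in for the raw table; column names and rows are hypothetical.
raw = pd.DataFrame({
    "Incident Classification": [
        "ground fall", "machinery", "ground fall", "ground fall",
    ],
    "Incident Narrative": [
        "Roof fell between bolts and struck the miner.",
        "Employee caught hand in conveyor belt.",
        None,  # an empty narrative, to be excluded
        "Rib rolled out and struck the miner on the leg.",
    ],
    "Manual Ground-Fall Category": ["roof fall", None, "rib fall", "rib fall"],
})

# 1. Keep only ground-fall incidents.
df = raw[raw["Incident Classification"] == "ground fall"].copy()

# 2. Shorten the variable names (the short names are assumptions).
df = df.rename(columns={
    "Incident Classification": "incident",
    "Incident Narrative": "narrative",
    "Manual Ground-Fall Category": "category",
})

# 3. Exclude rows whose narrative is null or blank.
df = df.dropna(subset=["narrative"])
df = df[df["narrative"].str.strip() != ""]

# 4. Inspect the data types before any numerical operation.
column_types = df.dtypes

# 5. Encode the manually assigned category as integer codes.
df["category"] = df["category"].astype("category")
df["label"] = df["category"].cat.codes

# 6. Narrative length in words, the quantity summarized in Figure 2.
df["n_words"] = df["narrative"].str.split().str.len()
mean_words = df["n_words"].mean()
max_words = df["n_words"].max()
```

Integer codes from `cat.codes` follow the sorted order of the category names, so the mapping between code and category should be stored alongside the model if the predictions are to be translated back into ground-fall categories.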