through a series of rigorous steps. These steps included sta-
tistical comparisons before and after data imputation and
cross-validation techniques to confirm that the preprocess-
ing did not alter the data distribution. Specifically, key
statistical measures such as mean, median, variance, and
standard deviation were compared across the datasets to
ensure consistency.
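The before-and-after comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the column name `feed_p2o5`, the synthetic data, and the helper `compare_distributions` are assumptions introduced for the example, and median filling stands in for whichever imputation step is being checked.

```python
import numpy as np
import pandas as pd

def compare_distributions(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Relative change in summary statistics, per column, after preprocessing."""
    stats = ["mean", "median", "var", "std"]
    b = before.agg(stats)  # NaNs are skipped by default
    a = after.agg(stats)
    # Rows: data columns; columns: the four statistics
    return ((a - b) / b.abs()).T

# Synthetic example data (hypothetical column name)
rng = np.random.default_rng(0)
raw = pd.DataFrame({"feed_p2o5": rng.normal(10.0, 1.0, 500)})
raw.loc[rng.choice(500, 25, replace=False), "feed_p2o5"] = np.nan

imputed = raw.fillna(raw.median())
report = compare_distributions(raw, imputed)
print(report)
```

Values close to zero in `report` suggest the preprocessing left the distribution largely unchanged; a tolerance for "close" would have to be chosen per variable.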
The preprocessing of the flotation dataset then under-
went multiple iterations, each improving data quality
and addressing identified challenges. The initial iteration
focused on data integration and alignment, merging the PI
(one-minute intervals) and LI (two- to four-hour intervals)
datasets through linear interpolation and backward filling,
resulting in a cohesive time series. Missing values were
imputed using K-Nearest Neighbors (KNN), and non-uniformly
sampled data were interpolated to one-minute intervals to support
effective time-series analysis. Cross-correlation techniques
and recursive feature selection were tested for feature engi-
neering but ultimately excluded after evaluation showed
superior results without them.
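The alignment and imputation steps of the first iteration might look like the sketch below. The column names (`air_flow`, `feed_p2o5`), the sampling times, and the choice of `n_neighbors=5` are assumptions for illustration only; the paper does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# PI: one-minute process data with a simulated sensor gap (hypothetical column)
idx = pd.date_range("2023-01-01", periods=241, freq="min")
pi = pd.DataFrame({"air_flow": np.linspace(1.0, 2.0, 241)}, index=idx)
pi.iloc[10:15, 0] = np.nan

# LI: sparse lab data, sampled every one to two hours here (hypothetical column)
li = pd.DataFrame({"feed_p2o5": [10.0, 12.0, 11.0]}, index=idx[[60, 120, 240]])

# Put LI on the one-minute grid, interpolate linearly between samples,
# then backward-fill the rows that precede the first lab sample
li_minutely = li.reindex(pi.index).interpolate(method="linear").bfill()

merged = pi.join(li_minutely)

# Fill the remaining gaps from the most similar rows (KNN imputation)
imputer = KNNImputer(n_neighbors=5)
filled = pd.DataFrame(
    imputer.fit_transform(merged), index=merged.index, columns=merged.columns
)
```

Backward filling only affects the rows before the first lab sample; everything between two lab samples comes from the linear interpolation.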
In the second iteration, we improved consistency by
retaining only data within the optimal production
range identified from 2023 data, bounded by that range's minimum and maximum
values, and by removing outliers with the interquartile range (IQR) method. In the
final iteration, we replaced faulty data points with NaNs
and filled them in using the median, chosen for its stability
in the presence of outliers. We also conducted a statisti-
cal comparison to confirm that the modified data remained
consistent with the original distribution. This progression
resulted in a dataset with enhanced data quality and robust-
ness, balancing volume with reliability to support accurate
predictive modeling.
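As a rough sketch of the second and third iterations combined: values outside an operating range are marked as NaN, IQR outliers are marked as NaN, and the gaps are filled with the median. The function name `clean_series`, the range limits, and the conventional 1.5 × IQR fence are assumptions; the paper does not give the actual thresholds, and it is also assumed here that out-of-range values are discarded rather than clamped to the limits.

```python
import pandas as pd

def clean_series(s: pd.Series, lo: float, hi: float, k: float = 1.5) -> pd.Series:
    """Range filter, IQR outlier removal, then median fill."""
    # Values outside the operating range [lo, hi] become NaN
    in_range = s.where(s.between(lo, hi))

    # IQR fences computed on the in-range data
    q1, q3 = in_range.quantile([0.25, 0.75])
    iqr = q3 - q1
    kept = in_range.where(in_range.between(q1 - k * iqr, q3 + k * iqr))

    # Replace every NaN with the median of the retained values
    return kept.fillna(kept.median())

values = pd.Series([10, 11, 12, 11, 10, 12, 11, 500, -3, 11, 10, 12, 30], dtype=float)
cleaned = clean_series(values, lo=0.0, hi=100.0)
```

In this toy series, 500 and -3 fall outside the range, 30 is rejected by the IQR fence, and all three positions end up holding the median of the remaining values.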
Model Development
A Random Forest Regressor was chosen after testing
multiple models, including Extreme Gradient Boosting
Regressor (XGBoostRegressor), as it demonstrated superior
performance in handling non-linear relationships in high-
dimensional datasets. Grid search cross-validation was used
to optimize hyperparameters, improving the model’s pre-
dictive accuracy and reducing errors.
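A grid search over Random Forest hyperparameters can be sketched with scikit-learn as shown below. The parameter grid, the three-fold split, the MAE-based scoring, and the synthetic data are all assumptions made for the example; the paper does not report which values were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression problem standing in for the flotation data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, 300)

# Hypothetical grid; real searches usually cover more parameters
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_, -search.best_score_)
```

For time-ordered process data, passing a `TimeSeriesSplit` instance as `cv` may be a safer choice than shuffled or contiguous k-fold splits, since it avoids training on data that follows the validation window.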
Model Evaluation and Testing
The model’s performance was evaluated using Mean
Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE). As shown in the
residual plot in Figure 4, the residuals were small and
centered around zero, indicating no systematic bias in the predictions.
Additionally, feature importance analysis for the ten most
important features showed that variables such as
feed-in P2O5, CaO, SiO2, and PSD had the highest impact
on model predictions (Figure 5). This analysis indicates
that feed characteristics drive most of the model's predictive
accuracy.
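The evaluation metrics and the feature ranking can be computed along these lines. The feature names and the synthetic data are placeholders chosen for the example, not the study's dataset, and the importances are scikit-learn's impurity-based values, which may differ from the measure used by the authors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical feature names and synthetic data
names = ["feed_p2o5", "cao", "sio2", "psd", "noise"]
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, len(names)))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0.0, 0.05, 400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))

# Residuals: values scattered around zero suggest no systematic bias
residuals = y_test - pred

# Rank features by importance and keep at most the top ten
order = np.argsort(model.feature_importances_)[::-1][:10]
top = [(names[i], float(model.feature_importances_[i])) for i in order]
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f}")
print(top)
```

Plotting `residuals` against `pred` gives a residual plot of the kind shown in Figure 4, and a bar chart of `top` corresponds to Figure 5.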
RESULTS
The Flotation Prediction Model demonstrated notable
improvements in flotation recovery rates and chemical
optimization through real-time AI-based recommenda-
tions, achieving an average Mean Absolute Error (MAE)
of 0.0018 across validation and testing datasets (Table 1).
An evaluation with subject matter experts (SMEs) set a
key performance indicator: the percentage of data points
where the absolute difference between predicted and actual P2O5
values was less than 0.5. The model met this benchmark,
with more than 90% of data points within that tolerance, as shown in the Prediction Accuracy
Plot (Figure 6), supporting its practical reliability. Feature
importance analysis identified feed-in P2O5, CaO, SiO2,
and PSD as the most influential predictors, highlighting
the importance of monitoring feed characteristics for con-
sistent recovery rates. Residual analysis, with residuals cen-
tered around zero, further validated the model’s precision in
Figure 4. Residual plot
Figure 5. Feature importance bar chart