methods. On-policy reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), directly learn from and improve the policy used to make decisions; that is, they evaluate and improve the same policy that selects each action. In contrast, off-policy methods such as Deep Deterministic Policy Gradient (DDPG) learn a policy different from the one used to generate the data. This allows off-policy methods to learn from past experiences stored in a replay buffer, potentially making them more data-efficient because the stored transitions can be reused for multiple updates.
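The distinction can be illustrated with a short sketch. The library (Stable-Baselines3) and the stand-in environment below are assumptions made for illustration; the mill environment used in this work is not shown here.

```python
# Illustrative sketch only: assumes Stable-Baselines3 and uses Pendulum-v1 as a
# stand-in continuous-control task in place of the (custom) mill environment.
import gymnasium as gym
from stable_baselines3 import PPO, SAC

env = gym.make("Pendulum-v1")

# On-policy: PPO collects a fresh batch of rollouts with the current policy,
# updates on that batch, and then discards it.
on_policy = PPO("MlpPolicy", env, n_steps=2048)

# Off-policy: SAC stores every transition in a replay buffer and reuses the
# stored experience for many gradient updates, which can improve data efficiency.
off_policy = SAC("MlpPolicy", env, buffer_size=100_000)

on_policy.learn(total_timesteps=10_000)
off_policy.learn(total_timesteps=10_000)
```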
On the edge device, chosen for its suitability for industrial application, the training of each model took approximately 1 to 1.25 hours.
All models performed commendably, achieving mean rewards as high as 1,043, which is significant when compared with the maximum achievable reward of 1,400. This maximum reward is, however, a theoretically calculated value: it represents the ideal scenario in which the particle size consistently meets the target size, the mill operates at the maximum allowable feed rate, and the amount of reject material remains within acceptable limits. The value serves as a benchmark for evaluating the performance of the reinforcement learning algorithms against the best possible outcome under controlled conditions. Figure 4 exemplifies these results, showing the SAC algorithm’s training outcomes as a representative sample of all models.
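To make the benchmark concrete, the three criteria can be pictured as terms of a per-step reward. The function, weights, and shaping below are purely illustrative assumptions and are not the reward actually used in this study.

```python
def illustrative_step_reward(particle_size, target_size, feed_rate,
                             max_feed_rate, reject_fraction, reject_limit):
    """Hypothetical per-step reward combining the three criteria named above.
    The terms and their equal weighting are assumptions for illustration only:
    the ideal step (on-target size, maximum feed, acceptable reject) scores
    highest, and summing that ideal over an episode gives the theoretical
    maximum used as a benchmark."""
    size_term = max(0.0, 1.0 - abs(particle_size - target_size) / target_size)
    feed_term = min(feed_rate / max_feed_rate, 1.0)
    reject_term = 1.0 if reject_fraction <= reject_limit else 0.0
    return size_term + feed_term + reject_term
```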
The policy evaluation results are summarized in Table 1, which lists the training time, mean reward, standard deviation, and overall score (0.8 × Mean Reward − 0.2 × Standard Deviation of Reward) for each algorithm.
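As a check on this composite metric, the scores in Table 1 can be reproduced directly from the reported means and standard deviations:

```python
# Reproduce the Table 1 scores: score = 0.8 * mean_reward - 0.2 * std_reward.
results = {
    "PPO":  (816_739, 35_205),
    "A2C":  (780_319, 57_866),
    "DDPG": (1_019_611, 20_606),
    "SAC":  (1_027_977, 28_815),
    "TD3":  (966_746, 47_008),
}
for name, (mean, std) in results.items():
    print(f"{name}: {0.8 * mean - 0.2 * std:,.1f}")
# The printed values agree with the Score column of Table 1 to within rounding.
```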
During training, every 50 timesteps (equivalent to 50 minutes), the operational values, akin to different recipes, were altered to test the flexibility of the control algorithms. The adaptability and efficiency of these algorithms, particularly in responding to changes in operational targets, are exemplified in Figure 5, which contrasts the target product size with the actual product size achieved.
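One way to picture this evaluation protocol is as an environment wrapper that swaps the active recipe on a fixed schedule. The interface and attribute names below are assumptions for illustration, not the implementation used in this study.

```python
import random
import gymnasium as gym

class RecipeChangeWrapper(gym.Wrapper):
    """Illustrative wrapper that switches the operational targets ("recipes")
    every 50 steps, mirroring the evaluation protocol described above.
    The attribute name `target_product_size` and the candidate recipes are
    assumptions made for this sketch."""

    def __init__(self, env, recipes, period=50):
        super().__init__(env)
        self.recipes = recipes
        self.period = period
        self.steps = 0

    def step(self, action):
        if self.steps and self.steps % self.period == 0:
            # Switch to a new randomly chosen target product size.
            self.env.unwrapped.target_product_size = random.choice(self.recipes)
        self.steps += 1
        return self.env.step(action)
```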
Figure 4. Reward curve for Soft Actor-Critic over 200,000 epochs, illustrating an initial surge and logarithmic growth,
culminating in lower final rewards
Table 1. Policy evaluation results
Algorithm   Training Time   Mean Reward ± Std Reward   Score
PPO         1 hr 16 min     816,739 ± 35,205           646,350
A2C         1 hr 15 min     780,319 ± 57,866           612,682
DDPG        1 hr            1,019,611 ± 20,606         811,568
SAC         1 hr 16 min     1,027,977 ± 28,815         816,618
TD3         51 min          966,746 ± 47,008           763,995