[Figure: Mean Reward (y-axis) versus Training Iteration (x-axis). Legend: ES multi-GPU; GRPO (baseline).]