[Figure: Mean Reward (y-axis, 0.05 to 0.35) versus Training Iteration (x-axis, 0 to 20). Series: ES multi-GPU, GRPO (baseline).]