What is GRPO?
April 18, 2025
PPO can take advantage of a good initial policy, and its training is stable, so it can be used to fine-tune a pretrained LLM.
However, PPO needs to learn a separate baseline (value) function in order to compute the advantage estimate.
\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\ \hat{A}_{t} \right)\]
Here $C_{\epsilon}(r, \hat{A}) = \min\big(r\hat{A},\ \mathrm{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big)$ denotes PPO's clipped surrogate objective.
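For reference, a minimal NumPy sketch of the clipped surrogate term above; the function name, the per-token log-probability inputs, and the default $\epsilon = 0.2$ are illustrative choices, not from the paper:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of C_eps(ratio, A_hat) = min(ratio * A_hat, clip(ratio, 1-eps, 1+eps) * A_hat)."""
    ratio = np.exp(logp_new - logp_old)            # pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()   # quantity to maximize
```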
GRPO, used in DeepSeek (Shao et al., 2024), simply replaces the learned baseline with a normalization of rewards over a group of responses sampled for the same prompt:
\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\ \frac{r_{i}-\bar{r}}{\sigma(r)} \right)\]
Intuitively, this captures a signal similar to the advantage estimate, because it measures how good a response is relative to the other responses sampled from $\pi_{\theta_{old}}$ for the same prompt.
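A minimal sketch of that replacement, assuming one scalar reward per sampled response in a group; the helper name and the small stabilizing epsilon are my additions, not the paper's:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - mean(r)) / std(r) over one group of responses to a prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)        # eps guards against a zero-variance group

# e.g. binary rewards for 4 sampled answers to the same prompt:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))       # ~[+1, -1, -1, +1]
```

In the outcome-reward setting, each response's normalized reward is then used in place of $\hat{A}_{t}$ for every token of that response when evaluating the clipped objective above, so no value network needs to be trained.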
Reference
Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," arXiv preprint arXiv:2402.03300, 2024.