What is GRPO?
April 18, 2025
PPO can take advantage of a good initial policy, and its training is stable, so it can be used to fine-tune a pretrained LLM.
However, PPO needs to learn a separate baseline (value) function in order to compute the advantage estimate.
\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\ \hat{A}_{t} \right)\]
Here $C_{\epsilon}(r, \hat{A}) = \min\big(r\hat{A},\ \mathrm{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big)$ denotes PPO's clipped surrogate objective.
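For reference, a minimal NumPy sketch of the clipped surrogate term above; the function name, the per-token log-probability inputs, and the default $\epsilon = 0.2$ are illustrative choices, not from the paper:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of C_eps(ratio, A_hat) = min(ratio * A_hat, clip(ratio, 1-eps, 1+eps) * A_hat)."""
    ratio = np.exp(logp_new - logp_old)            # pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()   # quantity to maximize
```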
GRPO, used in DeepSeek (Shao et al., 2024), simply replaces the learned baseline with a normalization of rewards over a group of responses sampled for the same prompt:
\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\ \frac{r_{i}-\bar{r}}{\sigma(r)} \right)\]
Intuitively, this captures a signal similar to the advantage estimate, because it measures how good a response is relative to the other responses sampled from $\pi_{\theta_{old}}$ for the same prompt.
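A minimal sketch of that replacement, assuming one scalar reward per sampled response in a group; the helper name and the small stabilizing epsilon are my additions, not the paper's:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - mean(r)) / std(r) over one group of responses to a prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)        # eps guards against a zero-variance group

# e.g. binary rewards for 4 sampled answers to the same prompt:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))       # ~[+1, -1, -1, +1]
```

In the outcome-reward setting, each response's normalized reward is then used in place of $\hat{A}_{t}$ for every token of that response when evaluating the clipped objective above, so no value network needs to be trained.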
Reference
Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," arXiv preprint arXiv:2402.03300, 2024.