Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is GRPO?

April 18, 2025

PPO can take advantage of a good initial policy, and its training is stable, so it is well suited to fine-tuning a pretrained LLM.

However, PPO needs to learn a baseline (value) function in order to compute the advantage estimate $\hat{A}_{t}$:

\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\hat{A}_{t} \right)\]

where $C_{\epsilon}(x,\hat{A}) = \min\bigl(x\hat{A},\ \mathrm{clip}(x,\,1-\epsilon,\,1+\epsilon)\,\hat{A}\bigr)$ denotes PPO's clipped surrogate objective.
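As a rough sketch of how this objective is typically computed in code (the function and tensor names below are illustrative, not taken from any particular library), given per-token log-probabilities under the current and old policies and an advantage estimate:

import torch

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token clipped surrogate C_eps(ratio, A) =
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), averaged over tokens."""
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped).mean()   # objective to maximize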

GRPO, used in DeepSeek's models (Shao et al., 2024), simply replaces the learned baseline with a normalization of the rewards over a group of outputs sampled for the same prompt:

\[\max_{\theta} C_{\epsilon}\left( \frac{\pi_{\theta}(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})},\frac{r_{i}-\bar{r}}{\sigma(r)} \right)\]

where $r_{i}$ is the reward of the $i$-th output in the group, and $\bar{r}$ and $\sigma(r)$ are the mean and standard deviation of the group's rewards.

Intuitively, this captures a signal similar to the advantage estimate, because it measures how good an output is compared to the average of the outputs sampled from $\pi_{\theta_{old}}$ for the same prompt. A sketch of this group-relative advantage is shown below.
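A minimal sketch under these assumptions (one scalar reward per sampled completion of the same prompt; the names are illustrative):

import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize the rewards of the outputs
    sampled for one prompt to zero mean and unit standard deviation."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for 4 sampled answers to the same question.
adv = grpo_advantages([1.0, 0.0, 0.5, 1.0])
# Each value is used as the advantage for every token of that answer
# in the clipped objective above, with no learned value function.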


Reference

Shao, Z., et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," arXiv preprint arXiv:2402.03300, 2024.


Design and source code from Jon Barron's website