Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is PPO?

April 17, 2025

TRPO improves training stability and robustness compared to previous methods, but it is hard to implement: it solves a constrained optimization problem using second-order approximations of the KL constraint.

PPO gives an alternative, first-order approach to implementing the ideas of TRPO.

The first version of PPO converts TRPO's KL constraint into a penalty term inside the objective. Instead of doing a constrained optimization, we maximize

\[\max_{\theta}\mathbb{E}_{\tau\sim \theta_{old}}\left[ \sum_{t}\gamma^t{\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}}\hat{A}_{t} -\beta(D_{KL}(\pi_{\theta}(\cdot|s_{t})||\pi_{\theta_{old}}(\cdot|s_{t})) - \delta)\right]\]

This makes the implementation much simpler, since the objective can now be maximized with ordinary stochastic gradient ascent.
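As a rough sketch, the penalized objective could be computed as below. This is not an implementation from the paper; the variable names are assumptions, PyTorch is an arbitrary choice, and the per-state KL divergences are assumed to be precomputed from the two policy distributions.

```python
import torch

def kl_penalty_loss(logp_new, logp_old, advantages, kl, beta=0.01):
    """Negative KL-penalized surrogate, averaged over a batch (sketch).

    logp_new   : log pi_theta(a_t|s_t), requires grad
    logp_old   : log pi_theta_old(a_t|s_t), detached
    advantages : advantage estimates A_hat_t
    kl         : precomputed per-state KL(pi_theta(.|s_t) || pi_theta_old(.|s_t))
    beta       : penalty coefficient (the constant delta term is dropped,
                 since it does not affect the gradient)
    """
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), recovered from log-probs.
    ratio = torch.exp(logp_new - logp_old)
    # Penalized surrogate objective.
    objective = ratio * advantages - beta * kl
    # Return the negative mean so a minimizing optimizer performs ascent.
    return -objective.mean()
```

In the paper, the coefficient $\beta$ is not fixed: it is adapted after each policy update depending on whether the measured KL divergence lands above or below a target value.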

Another version is the clipped surrogate objective.

\[C_{\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})},\hat{A}_{t} \right)=\min\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t}, \text{clip}^{1+\epsilon}_{1-\epsilon}\left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})} \right)\hat{A}_{t} \right)\]

This objective still maximizes $\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{old}}(a_{t}\mid s_{t})}\hat{A}_{t}$, but only while the probability ratio stays inside a small region around 1.

If $\hat{A}_{t}>0$, the objective stops rewarding increases of the ratio beyond $1+\epsilon$, and if $\hat{A}_{t}<0$, it stops rewarding decreases of the ratio below $1-\epsilon$.
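A minimal sketch of this clipped loss, under the same assumed variable names as before (again not code from the paper):

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate C_epsilon, averaged over a batch (sketch)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting by the advantage.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Elementwise minimum gives a pessimistic bound; negate for minimization.
    return -torch.min(unclipped, clipped).mean()
```

The paper uses $\epsilon = 0.2$ as a default; taking the minimum of the clipped and unclipped terms makes the objective a lower bound on the unclipped surrogate.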

Because PPO keeps each update close to the current policy, it is well suited to fine-tuning a good initial policy, which is exactly the setting of training a pretrained LLM. PPO and its variants, such as GRPO (from DeepSeek), are widely used to train LLMs these days.


Reference

Schulman, J., et al. “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.


Design and source code from Jon Barron's website