Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is TRPO?

April 16, 2025

In policy gradient methods, we use each collected trajectory only once to estimate the $Q$ value and update the parameters, and then discard it.

\[\begin{align} \nabla_{\theta}J(\theta) & =\mathbb{E}\left[\sum_{t} \nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t})\gamma^t(\hat{Q}_{t}-V_{\phi}(s_{t})) \right] \end{align}\]
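As a concrete illustration, here is a minimal sketch of one such update in PyTorch, assuming a `policy` module that returns an action distribution and a `value_fn` baseline; the names, shapes, and helper structure are illustrative assumptions, not code from any particular library.

```python
import torch

def policy_gradient_update(policy, value_fn, optimizer,
                           states, actions, returns, gammas):
    """One policy-gradient step with a learned baseline (gammas[t] = gamma ** t).

    The batch comes from a single rollout of the *current* policy and is
    discarded after this one update.
    """
    log_probs = policy(states).log_prob(actions)       # log pi_theta(a_t | s_t)
    advantages = returns - value_fn(states).detach()   # Q_hat_t - V_phi(s_t)
    loss = -(gammas * log_probs * advantages).mean()   # negate: optimizers minimize

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # After this step the policy has changed, so the same batch is no
    # longer on-policy data for the updated parameters.
```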

How can we make better use of the collected trajectories and extract more information from them? If we reuse a trajectory after an update, it was generated by a distribution different from the current policy, so we need to apply importance sampling in our objective.
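For intuition, the small sketch below shows importance sampling in isolation: an expectation under a new distribution is estimated from samples of an old one by reweighting with the density ratio. The Gaussians and the test function are purely illustrative assumptions.

```python
import torch
from torch.distributions import Normal

q = Normal(0.0, 1.0)    # old distribution (think pi_theta_old): samples come from here
p = Normal(0.5, 1.0)    # new distribution (think pi_theta): we want E_p[f(x)]
f = lambda x: x ** 2    # arbitrary test function

x = q.sample((100_000,))                          # samples drawn under q only
weights = (p.log_prob(x) - q.log_prob(x)).exp()   # importance ratio p(x) / q(x)

is_estimate = (weights * f(x)).mean()             # importance-sampled estimate of E_p[f]
mc_estimate = f(p.sample((100_000,))).mean()      # direct Monte Carlo estimate, for comparison
print(is_estimate.item(), mc_estimate.item())     # the two values should be close
```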

First, consider the difference in the objective $J(\theta)=\mathbb{E}_{\tau\sim \theta}\left[\sum_{t}\gamma^{t}r_{t}\right]$ between $\theta$ and $\theta_{old}$.

\[\begin{align} J(\theta)-J(\theta_{old}) & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^tr_{t}-V^{\pi_{old}}(s_{0}) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}(\gamma^tr_{t}+\gamma^{t+1} V^{\pi_{old}}(s_{t+1})-\gamma^tV^{\pi_{old}}(s_{t})) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^t(Q^{\pi_{old}}(s_{t},a_{t})-V^{\pi_{old}}(s_{t})) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^tA^{\pi_{old}}(s_{t},a_{t}) \right] \\ \end{align}\]

The second line rewrites $-V^{\pi_{old}}(s_{0})$ as the telescoping sum $\sum_{t}(\gamma^{t+1}V^{\pi_{old}}(s_{t+1})-\gamma^{t}V^{\pi_{old}}(s_{t}))$, and the third line uses $\mathbb{E}\left[r_{t}+\gamma V^{\pi_{old}}(s_{t+1})\mid s_{t},a_{t}\right]=Q^{\pi_{old}}(s_{t},a_{t})$, which holds because both policies act in the same environment dynamics.

By importance sampling,

\[\begin{align} J(\theta)-J(\theta_{old}) =\mathbb{E}_{\tau\sim \theta, a_{t}'\sim \pi_{\theta_{old}}}\left[ \sum_{t}\gamma^t{\frac{\pi_{\theta}(a_{t}'|s_{t})}{\pi_{\theta_{old}}(a_{t}'|s_{t})}}A^{\pi_{old}}(s_{t},a_{t}') \right] \\ \end{align}\]

In TRPO, we additionally approximate the expectation over trajectories from $\theta$ by sampling trajectories from $\theta_{old}$ instead. This gives the surrogate objective used in TRPO.

\[K(\theta;\theta_{old})=\mathbb{E}_{\tau\sim \theta_{old}}\left[ \sum_{t}\gamma^t{\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}}A^{\pi_{old}}(s_{t},a_{t}) \right]\]
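A minimal sketch of how this surrogate can be evaluated on a batch collected under $\pi_{\theta_{old}}$, assuming the log-probabilities of the behavior policy were stored at collection time; the function and variable names are assumptions for illustration.

```python
import torch

def surrogate_objective(policy, states, actions, advantages, old_log_probs, gammas):
    """K(theta; theta_old): discounted, importance-weighted advantages.

    The batch (states, actions, advantages, old_log_probs, gammas) was
    collected once under pi_theta_old; only `policy` (pi_theta) changes
    between evaluations, so the same data can be reused.
    """
    new_log_probs = policy(states).log_prob(actions)   # log pi_theta(a_t | s_t)
    ratio = (new_log_probs - old_log_probs).exp()      # pi_theta / pi_theta_old
    return (gammas * ratio * advantages).mean()        # quantity to maximize
```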

This approximation matches $J(\theta)-J(\theta_{old})$ only up to first order around $\theta_{old}$. Therefore, we must keep $\pi_{\theta} \approx \pi_{\theta_{old}}$, which TRPO enforces with a constraint on the KL divergence between the policies.

\[D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\le \delta\]
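In practice, the paper relaxes this to the KL divergence averaged over the states visited by the old policy:

\[\bar{D}_{KL}(\theta_{old},\theta)=\mathbb{E}_{s\sim \pi_{\theta_{old}}}\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big) \right]\le \delta\]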

This forms a constrained optimization problem. TRPO solves it approximately by linearizing the surrogate objective and taking a quadratic approximation of the KL constraint, which yields a natural-gradient step computed with the conjugate gradient method and refined by a backtracking line search.
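Below is a rough sketch of that step, assuming `g` is the flattened gradient of the surrogate and `hvp(v)` returns the product of the KL Hessian (Fisher matrix) with a vector `v`; it is an outline of the idea under those assumptions, not the authors' implementation.

```python
import torch

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = torch.zeros_like(g)
    r = g.clone()            # residual g - H x (x starts at zero)
    p = g.clone()            # search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / p.dot(Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trust_region_step(g, hvp, delta):
    """Maximize g^T s subject to 0.5 * s^T H s <= delta."""
    s = conjugate_gradient(hvp, g)            # search direction  s ~= H^{-1} g
    sHs = s.dot(hvp(s))                       # quadratic form s^T H s
    scale = torch.sqrt(2.0 * delta / (sHs + 1e-8))
    return scale * s                          # candidate update for theta
```

In TRPO, this candidate step is then shrunk by a backtracking line search until the surrogate actually improves and the KL constraint is satisfied on the sampled states.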


Reference

Schulman, J., et al., “Trust Region Policy Optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.


Design and source code from Jon Barron's website