Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is TRPO?

April 16, 2025

In policy gradient methods, we use each collected trajectory only once to estimate the $Q$ value and update the parameters, and then discard it.

\[\begin{align} \nabla_{\theta}J(\theta) & =\mathbb{E}\left[\sum_{t} \nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t})\gamma^t(\hat{Q}_{t}-V_{\phi}(s_{t})) \right] \end{align}\]
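As a concrete illustration, here is a minimal sketch of one such update in PyTorch, assuming a `policy` module that returns an action distribution and a `value_fn` baseline; the names, shapes, and helper structure are illustrative assumptions, not code from any particular library.

```python
import torch

def policy_gradient_update(policy, value_fn, optimizer,
                           states, actions, returns, gammas):
    """One policy-gradient step with a learned baseline (gammas[t] = gamma ** t).

    The batch comes from a single rollout of the *current* policy and is
    discarded after this one update.
    """
    log_probs = policy(states).log_prob(actions)       # log pi_theta(a_t | s_t)
    advantages = returns - value_fn(states).detach()   # Q_hat_t - V_phi(s_t)
    loss = -(gammas * log_probs * advantages).mean()   # negate: optimizers minimize

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # After this step the policy has changed, so the same batch is no
    # longer on-policy data for the updated parameters.
```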

How can we make better use of the collected trajectories and extract more information from them? If we reuse a trajectory after an update, it was generated by a distribution different from the current policy, so we need to apply importance sampling in our objective.
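For intuition, the small sketch below shows importance sampling in isolation: an expectation under a new distribution is estimated from samples of an old one by reweighting with the density ratio. The Gaussians and the test function are purely illustrative assumptions.

```python
import torch
from torch.distributions import Normal

q = Normal(0.0, 1.0)    # old distribution (think pi_theta_old): samples come from here
p = Normal(0.5, 1.0)    # new distribution (think pi_theta): we want E_p[f(x)]
f = lambda x: x ** 2    # arbitrary test function

x = q.sample((100_000,))                          # samples drawn under q only
weights = (p.log_prob(x) - q.log_prob(x)).exp()   # importance ratio p(x) / q(x)

is_estimate = (weights * f(x)).mean()             # importance-sampled estimate of E_p[f]
mc_estimate = f(p.sample((100_000,))).mean()      # direct Monte Carlo estimate, for comparison
print(is_estimate.item(), mc_estimate.item())     # the two values should be close
```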

First, consider the difference in the objective $J(\theta)=\mathbb{E}_{\tau\sim \theta}\left[\sum_{t}\gamma^{t}r_{t}\right]$ between $\theta$ and $\theta_{old}$.

\[\begin{align} J(\theta)-J(\theta_{old}) & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^tr_{t}-V^{\pi_{old}}(s_{0}) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}(\gamma^tr_{t}+\gamma^{t+1} V^{\pi_{old}}(s_{t+1})-\gamma^tV^{\pi_{old}}(s_{t})) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^t(Q^{\pi_{old}}(s_{t},a_{t})-V^{\pi_{old}}(s_{t})) \right] \\ & =\mathbb{E}_{\tau\sim \theta}\left[ \sum_{t}\gamma^tA^{\pi_{old}}(s_{t},a_{t}) \right] \\ \end{align}\]

The second line rewrites $-V^{\pi_{old}}(s_{0})$ as the telescoping sum $\sum_{t}(\gamma^{t+1}V^{\pi_{old}}(s_{t+1})-\gamma^{t}V^{\pi_{old}}(s_{t}))$, and the third line uses $\mathbb{E}\left[r_{t}+\gamma V^{\pi_{old}}(s_{t+1})\mid s_{t},a_{t}\right]=Q^{\pi_{old}}(s_{t},a_{t})$, which holds because both policies act in the same environment dynamics.

By importance sampling,

\[\begin{align} J(\theta)-J(\theta_{old}) =\mathbb{E}_{\tau\sim \theta, a_{t}'\sim \pi_{\theta_{old}}}\left[ \sum_{t}\gamma^t{\frac{\pi_{\theta}(a_{t}'|s_{t})}{\pi_{\theta_{old}}(a_{t}'|s_{t})}}A^{\pi_{old}}(s_{t},a_{t}') \right] \\ \end{align}\]

In TRPO, we additionally approximate the expectation over trajectories from $\theta$ by sampling trajectories from $\theta_{old}$ instead. This gives the surrogate objective used in TRPO.

\[K(\theta;\theta_{old})=\mathbb{E}_{\tau\sim \theta_{old}}\left[ \sum_{t}\gamma^t{\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}}A^{\pi_{old}}(s_{t},a_{t}) \right]\]
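A minimal sketch of how this surrogate can be evaluated on a batch collected under $\pi_{\theta_{old}}$, assuming the log-probabilities of the behavior policy were stored at collection time; the function and variable names are assumptions for illustration.

```python
import torch

def surrogate_objective(policy, states, actions, advantages, old_log_probs, gammas):
    """K(theta; theta_old): discounted, importance-weighted advantages.

    The batch (states, actions, advantages, old_log_probs, gammas) was
    collected once under pi_theta_old; only `policy` (pi_theta) changes
    between evaluations, so the same data can be reused.
    """
    new_log_probs = policy(states).log_prob(actions)   # log pi_theta(a_t | s_t)
    ratio = (new_log_probs - old_log_probs).exp()      # pi_theta / pi_theta_old
    return (gammas * ratio * advantages).mean()        # quantity to maximize
```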

This approximation matches $J(\theta)-J(\theta_{old})$ only up to first order around $\theta_{old}$. Therefore, we must keep $\pi_{\theta} \approx \pi_{\theta_{old}}$, which TRPO enforces with a constraint on the KL divergence between the policies.

\[D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\le \delta\]
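In practice, the paper relaxes this to the KL divergence averaged over the states visited by the old policy:

\[\bar{D}_{KL}(\theta_{old},\theta)=\mathbb{E}_{s\sim \pi_{\theta_{old}}}\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big) \right]\le \delta\]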

This forms a constrained optimization problem. TRPO solves it approximately by linearizing the surrogate objective and taking a quadratic approximation of the KL constraint, which yields a natural-gradient step computed with the conjugate gradient method and refined by a backtracking line search.
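Below is a rough sketch of that step, assuming `g` is the flattened gradient of the surrogate and `hvp(v)` returns the product of the KL Hessian (Fisher matrix) with a vector `v`; it is an outline of the idea under those assumptions, not the authors' implementation.

```python
import torch

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = torch.zeros_like(g)
    r = g.clone()            # residual g - H x (x starts at zero)
    p = g.clone()            # search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / p.dot(Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trust_region_step(g, hvp, delta):
    """Maximize g^T s subject to 0.5 * s^T H s <= delta."""
    s = conjugate_gradient(hvp, g)            # search direction  s ~= H^{-1} g
    sHs = s.dot(hvp(s))                       # quadratic form s^T H s
    scale = torch.sqrt(2.0 * delta / (sHs + 1e-8))
    return scale * s                          # candidate update for theta
```

In TRPO, this candidate step is then shrunk by a backtracking line search until the surrogate actually improves and the KL constraint is satisfied on the sampled states.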


Reference

Schulman, J., et al., “Trust Region Policy Optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.


Design and source code from Jon Barron's website