What is DDPG?
April 19, 2025
DDPG is an actor-critic method for continuous action spaces that learns a Q-function as its critic. It enables off-policy training, which gives it high sample efficiency.
Instead of using a stochastic policy, consider a deterministic policy $a=\mu_{\theta}(s)$. Then, the deterministic policy gradient is computed as
\[\begin{align} J(\mu_{\theta}) & =\mathbb{E}_{s\sim \rho^{\mu}} [Q^{\mu}(s,\mu_{\theta}(s))]\\ \nabla_{\theta} J(\mu_{\theta}) & =\mathbb{E}_{s\sim \rho^{\mu}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{\mu_{\theta}(s)}Q^{\mu}(s,\mu_{\theta}(s))] \\ & =\mathbb{E}_{s\sim \rho^{\mu}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}] \end{align}\]Since the gradient is an expectation over states only, not over actions generated by the current policy, off-policy training from replayed transitions is possible in practice. This gives DDPG its high sample efficiency.
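As a concrete illustration, below is a minimal PyTorch sketch (not the paper's implementation) of the actor update implied by this gradient. The network sizes, learning rate, and batch of states are illustrative assumptions; autograd applies the chain rule $\nabla_{\theta}\mu_{\theta}(s)\,\nabla_{a}Q(s,a)$ automatically once we backpropagate through $a=\mu_{\theta}(s)$.

```python
# Minimal sketch of the deterministic policy gradient (actor) update.
# Dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1  # assumed toy dimensions

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())       # mu_theta(s)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                          # Q(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

states = torch.randn(32, state_dim)            # batch of states s ~ rho^mu
actions = actor(states)                        # a = mu_theta(s)

# Maximizing E[Q(s, mu_theta(s))] == minimizing its negative; backprop
# through a = mu_theta(s) yields grad_theta mu * grad_a Q automatically.
actor_loss = -critic(torch.cat([states, actions], dim=-1)).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```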
In the overall algorithm, random noise is added to the deterministic action for exploration. The Q-function (critic) is updated similarly to DQN, by minimizing
\[\begin{align} L=\mathbb{E}[(Q(s_{t},a_{t})-r_{t}-\gamma Q_{\text{target}}(s_{t+1},\mu_{\text{target}}(s_{t+1})))^{2}] \end{align}\]Then, we update the actor policy with the deterministic policy gradient. The target networks are updated by Polyak averaging.
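Continuing the sketch above, here is how the critic update, exploration noise, and Polyak averaging of the target networks might look. The replay-buffer batch, discount $\gamma$, coefficient $\tau$, and noise scale are illustrative assumptions, and terminal-state masking is omitted to match the loss above.

```python
# Minimal sketch of the DDPG critic update with target networks,
# exploration noise, and Polyak averaging (hyperparameters are assumptions).
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1
gamma, tau = 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Exploration: perturb the deterministic action with Gaussian noise.
def explore(state, noise_std=0.1):
    with torch.no_grad():
        a = actor(state) + noise_std * torch.randn(action_dim)
    return a.clamp(-1.0, 1.0)

action = explore(torch.randn(state_dim))  # example call while acting

# A fake replay-buffer batch (s, a, r, s') for illustration.
s = torch.randn(32, state_dim)
a = torch.rand(32, action_dim) * 2 - 1
r = torch.randn(32, 1)
s_next = torch.randn(32, state_dim)

# Critic target: r + gamma * Q_target(s', mu_target(s')).
with torch.no_grad():
    q_next = critic_target(torch.cat([s_next, actor_target(s_next)], dim=-1))
    y = r + gamma * q_next
critic_loss = ((critic(torch.cat([s, a], dim=-1)) - y) ** 2).mean()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Polyak averaging: target <- (1 - tau) * target + tau * online.
with torch.no_grad():
    for net, target in [(actor, actor_target), (critic, critic_target)]:
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```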
Reference
Lillicrap, T. P., et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.