Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is DDPG?

April 19, 2025

DDPG is an actor-critic method for continuous action spaces that learns a Q-function as its critic. It enables off-policy training, which gives it high sample efficiency.

Instead of using a stochastic policy, consider a deterministic policy $a=\mu_{\theta}(s)$. The deterministic policy gradient is then

\[\begin{align} J(\mu_{\theta}) & =\mathbb{E}_{s\sim \rho^{\mu}} [Q^{\mu}(s,\mu_{\theta}(s))]\\ \nabla_{\theta} J(\mu_{\theta}) & =\mathbb{E}_{s\sim \rho^{\mu}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{\mu_{\theta}(s)}Q^{\mu}(s,\mu_{\theta}(s))] \\ & =\mathbb{E}_{s\sim \rho^{\mu}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}] \end{align}\]

Since the expectation is taken only over states, not over actions sampled from the current policy, transitions collected by a different behavior policy can be reused, making off-policy training possible. This is what gives DDPG its high sample efficiency.
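In practice, this gradient does not need to be assembled by hand: passing the actor's action through the critic and letting automatic differentiation apply the chain rule yields exactly the expression above. Below is a minimal PyTorch-style sketch; the network sizes, the names `actor` and `critic`, and the placeholder batch of states are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

# Toy actor and critic for a 3-dim state and 1-dim action (hypothetical sizes).
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, 3)        # batch of states s ~ rho^mu (placeholder data)
actions = actor(states)            # a = mu_theta(s)
q_values = critic(torch.cat([states, actions], dim=-1))

# Maximizing E[Q(s, mu_theta(s))] is the same as minimizing its negative;
# backward() applies the chain rule
# grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)}.
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```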

In the overall algorithm, random noise is added to the actions during data collection for exploration. The Q-function (the critic) is updated similarly to DQN:

\[\begin{align} L=\mathbb{E}\left[\left(Q(s_{t},a_{t})-\left(r_{t}+\gamma\, Q_{\text{target}}(s_{t+1},\mu_{\text{target}}(s_{t+1}))\right)\right)^{2}\right] \end{align}\]
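As a sketch of this update, reusing the `actor` and `critic` modules from the snippet above, with `copy.deepcopy` creating the target networks and a random placeholder batch standing in for the replay buffer (termination masking is omitted to match the loss as written):

```python
import copy
import torch
import torch.nn.functional as F

gamma = 0.99                               # discount factor (typical value)
actor_target = copy.deepcopy(actor)        # target networks start as copies
critic_target = copy.deepcopy(critic)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Placeholder transition batch (s_t, a_t, r_t, s_{t+1}); in practice this is
# sampled from the replay buffer, where a_t already includes exploration noise.
s, a = torch.randn(32, 3), torch.rand(32, 1) * 2 - 1
r, s_next = torch.randn(32, 1), torch.randn(32, 3)

with torch.no_grad():                      # the TD target is not differentiated
    a_next = actor_target(s_next)
    td_target = r + gamma * critic_target(torch.cat([s_next, a_next], dim=-1))

critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=-1)), td_target)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```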

Then, we update the actor policy with the deterministic policy gradient. The target networks are updated by Polyak averaging.
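Continuing the snippets above, a sketch of the soft target update and of the noisy action used during data collection; `tau = 0.005` and the Gaussian noise scale are common choices assumed here, not values from the post.

```python
import torch

tau = 0.005  # Polyak coefficient (assumed value)

# Soft update: theta_target <- tau * theta + (1 - tau) * theta_target
with torch.no_grad():
    for net, net_target in [(actor, actor_target), (critic, critic_target)]:
        for p, p_target in zip(net.parameters(), net_target.parameters()):
            p_target.mul_(1 - tau).add_(tau * p)

# Exploration during data collection: perturb the deterministic action.
# (The original paper uses an Ornstein-Uhlenbeck process; Gaussian noise
# is a common simplification.)
state = torch.randn(1, 3)
noisy_action = (actor(state) + 0.1 * torch.randn(1, 1)).clamp(-1.0, 1.0)
```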


Reference

Lillicrap, T., et al., “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.


Design and source code from Jon Barron's website