What is SAC?
April 20, 2025
SAC is an off-policy actor-critic method based on maximum entropy RL. For a stochastic policy, adding an entropy term to the objective encourages the agent's exploration.
The following objective is typically used in maximum entropy RL.
\[J(\pi)=\sum^T_{t=0}\mathbb{E}_{(s_{t}, a_{t})\sim \rho_{\pi}}[r(s_{t},a_{t})+ \beta H(\pi(\cdot\mid s_{t}))]\]
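To make the objective concrete, here is a minimal sketch (my own illustration, not the paper's code) that estimates this entropy-regularized return from a single sampled trajectory, assuming discrete actions so the entropy has a closed form; the function and variable names are hypothetical.

```python
import numpy as np

# Minimal sketch (assumed discrete-action setting): Monte-Carlo estimate of the
# maximum entropy objective for one sampled trajectory.
def max_entropy_return(rewards, policy_probs, beta=0.2):
    """rewards: (T,) rewards r(s_t, a_t); policy_probs: (T, A) rows pi(.|s_t)."""
    # Shannon entropy H(pi(.|s_t)) at each visited state.
    entropy = -np.sum(policy_probs * np.log(policy_probs + 1e-8), axis=1)
    # Sum of reward plus entropy bonus over the trajectory.
    return float(np.sum(rewards + beta * entropy))

# Hypothetical 3-step trajectory with 2 actions.
print(max_entropy_return(np.array([1.0, 0.5, -0.2]),
                         np.array([[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]])))
```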
Like standard actor-critic methods, SAC alternates between policy evaluation and policy improvement. However, because of the maximum entropy objective, the Bellman update is defined differently. The soft Bellman operator for the Q-function is defined through the soft value function:
\[\begin{align} T^\pi Q(s,a) & =r(s,a)+\gamma \mathbb{E}_{s'\sim P}[V_{soft}(s')] \\ V_{soft}(s') & = \mathbb{E}_{a'\sim \pi(\cdot \mid s')}[Q(s',a')-\beta \log \pi(a'\mid s')] \\ \end{align}\]

Repeatedly applying this operator is guaranteed to converge to the soft Q-function of the policy.
Soft policy evaluation is done by bootstrapping using this soft Bellman operator.
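As a rough sketch of that bootstrapped update (assuming discrete actions and a tabular Q-function, a simplification of the paper's function-approximation setting; the names below are my own), the backup target for one transition can be computed as:

```python
import numpy as np

# Minimal sketch of one soft Bellman backup target, assuming discrete actions:
# target = r + gamma * V_soft(s'), with
# V_soft(s') = E_{a'~pi}[Q(s', a') - beta * log pi(a'|s')].
def soft_backup_target(r, q_next, pi_next, beta=0.2, gamma=0.99, done=False):
    """q_next: (A,) values Q(s', .); pi_next: (A,) probabilities pi(.|s')."""
    v_soft = np.sum(pi_next * (q_next - beta * np.log(pi_next + 1e-8)))
    return r + (0.0 if done else gamma * v_soft)

# Hypothetical transition: reward 1.0, two actions at the next state.
print(soft_backup_target(1.0, q_next=np.array([2.0, 1.5]),
                         pi_next=np.array([0.6, 0.4])))
```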
The optimal policy for a fixed Q-function is the softmax over Q-values, also known as the Boltzmann distribution.
\[\begin{align} V^{*}(s') & = \max_{\pi} \mathbb{E}_{a' \sim \pi} \left[ Q(s',a') - \beta \log \pi(a'\mid s') \right] \\ \pi^*(a'\mid s') & = \frac{ \exp\left( Q(s',a') / \beta \right) }{ Z(s')} \end{align}\]

Therefore, soft policy improvement is done by updating the actor to be close to this softmax distribution of the Q-function.
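For a discrete action space this softmax policy can be written down directly. The sketch below is just an illustration of that formula (the function name and temperature value are assumptions), using a max-shift for numerical stability.

```python
import numpy as np

# Minimal sketch, assuming discrete actions: the Boltzmann policy
# pi*(a|s) = exp(Q(s, a) / beta) / Z(s).
def boltzmann_policy(q_values, beta=0.2):
    logits = q_values / beta
    logits = logits - logits.max()      # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()          # normalize by Z(s)

# Hypothetical Q-values at one state with three actions.
print(boltzmann_policy(np.array([2.0, 1.5, 0.0])))
```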
\[\begin{align} \arg\min_{\pi}D_{KL}\left( \pi(\cdot\mid s)\parallel \frac{\exp(Q^{\pi_{k}}(s,\cdot)/\beta)}{Z(s)} \right) \end{align}\]
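As an illustration of this improvement step (again assuming discrete actions; in the actual continuous-action SAC algorithm the KL is minimized with the reparameterization trick rather than evaluated in closed form), the divergence being minimized at one state looks like:

```python
import numpy as np

# Minimal sketch, assuming discrete actions: the soft policy improvement
# objective D_KL(pi(.|s) || exp(Q(s, .)/beta) / Z(s)) at a single state.
def policy_improvement_kl(pi, q_values, beta=0.2):
    logits = q_values / beta
    # Stable log-softmax: logits - logsumexp(logits).
    log_target = logits - (logits.max() + np.log(np.sum(np.exp(logits - logits.max()))))
    return float(np.sum(pi * (np.log(pi + 1e-8) - log_target)))

# Hypothetical current policy and Q-values at one state.
print(policy_improvement_kl(np.array([0.5, 0.3, 0.2]),
                            np.array([2.0, 1.5, 0.0])))
```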
Reference
Haarnoja, T., et al., "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, 2018, pp. 1861–1870.