What is Safe RL?
May 07, 2025
When we train or deploy an RL agent in the real world, safety becomes a crucial consideration. To train a safe RL agent, the problem is typically formulated as a Constrained MDP (CMDP), which adds a safety constraint to the usual objective. On top of a vanilla MDP $(S, A, P, \gamma, R)$, there is an additional cost function $C(s, a, s') \ge 0$ that penalizes unsafe regions. The safety constraint can be defined in different ways using this cost, for example as the expected cumulative cost or as a risk measure such as Conditional Value at Risk (CVaR).
\[\begin{align} J_{C} & = \sum_{t}\gamma^{t}C(s_{t}, a_{t}, s_{t+1}) \\ Q^{\pi}_{risk} & = \mathbb{E}^{\pi}\left[ J_{C} \right]\le \epsilon \\ \text{CVaR}_{\alpha}(J_{C}) & = \mathbb{E}^{\pi}\left[ J_{C}\mid J_{C} \ge \text{VaR}_\alpha(J_{C}) \right]\le \epsilon \end{align}\]

The goal of a CMDP is to maximize return while satisfying the constraint.
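To make these quantities concrete, here is a minimal Monte-Carlo sketch in Python; the function names and the random placeholder costs are my own illustration, not taken from any of the cited papers. It estimates both the expected cost return and its CVaR from sampled trajectories.

```python
import numpy as np

def discounted_cost_return(costs, gamma=0.99):
    """Compute J_C = sum_t gamma^t * C(s_t, a_t, s_{t+1}) for one trajectory."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))

def cvar(cost_returns, alpha=0.9):
    """CVaR_alpha: mean of the cost returns at or above the alpha-quantile (VaR)."""
    cost_returns = np.asarray(cost_returns)
    var = np.quantile(cost_returns, alpha)          # VaR_alpha(J_C)
    tail = cost_returns[cost_returns >= var]        # worst (1 - alpha) fraction
    return tail.mean()

# Monte-Carlo estimates of both constraint quantities (placeholder random costs)
cost_trajs = [np.random.rand(100) * 0.1 for _ in range(1000)]
J = [discounted_cost_return(c) for c in cost_trajs]
print("E[J_C]   =", np.mean(J))
print("CVaR_0.9 =", cvar(J, alpha=0.9))
```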
\[\begin{align} \max_{\pi}\ \mathbb{E}^{\pi}\left[\sum_{t}\gamma^{t}R_{t}\right]\quad \text{s.t.}\quad \mathcal{S}\left( \sum_{t}\gamma^{t}C_{t} \right)\leq \epsilon \end{align}\]

where $\mathcal{S}$ is one of the risk measures above. One way to solve a CMDP is to train a safety monitoring module, or safety critic, that evaluates whether a state-action pair is potentially unsafe. For example, Recovery RL trains a risk critic $\hat{Q}_{risk}$ from offline data and uses it during online exploration: when the safety critic predicts that the proposed action is too risky, the agent executes a recovery policy instead of the task policy, steering back toward safe states. A rough sketch of this gating logic is below; `task_policy`, `recovery_policy`, `q_risk`, and the threshold `eps_risk` are hypothetical placeholders standing in for the learned components, not the authors' implementation.
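```python
def select_action(state, task_policy, recovery_policy, q_risk, eps_risk=0.1):
    """Recovery-RL-style action gating (sketch).

    If the safety critic predicts that the task policy's proposed action
    incurs too much expected future cost, fall back to the recovery policy.
    """
    a_task = task_policy(state)
    if q_risk(state, a_task) <= eps_risk:   # predicted risk within budget
        return a_task
    return recovery_policy(state)           # steer the agent back to safety
```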
Another way to solve a CMDP is to perform the constrained optimization on the policy itself. Similar to the idea of TRPO, we can approximate the problem within a trust region, linearizing the objective and the cost constraint so that it becomes a small convex program. Alternatively, we can convert the constraint into a penalty term in the objective via the Lagrangian method. As a sketch of the Lagrangian route, assuming episode-level return and cost estimates and a simple projected dual-ascent update on the multiplier (all names and step sizes below are illustrative):
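```python
def lagrangian_step(ep_return, ep_cost, lam, eps=0.1, lam_lr=0.01):
    """One step of the Lagrangian relaxation (sketch).

    The policy is trained to maximize the penalized return R - lam * J_C,
    while the multiplier lam is increased when the cost constraint is
    violated and decreased (but kept non-negative) when it is satisfied.
    """
    penalized_return = ep_return - lam * ep_cost      # signal fed to the policy update
    lam = max(0.0, lam + lam_lr * (ep_cost - eps))    # projected dual ascent on lam
    return penalized_return, lam
```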
References
K. Srinivasan, B. Eysenbach, S. Ha, J. Tan, and C. Finn, "Learning to be safe: Deep RL with a safety critic," arXiv preprint arXiv:2010.14603, 2020.
B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg, "Recovery RL: Safe reinforcement learning with learned recovery zones," IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4915–4922, 2021.