Joonkyu Min

I am an undergraduate student at Seoul National University, majoring in Electrical and Computer Engineering. I am currently an intern at CLVR Lab, where I work on robot safety under the supervision of Joseph J. Lim.

Email  /  GitHub  /  LinkedIn


What is an MDP?

April 10, 2025

An MDP, or Markov Decision Process, is a model of a sequential decision problem that satisfies the Markov property: the next state depends only on the current state and action, not on the earlier history.

The basic components of an MDP are as follows (a small code sketch of these components appears after the list):

  • state: $S$
  • action: $A$
  • transition model: $P(s'\mid s,a)$
  • reward: $r(s, a, s')$
  • discount factor: $\gamma \in[0,1)$, which gives smaller weight to rewards further in the future
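
As a concrete illustration of these components, here is a minimal Python sketch of a hypothetical two-state, two-action MDP; the state and action names and all of the numbers are made up for this example.

```python
# Minimal sketch of the MDP components above for a hypothetical
# two-state, two-action problem; all names and numbers are illustrative.

states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor in [0, 1)

# Transition model P(s' | s, a), stored as P[s][a][s'] = probability
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# Reward r(s, a, s'), stored as R[s][a][s']
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 0.0}, "move": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 0.5}, "move": {"s0": 0.0, "s1": 0.0}},
}
```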

In an MDP, the state is assumed to be fully observable.

The setting in which the states are not fully observable is called a POMDP (Partially Observable MDP); a POMDP can be converted into an MDP over belief states, i.e., probability distributions over the underlying states.
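
For reference, the belief can be updated with Bayes' rule after taking action $a$ and receiving observation $o$; the observation model $O(o\mid s',a)$ used here is not part of the MDP components above and is introduced only for this sketch.

\[ b'(s') \propto O(o \mid s', a) \sum_{s\in S} P(s' \mid s, a)\, b(s) \]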

The MDP is the standard problem formulation assumed in Reinforcement Learning (RL). The main goal of RL is to find the optimal control that maximizes the expected sum of rewards.

In other words, we find a policy $\pi:S\rightarrow A$ that maximizes the following objective, the discounted sum of rewards (also called the return). The bound below follows from the geometric series $\sum^\infty_{i=0}\gamma^i = \frac{1}{1-\gamma}$, since each reward is at most $\sup r$.

\[\begin{align} G^\pi_t &= \sum^\infty_{i=0}\gamma^i r_{t+i+1} \le \frac{\sup r}{1-\gamma} \\ \pi^* &= \arg \max_{\pi}\,\mathbb{E}\left[G^\pi_t \mid s_t = s\right] \end{align}\]

The policy $\pi^*$ is called the optimal policy.
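
As a sketch of how such an optimal policy can actually be computed, here is a minimal value-iteration loop (a standard dynamic-programming method, not covered in this post) that reuses the toy `states`, `actions`, `P`, `R`, and `gamma` from the sketch above.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Compute optimal state values and a greedy policy for a finite MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup:
            # V(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V(s'))
            q = [sum(P[s][a][sp] * (R[s][a][sp] + gamma * V[sp]) for sp in states)
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function
    pi = {s: max(actions,
                 key=lambda a: sum(P[s][a][sp] * (R[s][a][sp] + gamma * V[sp])
                                   for sp in states))
          for s in states}
    return V, pi


V, pi = value_iteration(states, actions, P, R, gamma)
print(pi)  # e.g. {'s0': 'move', 's1': 'stay'} for the toy MDP above
```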




Design and source code from Jon Barron's website