What is an MDP?
April 10, 2025
An MDP (Markov Decision Process) is a model for sequential decision problems that satisfies the Markov property: the next state depends only on the current state and action, not on the full history.
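Formally, the Markov property says that conditioning the transition distribution on the whole history collapses to conditioning on the current state and action:

\[P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t, a_t)\]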
The basic components of an MDP are the following (a small code sketch follows the list):
- state: $S$
- action: $A$
- transition model: $P(s'\mid s,a)$
- reward: $r(s, a, s')$
- discount factor: $\gamma \in [0,1)$, which gives smaller weight to rewards received further in the future
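As a concrete illustration, here is a minimal sketch of these components as a tiny two-state MDP in Python. The states, actions, transition probabilities, and rewards are made up for the example, not part of any standard benchmark.

```python
# A tiny hypothetical MDP with two states and two actions.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor

# Transition model P(s' | s, a): maps (s, a) to a dict of next-state probabilities.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Reward r(s, a, s'): in this toy example it depends only on the landing state.
def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0
```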
In an MDP, the states are assumed to be fully observable. The setting in which states are only partially observable is called a POMDP, which can be converted into an MDP by replacing states with belief states (probability distributions over states).
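A belief state is updated by Bayes' rule after taking an action and receiving an observation. Below is a minimal sketch of this update, assuming a discrete observation model given as a dict `O` mapping `(o, s')` pairs to likelihoods $P(o \mid s')$; the observation model and its values are hypothetical, reusing the toy MDP above.

```python
def belief_update(b, a, o, P, O, states):
    """Bayes-filter update of a belief b (dict: state -> probability)
    after taking action a and receiving observation o."""
    new_b = {}
    for s_next in states:
        # Predict: propagate the belief through the transition model.
        pred = sum(P[(s, a)].get(s_next, 0.0) * b[s] for s in states)
        # Correct: weight by the likelihood of the observation.
        new_b[s_next] = O[(o, s_next)] * pred
    z = sum(new_b.values())  # normalizing constant
    return {s: p / z for s, p in new_b.items()}

# Hypothetical observation model P(o | s'), keyed by (o, s').
O = {("bright", "s0"): 0.7, ("bright", "s1"): 0.3,
     ("dark", "s0"): 0.3, ("dark", "s1"): 0.7}
b = belief_update({"s0": 0.5, "s1": 0.5}, "move", "bright", P, O, STATES)
```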
The MDP is the basic problem formulation in Reinforcement Learning (RL). The main goal of RL is to find the optimal control that maximizes the expected sum of rewards.
In other words, we look for a policy $\pi: S \rightarrow A$ that maximizes the following utility objective, the discounted sum of rewards:
\[\begin{align} G^\pi(s) &= \mathbb{E}\left[\sum^\infty_{i=0}\gamma^i r_{t+i+1} \,\middle|\, s_t = s\right] \le \frac{\sup r}{1-\gamma} \\ \pi^* &= \arg \max_{\pi} G^\pi(s) \end{align}\]

The policy $\pi^*$ is called the optimal policy. Note that when rewards are bounded, the geometric series $\sum_i \gamma^i$ guarantees the return is finite, which is why $\gamma < 1$ matters.
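To make this concrete, here is a sketch of standard value iteration on the toy MDP defined earlier: it repeatedly applies the Bellman optimality backup until the values converge, then reads off a greedy policy. The tolerance and printed output are illustrative.

```python
def value_iteration(states, actions, P, reward, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup
    V(s) <- max_a sum_{s'} P(s'|s,a) [r(s,a,s') + gamma * V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = {
                a: sum(p * (reward(s, a, s2) + gamma * V[s2])
                       for s2, p in P[(s, a)].items())
                for a in actions
            }
            v_new = max(q.values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    pi = {
        s: max(actions, key=lambda a: sum(
            p * (reward(s, a, s2) + gamma * V[s2])
            for s2, p in P[(s, a)].items()))
        for s in states
    }
    return V, pi

V, pi = value_iteration(STATES, ACTIONS, P, reward, GAMMA)
print(pi)  # e.g. {'s0': 'move', 's1': 'stay'}
```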