Techniques & Methods

Markov Decision Process

An MDP formalizes decision-making as a tuple of states, actions, transition probabilities, and rewards, usually together with a discount factor. At each step the agent observes the current state, takes an action, moves to a new state according to the transition probabilities, and receives a reward. The goal is to find a policy, a mapping from states to actions, that maximizes expected cumulative reward.
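As a rough illustration of the definition above, the sketch below encodes a tiny MDP as a transition table and solves it with value iteration, one standard way to compute a reward-maximizing policy when the transition probabilities are known. The two states, the action names, the probabilities, the rewards, and the discount factor GAMMA are all made up for this example; only the overall structure reflects the definition.

```python
from typing import Dict, List, Tuple

# Hypothetical two-state MDP, invented for illustration.
# P[state][action] is a list of (probability, next_state, reward) outcomes.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

P: Transitions = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(0.9, "high", -1.0), (0.1, "low", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 1.0)],
        "work": [(0.7, "high", 3.0), (0.3, "low", 3.0)],
    },
}
GAMMA = 0.9  # assumed discount factor


def q_value(P: Transitions, V: Dict[str, float], s: str, a: str, gamma: float) -> float:
    """Expected return of taking action a in state s, then following values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P: Transitions, gamma: float = GAMMA, tol: float = 1e-8):
    """Repeat Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break
    # The greedy policy picks, in each state, the action with the highest value.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


values, policy = value_iteration(P)
print(policy)
```

With these made-up numbers the greedy policy chooses "charge" in the low state and "work" in the high state. Changing the rewards or probabilities in P changes the resulting policy, which is the sense in which the tuple fully specifies the decision problem.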

MDPs provide the mathematical foundation for most of reinforcement learning, model-based and model-free alike. The Markov property, that the next state depends only on the current state and action and not on the earlier history, is the key simplifying assumption that makes MDPs tractable.
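Written out, the Markov property says that conditioning on the full history adds nothing beyond the current state and the action taken. The notation here is an assumption of this sketch rather than something fixed by the entry: S_t is the state at step t, A_t is the action at step t, and s' is a candidate next state.

```latex
\Pr\left(S_{t+1} = s' \mid S_t = s,\ A_t = a,\ S_{t-1},\ A_{t-1},\ \ldots,\ S_0,\ A_0\right)
  = \Pr\left(S_{t+1} = s' \mid S_t = s,\ A_t = a\right)
```

Because of this, a policy only needs to look at the current state, which is what keeps planning and learning methods manageable.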

Authority Links

Markov Decision Process — Wikipedia

Mathematical definition and properties of MDPs.

IBM — Reinforcement Learning

How MDPs formalize the reinforcement learning problem.

Related Terms

Techniques & Methods

Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties.

Techniques & Methods

Proximal Policy Optimization (PPO)

RL algorithm that stabilizes training by constraining the size of each policy update.

Techniques & Methods

Offline Reinforcement Learning

Learning optimal policies from fixed historical datasets without interacting with a live environment.

Model Components

Reward Models

Models trained to score AI outputs based on human preferences for use in reinforcement learning.
