Techniques & Methods
Markov Decision Process
An MDP formalizes sequential decision-making as a tuple of states, actions, transition probabilities, and rewards, usually together with a discount factor. At each step the agent observes the current state, takes an action, moves to a new state according to the transition probabilities, and receives a reward. The goal is to find a policy, a mapping from states to actions, that maximizes expected cumulative reward.
MDPs provide the mathematical foundation for reinforcement learning, both model-based and model-free. The Markov property, that the next state depends only on the current state and action and not on the earlier history, is the key simplifying assumption that makes MDPs tractable.
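To make the tuple concrete, here is a minimal sketch of a two-state MDP solved with value iteration, one standard planning method for cases where the transition probabilities and rewards are fully known. The state names ("A", "B"), action names ("stay", "go"), the numbers in `toy_mdp`, and the discount factor `gamma=0.9` are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# transitions[state][action] -> list of (probability, next_state, reward)
Transitions = Dict[State, Dict[Action, List[Tuple[float, State, float]]]]


def action_value(outcomes: List[Tuple[float, State, float]],
                 values: Dict[State, float], gamma: float) -> float:
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-8):
    """Repeat Bellman optimality backups until values change by less than tol."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update
        if delta < tol:
            break
    # Greedy policy: in each state, pick the action with the highest value.
    policy = {
        s: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for s, actions in transitions.items()
    }
    return values, policy


# Hypothetical toy MDP: only staying in "B" pays a reward.
toy_mdp: Transitions = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)],  # "go" succeeds 80% of the time
    },
    "B": {
        "stay": [(1.0, "B", 1.0)],
        "go": [(1.0, "A", 0.0)],
    },
}

values, policy = value_iteration(toy_mdp)
print(policy)
print(values)
```

Because each transition list depends only on the current state and action, the sketch relies directly on the Markov property: no history needs to be stored to compute the backup. In reinforcement learning proper, the transition probabilities are typically unknown and must be estimated or bypassed through sampled experience.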
Related Terms
Techniques & Methods
Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties.
Techniques & Methods
Proximal Policy Optimization (PPO)
Policy-gradient RL algorithm that stabilizes training by limiting how far each update can move the policy.
Techniques & Methods
Offline Reinforcement Learning
Learning optimal policies from fixed historical datasets without interacting with a live environment.
Model Components
Reward Models
Models trained to score AI outputs based on human preferences for use in reinforcement learning.

