Techniques & Methods
Proximal Policy Optimization (PPO)
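PPO is a policy gradient algorithm for reinforcement learning that improves training stability by limiting how far each update can move the policy. Its most common variant clips the probability ratio between the new and old policies, which removes the incentive to change the policy drastically in a single step. This "proximal" constraint keeps learning stable without requiring the second-order optimization used by earlier trust-region methods such as TRPO.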
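The clipping idea can be illustrated with a minimal sketch of the clipped surrogate objective. The function name `ppo_clip_objective`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration, not part of any particular library; the inputs are per-sample log-probabilities under the new and old policies plus advantage estimates.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the sampled actions under
    the current and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates for the same samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between new and old policy for this sample.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound, so
        # moving the ratio outside the interval earns no extra credit.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio of 1.5 contributes only as much as a ratio of 1.2 would, so there is no gain from pushing the policy further in that step.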
PPO is one of the most widely used RL algorithms in RLHF for aligning large language models (LLMs). After a reward model scores the model's outputs, PPO updates the language model's policy to increase the probability of generating high-reward responses.
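As a toy illustration of that loop, the sketch below treats the policy as a softmax over three canned responses and takes a few gradient-ascent steps on the expected clipped objective. Everything here is an assumption made for illustration: the reward scores are made up, the names `softmax` and `ppo_step` are hypothetical, and the advantage is simply the reward minus the mean reward under the old policy. Practical RLHF pipelines operate on token-level policies and typically add pieces omitted here, such as a learned value function and a KL penalty against a reference model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ppo_step(logits, old_probs, rewards, clip_eps=0.2, lr=0.5):
    """One gradient-ascent step on the expected clipped objective."""
    probs = softmax(logits)
    # Baseline: expected reward under the old (data-collecting) policy.
    baseline = sum(p * r for p, r in zip(old_probs, rewards))
    grad = [0.0] * len(logits)
    for i, (p_new, p_old, r) in enumerate(zip(probs, old_probs, rewards)):
        adv = r - baseline
        ratio = p_new / p_old
        # When the clipped term is the active one, the gradient is zero.
        if (adv > 0 and ratio > 1 + clip_eps) or (adv < 0 and ratio < 1 - clip_eps):
            continue
        for j in range(len(logits)):
            indicator = 1.0 if i == j else 0.0
            # d(ratio_i)/d(logit_j) = ratio_i * (indicator - probs[j])
            grad[j] += p_old * adv * ratio * (indicator - probs[j])
    return [x + lr * g for x, g in zip(logits, grad)]

# Hypothetical reward-model scores for three candidate responses.
rewards = [0.1, 0.9, 0.4]
logits = [0.0, 0.0, 0.0]
old_probs = softmax(logits)

# Several update steps against the same old policy.
for _ in range(4):
    logits = ppo_step(logits, old_probs, rewards)

new_probs = softmax(logits)
```

After these steps the highest-scored response becomes more likely and the lowest-scored one less likely, while clipping stops any single response from contributing further gradient once its probability ratio leaves the allowed interval.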
Related Terms
Techniques & Methods
Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties.
Techniques & Methods
Reinforcement Learning from Human Feedback (RLHF)
Training technique that refines AI models using feedback from human evaluators on output quality.
Model Components
Reward Models
Models trained to score AI outputs based on human preferences for use in reinforcement learning.
Techniques & Methods
Offline Reinforcement Learning
Learning optimal policies from fixed historical datasets without interacting with a live environment.

