Techniques & Methods

Proximal Policy Optimization (PPO)

PPO is a policy gradient algorithm for reinforcement learning that improves training stability by clipping the probability ratio between the new and old policies, which removes the incentive to change the policy too drastically in a single update. This "proximal" constraint keeps learning stable without the second-order optimization that earlier trust-region methods rely on.
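
To make the clipping concrete, here is a minimal sketch of the clipped surrogate objective in plain Python. The function name `ppo_clip_objective`, its arguments, and the default `clip_eps=0.2` are illustrative assumptions rather than any particular library's API. A real implementation would typically use tensors and automatic differentiation.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    All names and the default clip_eps are assumptions for illustration.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(nl - ol)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above `1 + clip_eps` earns no extra objective value, so there is no gain from pushing the policy further in that step. With a negative advantage the same holds for ratios below `1 - clip_eps`.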

PPO is the RL algorithm most commonly used in RLHF for aligning large language models (LLMs). After a reward model scores the model's outputs, PPO updates the language model's policy to increase the probability of generating high-reward responses.
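
The effect of such an update can be illustrated with a toy sketch: a "policy" that chooses among three fixed candidate responses, with made-up scores standing in for a reward model. Everything here (the function names, the rewards, the learning rate, the step count) is a hypothetical simplification. Real RLHF pipelines operate on token sequences and include components omitted here, such as a value function.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ppo_bandit_update(logits, rewards, clip_eps=0.2, lr=0.1, steps=50):
    """Gradient ascent on the clipped objective for a toy 'choose a response' policy.

    logits: one score per candidate response (the policy parameters).
    rewards: hypothetical reward-model scores, one per response.
    Returns (old_probs, new_probs).
    """
    old = softmax(logits)
    # Advantage = reward minus the expected reward under the old policy.
    baseline = sum(p * r for p, r in zip(old, rewards))
    adv = [r - baseline for r in rewards]
    theta = list(logits)
    n = len(theta)
    for _ in range(steps):
        new = softmax(theta)
        grad = [0.0] * n
        for a in range(n):
            ratio = new[a] / old[a]
            # Once the ratio leaves the clip range in the favorable
            # direction, this response contributes no gradient.
            if (adv[a] > 0 and ratio > 1 + clip_eps) or \
               (adv[a] < 0 and ratio < 1 - clip_eps):
                continue
            for k in range(n):
                indicator = 1.0 if a == k else 0.0
                # d/d theta_k of new[a] * adv[a] for a softmax policy.
                grad[k] += adv[a] * new[a] * (indicator - new[k])
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return old, softmax(theta)

# Hypothetical scores for three candidate responses: good, neutral, bad.
old_probs, new_probs = ppo_bandit_update([0.0, 0.0, 0.0], [1.0, 0.0, -1.0])
```

In this sketch the probability of the highest-scoring response rises and that of the lowest-scoring one falls, and the updates stop once the probability ratios move outside the clip range.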

Authority Links

PPO Paper — arXiv

Original OpenAI paper introducing Proximal Policy Optimization.

RL — Wikipedia

Overview of the PPO algorithm and its applications.

Related Terms

Techniques & Methods

Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties.

Techniques & Methods

Reinforcement Learning from Human Feedback (RLHF)

Training technique that refines AI models using feedback from human evaluators on output quality.

Model Components

Reward Models

Models trained to score AI outputs based on human preferences for use in reinforcement learning.

Techniques & Methods

Offline Reinforcement Learning

Learning optimal policies from fixed historical datasets without interacting with a live environment.
