Kubnal Bridge

Proximal Policy Optimization (PPO)

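PPO is a policy gradient algorithm for reinforcement learning that stabilizes training by limiting how far the policy can move in a single update. It does this by clipping the probability ratio between the new and old policies in its objective, so an update that would push the ratio outside a small interval around 1 gains no additional objective value. This "proximal" constraint keeps learning stable without the second-order optimization that trust-region methods such as TRPO require.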
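To make the clipping concrete, here is a minimal sketch of the clipped surrogate objective in plain Python. The function name `ppo_clip_objective`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, not part of any particular library.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the sampled actions under
    the current and the data-collecting (old) policy. All names here are
    hypothetical and chosen for illustration only.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, raising the ratio above `1 + clip_eps` no longer increases the objective, which removes the incentive for an overly large update. With a negative advantage, the same holds once the ratio drops below `1 - clip_eps`.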

PPO is one of the most widely used RL algorithms in RLHF (reinforcement learning from human feedback) for aligning large language models (LLMs). After a reward model scores the model's outputs, PPO updates the language model's policy to increase the probability of generating high-reward responses.
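The update described above can be illustrated with a deliberately tiny toy, assuming a "policy" that is just a softmax over three candidate responses and a hypothetical reward model that has already assigned each response a score. Real RLHF operates on token sequences with a neural policy; the names `softmax`, `ppo_step`, `rewards`, and the learning rate are assumptions made for this sketch.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ppo_step(logits, old_probs, advantages, clip_eps=0.2, lr=0.5):
    """One gradient-ascent step on the clipped objective (toy, exact gradient)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for i, (p, p_old, adv) in enumerate(zip(probs, old_probs, advantages)):
        ratio = p / p_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        if ratio * adv > clipped * adv:
            continue  # clipped term is the minimum, so it contributes no gradient
        # Gradient of adv * pi(i) with respect to each softmax logit j.
        for j in range(len(logits)):
            grad[j] += adv * p * ((1.0 if i == j else 0.0) - probs[j])
    return [x + lr * g for x, g in zip(logits, grad)]

# Hypothetical reward-model scores for three candidate responses.
rewards = [0.1, 0.9, 0.3]
logits = [0.0, 0.0, 0.0]
old_probs = softmax(logits)  # policy that "generated" the responses

# Advantage = reward minus the expected reward under the old policy.
baseline = sum(p * r for p, r in zip(old_probs, rewards))
advantages = [r - baseline for r in rewards]

for _ in range(4):
    logits = ppo_step(logits, old_probs, advantages)
new_probs = softmax(logits)
```

After these steps, `new_probs` places more probability on the highest-reward response than `old_probs` did. Once a response's ratio leaves the clip range in the direction its advantage favors, that response stops contributing gradient.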
