Model Components
Reward Models
Reward models are trained on human preference data—pairs of model outputs where annotators indicate which is better—to learn a scalar quality score. This score serves as the reward signal in RLHF, guiding the language model's policy toward higher-quality, more aligned outputs.
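As a rough illustration of how a scalar score can be learned from preference pairs, the sketch below uses the pairwise Bradley-Terry loss, a common choice for this setup. The entry itself does not name a loss function, so treat that choice as an assumption. The function names `pairwise_preference_loss` and `mean_preference_loss` are hypothetical, and the scores are plain floats standing in for a real model's outputs.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(score_chosen - score_rejected)). The loss is small
    when the reward model scores the annotator-preferred output higher.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs):
    """Average the pairwise loss over (score_chosen, score_rejected) tuples."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Illustrative scores only: the first pair agrees with the annotator,
# the second pair disagrees and therefore contributes a larger loss.
example_pairs = [(2.0, 0.5), (0.1, 1.3)]
print(mean_preference_loss(example_pairs))
```

Minimizing this loss pushes the score of the preferred output above the score of the rejected one. In practice the scores would come from a neural network and the loss would be minimized by gradient descent.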
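Reward model quality is critical to RLHF success: because the policy is optimized against the reward model's score, it can learn to exploit the reward model's flaws and earn high scores without actually producing better outputs ("reward hacking"). Constitutional AI, which replaces some human preference labels with AI feedback guided by written principles, and process reward models, which score intermediate reasoning steps rather than only the final answer, are innovations aimed at improving reward model reliability.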
Related Terms
Techniques & Methods
Reinforcement Learning from Human Feedback (RLHF)
Training technique that refines AI models using feedback from human evaluators on output quality.
Techniques & Methods
Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties.
Techniques & Methods
Proximal Policy Optimization (PPO)
RL algorithm that stabilizes training by constraining how far each policy update can move from the previous policy.
Techniques & Methods
AI Alignment
The research field and engineering practice of building AI systems that reliably pursue goals humans actually want, remain controllable, and avoid harmful side effects — operationalized through RLHF, Constitutional AI, evaluations, and interpretability.

