Model Components

Reward Models

Reward models are trained on human preference data (pairs of model outputs for which annotators indicate the better one) and learn to assign a scalar quality score to any given output. This score serves as the reward signal in RLHF, guiding the language model's policy toward higher-quality, more aligned outputs.
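The pairwise training described above is commonly expressed as a Bradley-Terry style loss, which penalizes the model when the output annotators rejected scores close to or above the one they chose. The sketch below is a minimal illustration under that assumption, not a specific system's implementation; the function names are hypothetical, and the scores stand in for the scalar outputs of a reward model.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) for one preference pair.

    Assumption: a Bradley-Terry style objective, a common choice for
    training reward models on pairwise human preferences.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability the model assigns to the annotators' choice being preferred."""
    return math.exp(-pairwise_preference_loss(score_chosen, score_rejected))


if __name__ == "__main__":
    # Hypothetical scores for two outputs to the same prompt.
    print(pairwise_preference_loss(2.0, -1.0))   # small loss: ranking agrees with annotators
    print(pairwise_preference_loss(-1.0, 2.0))   # large loss: ranking disagrees
    print(preference_probability(0.0, 0.0))      # equal scores give an even split
```

Only the difference between the two scores matters in this formulation, so the absolute scale of the learned score is arbitrary; what the model learns is a ranking.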

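Reward model quality is critical to RLHF success: because the policy is optimized against the reward model's score, it can learn to exploit the model's flaws and earn high scores without producing better outputs ("reward hacking"). Constitutional AI, which derives preference labels from AI feedback guided by written principles, and process reward models, which score intermediate reasoning steps rather than only the final answer, are innovations aimed at improving reward model reliability.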

Authority Links

Reward Model — Wikipedia

How reward models function in RLHF pipelines.

InstructGPT — arXiv

Reward model training and use in aligning GPT via RLHF.

Related Terms

Techniques & Methods

Reinforcement Learning from Human Feedback (RLHF)

Training technique that refines AI models using feedback from human evaluators on output quality.

Techniques & Methods

Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties.

Techniques & Methods

Proximal Policy Optimization (PPO)

Policy-gradient RL algorithm that stabilizes training by constraining how far each update can move the policy.

Techniques & Methods

AI Alignment

The research field and engineering practice of building AI systems that reliably pursue goals humans actually want, remain controllable, and avoid harmful side effects — operationalized through RLHF, Constitutional AI, evaluations, and interpretability.
