Techniques & Methods

Reinforcement Learning from Human Feedback (RLHF)

RLHF starts by collecting human preference data: evaluators compare pairs of model outputs and select the better one. A reward model is trained on those preferences, and the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize the reward model's score.
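The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular lab: the function names are hypothetical, the reward model is reduced to two scalar scores, and the pairwise loss uses the Bradley-Terry form that is commonly used for preference data. The KL-style penalty in the second function is not mentioned in the definition above; it is a common addition in PPO-based RLHF setups and is included here only as an assumption, with a hypothetical coefficient `beta`.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred output
    higher than the rejected one, and large when it ranks them the wrong way.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward_score: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength, chosen for illustration
) -> float:
    """Reward passed to the RL step (assumed setup, see lead-in).

    Subtracts a penalty when the fine-tuned policy assigns a much higher
    log-probability to a response than the original (reference) model does.
    """
    return reward_score - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # Reward model agrees with the human label: lower loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the human label: higher loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # Reward after discounting drift away from the reference model.
    print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_reference=-1.5))
```

In a real system the scores would come from a neural reward model and the loss would be averaged over many comparisons. The RL algorithm (typically PPO) would then update the LLM to increase the reward it receives.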

RLHF is one of the main techniques used to align modern LLMs such as ChatGPT, Claude, and Gemini with human values and preferences. It turns pre-trained models from pattern-completers into assistants that aim to be helpful and harmless.

Authority Links

InstructGPT / RLHF Paper — arXiv

Seminal paper on aligning GPT with human preferences via RLHF.

IBM — RLHF

How reinforcement learning from human feedback works.

Related Terms

Techniques & Methods

Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties.

Techniques & Methods

Proximal Policy Optimization (PPO)

Policy-gradient RL algorithm that keeps training stable by limiting how far each update can move the policy, typically through a clipped objective.

Model Components

Reward Models

Models trained to score AI outputs based on human preferences for use in reinforcement learning.

Techniques & Methods

Supervised Fine-Tuning

Refining a pre-trained model's performance on a specific task using labeled example data.
