Techniques & Methods
Reinforcement Learning from Human Feedback (RLHF)
RLHF starts by collecting human preference data: evaluators compare pairs of model outputs and select the better one. A reward model is trained on those preferences, and the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize the reward model's score.
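The two stages can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production system is implemented: the reward model here is a linear function over hand-made two-number feature vectors, whereas real reward models are neural networks scoring text. The pairwise loss is the Bradley-Terry style objective commonly used for preference data, and the last function is the clipped surrogate term from PPO. All function names and the sample data are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . x on (chosen, rejected) feature pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of the pairwise loss w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so gradient descent adds a positive multiple of diff to w.
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample.

    ratio is new_policy_prob / old_policy_prob for the sampled action;
    advantage would be derived from the reward model's score in RLHF.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical features: [helpfulness signal, rudeness signal].
# Each pair is (features of the preferred output, features of the other output).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.9, 0.2], [0.2, 0.6]),
]
weights = train_reward_model(pairs, dim=2)


def reward(features):
    """Score an output's features with the learned linear reward."""
    return sum(w * f for w, f in zip(weights, features))
```

After training, `reward` scores each preferred output above its rejected counterpart. The clipping in `ppo_clipped_objective` caps how much a single update can gain from moving the policy away from the one that generated the samples.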
RLHF is a widely used technique for aligning modern LLMs, including those behind ChatGPT, Claude, and Gemini, with human values and preferences. It turns pre-trained models from pattern-completers into helpful, harmless assistants.
Related Terms
Techniques & Methods
Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties.
Techniques & Methods
Proximal Policy Optimization (PPO)
Policy-gradient RL algorithm that stabilizes training by limiting how far each update can move the policy, typically through a clipped objective.
Model Components
Reward Models
Models trained to score AI outputs based on human preferences for use in reinforcement learning.
Techniques & Methods
Supervised Fine-Tuning
Refining a pre-trained model's performance on a specific task using labeled example data.

