Kubnal Bridge

Techniques & Methods

Reinforcement Learning from Human Feedback (RLHF)

RLHF starts by collecting human preference data: evaluators compare pairs of model outputs and select the better one. A reward model is trained on those preferences, and the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize the reward model's score.
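The reward-model step can be illustrated with a small sketch. The entry does not say which loss is used; the example below assumes the commonly used Bradley-Terry style pairwise loss, where the reward model is penalized when it scores the rejected output above the chosen one. The function names and the sample scores are hypothetical, and real systems compute these scores with a neural network rather than plain floats.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Assumed Bradley-Terry style objective; the loss is small when the
    reward model scores the human-preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) score pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three human-labeled comparisons.
    scores = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
    print(f"mean loss: {mean_preference_loss(scores):.4f}")
```

Minimizing this loss pushes the reward model to assign higher scores to outputs that evaluators preferred; the reinforcement learning stage then optimizes the LLM against those scores.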

RLHF is a primary technique for aligning modern LLMs such as ChatGPT, Claude, and Gemini with human values and preferences. It turns pre-trained models, which mainly continue text patterns, into assistants that aim to be helpful and harmless.
