Applications

InstructGPT

InstructGPT, developed by OpenAI and described in a 2022 paper, showed that fine-tuning GPT-3 with reinforcement learning from human feedback (RLHF) made the model better at following diverse instructions, more truthful, and less prone to toxic output. In human evaluations, outputs from the 1.3B-parameter InstructGPT model were preferred over outputs from the 175B-parameter GPT-3 base model, even though the smaller model has roughly 100 times fewer parameters.

InstructGPT established the training paradigm of supervised fine-tuning (SFT) followed by RLHF, which has become standard for aligning LLMs. It directly preceded ChatGPT and influenced alignment approaches at Anthropic (Constitutional AI) and Google DeepMind.
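As a rough illustration of the RLHF stage: before any reinforcement learning happens, a reward model is trained on pairs of responses that human labelers have ranked, typically with a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)). The sketch below shows that loss in plain Python. The function names and the scalar-reward interface are hypothetical, chosen for illustration; they are not taken from OpenAI's code, and a real pipeline would compute this over batches of model outputs in a deep learning framework.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (reward_chosen, reward_rejected) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward scores for three labeled comparisons.
example_pairs = [(1.5, 0.2), (0.3, 0.4), (2.0, -1.0)]
print(mean_preference_loss(example_pairs))
```

When the two rewards are equal the loss is log 2 (about 0.693), which is the value an untrained reward model starts near. The trained reward model then supplies the scalar signal that the policy is optimized against in the reinforcement learning step.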

Authority Links

InstructGPT Paper — arXiv

Original InstructGPT paper on RLHF-based instruction following.

InstructGPT — Wikipedia

Overview of InstructGPT and its role in modern LLM alignment.

Related Terms

Techniques & Methods

Reinforcement Learning from Human Feedback (RLHF)

Training technique that refines AI models using feedback from human evaluators on output quality.

Techniques & Methods

Supervised Fine-Tuning

Refining a pre-trained model's performance on a specific task using labeled example data.

Techniques & Methods

AI Alignment

The research field and engineering practice of building AI systems that reliably pursue goals humans actually want, remain controllable, and avoid harmful side effects — operationalized through RLHF, Constitutional AI, evaluations, and interpretability.

Model Components

Generative Pre-trained Transformer (GPT)

A family of decoder-only Transformer language models, pioneered by OpenAI, that combines large-scale unsupervised pre-training on text with subsequent fine-tuning or alignment to produce general-purpose text generators.
