Kubnal Bridge

Model Components

Reward Models

Reward models are trained on human preference data—pairs of model outputs where annotators indicate which is better—to learn a scalar quality score. This score serves as the reward signal in RLHF, guiding the language model's policy toward higher-quality, more aligned outputs.

Reward model quality is critical to RLHF success: a flawed reward model will cause the policy to be optimized in unintended directions ("reward hacking"). Constitutional AI and process reward models are innovations aimed at improving reward model reliability.

Authority Links

Related Terms