Techniques & Methods
Adversarial Training
Adversarial training adds deliberately challenging or misleading examples to the training set to improve model robustness. In computer vision, adversarial examples are typically images with small, often imperceptible perturbations that fool classifiers; in NLP, they include paraphrases, grammatical variations, or intentionally misleading queries.
In GANs, adversarial training describes the competition between the generator and discriminator networks. In safety-focused LLM development, adversarial training builds on red-teaming: humans craft inputs designed to elicit harmful outputs, and those inputs are then used to improve refusal behavior.
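The perturbation-based form of adversarial training described above can be sketched on a toy classifier. The example below is a minimal illustration, not a reference implementation: it assumes a two-feature logistic-regression model, uses the Fast Gradient Sign Method (FGSM) as one common way to generate perturbed inputs, and trains on each clean example together with its perturbed copy. The function names (predict, fgsm, train) and the parameters eps, epochs, and lr are hypothetical choices made for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    # For log loss, the gradient with respect to the input is (p - y) * w.
    # FGSM moves each feature by eps in the direction that increases the loss.
    p = predict(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def train(data, eps=0.0, epochs=200, lr=0.1):
    # Plain SGD; when eps > 0, each clean example is paired with an
    # FGSM-perturbed copy built against the current weights.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            batch = [(x, y)]
            if eps > 0:
                batch.append((fgsm(w, b, x, y, eps), y))
            for xi, yi in batch:
                g = predict(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
                w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
                b -= lr * g
    return w, b
```

Calling train(data, eps=0.5) on a list of (features, label) pairs returns weights fitted to both the clean and the perturbed inputs, while eps=0.0 reduces to ordinary training. Real systems apply the same loop to deep networks and usually rely on stronger attacks than single-step FGSM.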
Related Terms
Model Components
Generative Adversarial Network (GAN)
Framework that trains two competing networks, a generator and a discriminator, to produce realistic synthetic data.
Techniques & Methods
AI Alignment
The research field and engineering practice of building AI systems that reliably pursue goals humans actually want, remain controllable, and avoid harmful side effects — operationalized through RLHF, Constitutional AI, evaluations, and interpretability.
Techniques & Methods
Training
Teaching a model to make accurate predictions by exposing it to large datasets.
Core Concepts
Bias
Systematic skew in an AI model's outputs, often inherited from its training data, that affects decision-making and fairness.

