Miscellaneous
Training Data
Training data is the primary input to machine learning: it provides the examples from which the model learns patterns, relationships, and representations. For supervised learning, it consists of input-output pairs; for unsupervised pre-training, it is typically raw, unlabeled text at internet scale.
Training data quality (accuracy, diversity, representativeness, and cleanliness) is widely regarded as one of the largest determinants of model quality. Data curation, deduplication, and filtering have become as important as architecture choices for frontier model development.
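As a rough illustration of deduplication and filtering, the sketch below drops very short documents and exact duplicates after whitespace and case normalization. The function names and the `min_words` threshold are hypothetical choices for this example. Production pipelines typically add near-duplicate detection and quality classifiers, which are not shown here.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe_and_filter(docs, min_words=5):
    """Keep the first copy of each document that has at least min_words words.

    min_words is an illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Filtering step: skip documents that are too short to be useful.
        if len(norm.split()) < min_words:
            continue
        # Deduplication step: compare content hashes of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The cat sat on the mat.",
    "the  cat sat on   the mat.",  # duplicate after normalization
    "too short",                   # removed by the length filter
    "A different sentence with enough words.",
]
cleaned = dedupe_and_filter(docs)
```

Hashing the normalized text instead of the raw string means copies that differ only in casing or spacing count as duplicates, while the original text of the first copy is what gets kept.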
Related Terms
Validation Data
A held-out data split used during training to tune hyperparameters and monitor generalization.
Test Data
A held-out dataset used only once, after all training and tuning are finished, to obtain an unbiased estimate of final model performance.
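The relationship between training, validation, and test data can be sketched as a single shuffled split. This is a minimal example using only the standard library. The function name, the 80/10/10 proportions, and the fixed seed are assumptions for illustration, not a prescribed recipe.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into disjoint train, validation, and test splits."""
    items = list(examples)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


# Hypothetical labeled examples: (input, label) pairs.
examples = [(f"input {i}", i % 2) for i in range(100)]
train, val, test = split_dataset(examples)
```

The validation split can be consulted repeatedly while tuning hyperparameters. The test split is set aside and evaluated once at the end, so it never influences any training decision.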
Dataset
An organized collection of data examples prepared for training, evaluating, or testing AI models.
Label
An annotation indicating the correct output or category for a training example in supervised learning.

