Miscellaneous
Dataset
Datasets are the raw material of machine learning. A well-designed dataset combines diverse, representative examples that cover the intended input distribution, accurate labels (for supervised tasks), and clean preprocessing. Public benchmark datasets such as ImageNet, GLUE, and SQuAD have standardized how progress is measured across the field.
Dataset curation for LLMs involves web scraping, filtering (removing toxic content and low-quality text), deduplication, and quality scoring. The composition, scale, and diversity of pre-training data are major determinants of model capability.
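The curation steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any production pipeline works: the function names (`normalize`, `quality_score`, `curate`), the thresholds, and the character-ratio quality heuristic are all assumptions made for the example. Real pipelines typically add near-duplicate detection and learned quality or toxicity classifiers.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def quality_score(text: str) -> float:
    """Toy heuristic (an assumption): share of alphabetic or whitespace characters."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def curate(docs, min_words=5, min_score=0.8, blocklist=frozenset()):
    """Filter and exactly deduplicate documents, keeping first occurrences in order."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        words = norm.split()
        # Filtering: drop very short documents.
        if len(words) < min_words:
            continue
        # Filtering: drop documents containing any blocklisted token.
        if any(w in blocklist for w in words):
            continue
        # Quality scoring: drop documents dominated by digits or symbols.
        if quality_score(norm) < min_score:
            continue
        # Deduplication: exact match on the hash of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Calling `curate(docs)` on a list of strings returns the surviving documents in their original order. Hashing normalized text only catches exact duplicates up to case and whitespace; near-duplicates would need a fuzzier method such as MinHash.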
Related Terms
Miscellaneous
Training Data
The labeled or unlabeled dataset used to fit a model's parameters during the learning process.
Miscellaneous
Validation Data
A held-out data split used during training to tune hyperparameters and monitor generalization.
Miscellaneous
Test Data
A held-out dataset used only once, after all training and tuning, to obtain an unbiased estimate of final model performance.
Techniques & Methods
Data Augmentation
Increasing training dataset size and diversity by creating modified copies of existing data.
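The related terms above fit together in a simple workflow: split the data once, then augment only the training portion so that validation and test data stay untouched. The sketch below is a minimal illustration under stated assumptions: `split_dataset`, `augment_text`, and `augment_training_set` are hypothetical helper names, the 80/10/10 split is an arbitrary default, and random word deletion stands in for whatever augmentation suits the task.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then carve out test, validation, and training splits."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def augment_text(text, rng, drop_prob=0.1):
    """Toy augmentation: delete each word independently with probability drop_prob."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Fall back to the original text if every word was dropped.
    return " ".join(kept) if kept else text


def augment_training_set(train, copies=1, seed=0):
    """Return the original (text, label) pairs plus `copies` modified copies of each."""
    rng = random.Random(seed)
    augmented = list(train)
    for text, label in train:
        for _ in range(copies):
            augmented.append((augment_text(text, rng), label))
    return augmented
```

Augmenting after the split matters: if modified copies of one example landed in both the training and test sets, the test score would overstate how well the model generalizes.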

