Miscellaneous

Dataset

Datasets are the raw material of machine learning. A well-designed dataset includes diverse, representative examples that cover the intended input distribution, accurate labels (for supervised tasks), and clean, consistent preprocessing. Public benchmark datasets such as ImageNet, GLUE, and SQuAD have standardized how progress is measured across the field.

Dataset curation for LLMs typically involves web scraping, filtering (removing toxic content and low-quality text), deduplication, and quality scoring. The composition, scale, and diversity of pre-training data are major determinants of model capability.
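The curation steps described above can be sketched as a small pipeline. This is a minimal illustration, not a production recipe: the blocklist, the thresholds, and the `quality_score` heuristic are hypothetical stand-ins, and real pipelines usually rely on near-duplicate detection and learned classifiers rather than the exact-match hashing and character ratio used here.

```python
import hashlib
import re

# Hypothetical blocklist and thresholds, chosen only for illustration.
BLOCKLIST = {"badword"}
MIN_WORDS = 5
MIN_QUALITY = 0.5


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quality_score(text: str) -> float:
    """Toy heuristic: fraction of characters that are letters or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def curate(docs):
    """Return the documents that survive deduplication, filtering, and scoring."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)

        # Deduplication: drop exact duplicates of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)

        # Filtering: drop very short documents and blocklisted content.
        words = norm.split()
        if len(words) < MIN_WORDS:
            continue
        if any(word.strip(".,!?") in BLOCKLIST for word in words):
            continue

        # Quality scoring: drop documents below the assumed threshold.
        if quality_score(norm) < MIN_QUALITY:
            continue

        kept.append(doc)
    return kept
```

Calling `curate` on a list of raw strings returns the surviving documents in their original order, so the same function could be run on each scraped batch before the text is added to a training corpus.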

Authority Links

Dataset — Wikipedia

Definition, types, and importance of datasets in ML.

Hugging Face Datasets

Repository of public ML datasets for training and evaluation.

Related Terms

Miscellaneous

Training Data

The labeled or unlabeled dataset used to fit a model's parameters during the learning process.

Miscellaneous

Validation Data

A held-out data split used during training to tune hyperparameters and monitor generalization.

Miscellaneous

Test Data

A held-out dataset used only once, at the end of development, to obtain an unbiased estimate of final model performance.

Techniques & Methods

Data Augmentation

Increasing the size and diversity of a training dataset by creating modified copies of existing examples.
