Techniques & Methods

Data Augmentation

Data augmentation artificially expands a training set by applying label-preserving transformations to existing examples: flipping and cropping images, adding noise to audio, or paraphrasing and back-translating text. It improves model robustness and reduces overfitting, especially when the original data is scarce.
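The image and audio transformations named above can be illustrated with a minimal sketch in plain Python. The function names (`horizontal_flip`, `random_crop`, `add_gaussian_noise`) and the tiny 3x3 "image" are hypothetical and chosen only for illustration; real pipelines would normally rely on an array or image library.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def add_gaussian_noise(samples, sigma, rng):
    """Add zero-mean Gaussian noise to a 1-D audio signal."""
    return [s + rng.gauss(0.0, sigma) for s in samples]

# Toy inputs (assumed for illustration only).
rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = horizontal_flip(image)
crop = random_crop(image, 2, 2, rng)
noisy = add_gaussian_noise([0.0, 0.5, -0.5], sigma=0.01, rng=rng)
```

Each call returns a new example and leaves the original untouched, so the augmented copies can be added to the training set alongside the source data with the same label.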

In NLP, augmentation techniques include synonym replacement, random insertion and deletion of words, back-translation (translating a sentence into another language and back), and using LLMs to generate paraphrases. For LLM pre-training, data augmentation is less common given the abundance of internet text, but it is critical in specialized low-resource domains.
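The token-level techniques listed above can be sketched as follows. This is a simplified illustration, not a reference implementation: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus or embedding lookup, and the function names are assumptions made for this example.

```python
import random

# Hypothetical toy synonym table; a real system would query a thesaurus
# or a word-embedding model instead.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}

def synonym_replacement(tokens, rng, table=SYNONYMS):
    """Replace every token that has a known synonym with a random one."""
    return [rng.choice(table[t]) if t in table else t for t in tokens]

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_insertion(tokens, rng, table=SYNONYMS):
    """Insert a synonym of one known token at a random position."""
    candidates = [t for t in tokens if t in table]
    if not candidates:
        return list(tokens)  # nothing to insert; return an unchanged copy
    new_word = rng.choice(table[rng.choice(candidates)])
    out = list(tokens)
    out.insert(rng.randint(0, len(out)), new_word)
    return out

rng = random.Random(0)
tokens = ["the", "quick", "dog", "is", "happy"]

replaced = synonym_replacement(tokens, rng)
shortened = random_deletion(tokens, 0.2, rng)
lengthened = random_insertion(tokens, rng)
```

Back-translation and LLM paraphrasing need an external translation or language model, so they are left out of this sketch.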

Authority Links

Data Augmentation — Wikipedia

Techniques and benefits of data augmentation in ML.

NLP Augmentation — arXiv

Survey of text data augmentation techniques for NLP tasks.

Related Terms

Miscellaneous

Training Data

The labeled or unlabeled dataset used to fit a model's parameters during the learning process.

Core Concepts

Overfitting

A model learns the detail and noise in its training data too thoroughly, reducing its ability to generalize to new data.

Techniques & Methods

Training

Teaching a model to make accurate predictions by exposing it to large datasets.

Miscellaneous

Dataset

An organized collection of data examples prepared for training, evaluating, or testing AI models.
