Miscellaneous

Corpus

A corpus is a structured collection of texts (books, web pages, code, scientific papers, social media posts) used as training material for language models or as a reference for linguistic analysis. Pre-training corpora for frontier LLMs span trillions of tokens drawn from diverse, mostly web-based sources.

Corpus composition strongly shapes model behavior: a corpus dominated by English yields a model with weaker multilingual capabilities, while one heavy in code yields a model with stronger coding performance. Careful corpus curation is therefore a competitive differentiator in LLM development.
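
To make "composition" concrete, the following minimal sketch measures each category's share of a corpus by token count. It rests on several assumptions: the category labels and the toy documents are hypothetical, and whitespace splitting stands in for a real tokenizer, which would give different counts.

```python
from collections import Counter


def composition(docs):
    """Return each category's share of the total token count.

    `docs` is an iterable of (category, text) pairs. Tokens are
    approximated by whitespace splitting, which is an assumption made
    for illustration and not how production tokenizers work.
    """
    counts = Counter()
    for category, text in docs:
        counts[category] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical toy corpus: (category, text) pairs.
toy_corpus = [
    ("english_web", "the cat sat on the mat"),
    ("code", "def add(a, b): return a + b"),
    ("german_web", "die Katze sitzt"),
]

shares = composition(toy_corpus)
# Print categories from largest to smallest share.
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.1%}")
```

A report like this is one simple way to spot an imbalance, such as a single language or source dominating, before training begins.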

Authority Links

Text Corpus — Wikipedia

Definition and types of text corpora in NLP.

Common Crawl

The largest publicly available web corpus used in LLM pre-training.

Related Terms

Miscellaneous

Training Data

The labeled or unlabeled dataset used to fit a model's parameters during the learning process.

Miscellaneous

Dataset

An organized collection of data examples prepared for training, evaluating, or testing AI models.

Techniques & Methods

Pre-training

Initial phase where a model learns general representations from large datasets before task-specific fine-tuning.

Model Components

Language Model

AI system that assigns probabilities to sequences of words and can generate coherent text.
