Miscellaneous
Corpus
A corpus is a structured collection of texts (books, web pages, code, scientific papers, social media posts) used as training material for language models or as a reference for linguistic analysis. Pre-training corpora for frontier LLMs typically span trillions of tokens, drawn largely from web sources.
Corpus composition strongly shapes model behavior: a corpus dominated by English tends to produce a model with weaker multilingual capabilities, while one heavy in code tends to produce stronger coding performance. For this reason, careful corpus curation is a competitive differentiator in LLM development.
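As a rough illustration of what "composition" means in practice, the sketch below measures each source's share of a tiny toy corpus. The documents, the source labels, and the `composition_by_source` helper are all hypothetical, and whitespace splitting stands in for a real tokenizer, so the numbers are only indicative.

```python
from collections import Counter

# Toy corpus: each document carries a hypothetical source label.
toy_corpus = [
    {"source": "web", "text": "the quick brown fox jumps over the lazy dog"},
    {"source": "code", "text": "def add(a, b): return a + b"},
    {"source": "web", "text": "language models learn from large text collections"},
    {"source": "papers", "text": "we evaluate the model on held out data"},
]


def composition_by_source(corpus):
    """Return each source's share of the total whitespace-token count."""
    counts = Counter()
    for doc in corpus:
        # Assumption: whitespace splitting approximates tokenization.
        counts[doc["source"]] += len(doc["text"].split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


shares = composition_by_source(toy_corpus)
for source, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{source}: {share:.1%}")
```

A curation pipeline might compare shares like these against target proportions and up- or down-sample sources accordingly, though real pipelines also involve deduplication and quality filtering that this sketch omits.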
Related Terms
Miscellaneous
Training Data
The labeled or unlabeled dataset used to fit a model's parameters during the learning process.
Miscellaneous
Dataset
An organized collection of data examples prepared for training, evaluating, or testing AI models.
Techniques & Methods
Pre-training
Initial phase where a model learns general representations from large datasets before task-specific fine-tuning.
Model Components
Language Model
AI system that assigns probabilities to sequences of words and can generate coherent text.

