Kubnal Bridge

Corpus

A corpus is a structured collection of texts, such as books, web pages, code, scientific papers, and social media posts, used as training material for language models or as a reference for linguistic analysis. Pre-training corpora for frontier LLMs span trillions of tokens, drawn largely from diverse web sources.

Corpus composition strongly shapes model behavior: a corpus dominated by English tends to produce a model with weaker multilingual capabilities, while one heavy in code tends to produce stronger coding performance. Careful corpus curation is therefore a competitive differentiator in LLM development.
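
To make "composition" concrete, the sketch below measures what share of a toy corpus's tokens comes from each source category. Everything in it is an illustrative assumption: the function name composition, the category labels, the sample texts, and the use of whitespace splitting as a stand-in for a real tokenizer. Production pipelines typically count subword tokens and work at far larger scale.

```python
from collections import Counter


def composition(docs):
    """Return each category's share of the total token count.

    `docs` is an iterable of (category, text) pairs. Tokens are
    approximated by whitespace splitting (an assumption; real
    pipelines usually use a subword tokenizer).
    """
    counts = Counter()
    for category, text in docs:
        counts[category] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical toy corpus: labels and texts are made up for illustration.
toy_corpus = [
    ("english_web", "the quick brown fox jumps over the lazy dog"),
    ("code", "def add(a, b): return a + b"),
    ("german_web", "der schnelle braune Fuchs"),
]

shares = composition(toy_corpus)
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.0%}")
```

A report like this is one simple way to spot an imbalance, for example a corpus in which a single language or source type accounts for most of the tokens, before deciding whether to rebalance it.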
