Core Concepts

Token

Tokenization splits text into tokens—sub-word units that balance vocabulary size with linguistic coverage. GPT-4 uses roughly 1 token per 4 characters of English text, so 1,000 words ≈ 750 tokens.

Token counts determine API costs, context window usage, and generation speed. Concise, well-structured content is processed more efficiently and fits more meaning into limited context windows.

Authority Links

Tokenization — Wikipedia

Technical explanation of tokenization in NLP.

OpenAI Tokenizer

Interactive tool to visualize how GPT models tokenize text.

Related Terms

Model Components

Context Window

The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.

Model Components

Large Language Model (LLM)

A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.

Turing Test Supervised Learning