Model Components

Context Window

The context window is the maximum number of tokens a language model can process in a single inference pass. It includes everything the model "sees" at once: the system prompt, the conversation history, any retrieved documents inserted by RAG, the user's current message, and the response the model is generating. When the context window is exceeded, earlier content has to be dropped or summarized — what falls out of the window is functionally forgotten.

Context window sizes have grown dramatically. GPT-3 (2020) was 4K tokens. GPT-4 launched at 8K and 32K variants in 2023. Modern frontier models offer much larger windows: Claude Sonnet/Opus at 1M tokens, Gemini Pro at 1M-2M tokens, GPT-4-Turbo and GPT-5 at 128K-1M. A 1M-token context window can fit roughly 750,000 words — about 7-8 full novels — in a single inference pass.

Bigger isn't automatically better. Multiple studies — most notably "Lost in the Middle" (Liu et al., 2023) — have shown that LLMs perform best on information placed at the start and end of a context window, and worst on information buried in the middle. Even with a 1M-token window, performance on information at position 400K can be substantially worse than at position 1K or 999K. The "U-shaped attention curve" is consistent across model families.

Cost and latency scale with context length, sometimes nonlinearly. Doubling input tokens typically doubles input cost; for some architectures, attention computation scales O(n²) with sequence length, making very long contexts slow. The practical implication: most production systems use shorter focused contexts (4K-32K) for routine queries and reserve large contexts for tasks like long-document analysis where they're actually needed.

Context engineering — deciding what to include in the context window — is therefore as important as prompt engineering. The right system prompt, the right retrieved chunks, and the right conversation history selection often matters more than raw window size. RAG systems specifically depend on context engineering: retrieve too few chunks and the model lacks grounding; retrieve too many and the relevant chunk gets lost in the middle.

Why it matters in GEO / AI search

In GEO, the context window is where the actual citation decision happens. When ChatGPT Search retrieves your page, the relevant chunks land in the model's context alongside other sources, the user's query, and the system prompt instructing the model to cite. Whether your chunk gets cited depends on how it ranks against the competition within that context — not how it ranks in some abstract authority sense.

"Lost in the middle" affects publishers directly. If 8 sources are retrieved and your chunk lands at position 4, statistically you're less likely to be cited than chunks at positions 1 or 8 — even if your chunk is more relevant. The defense is content density: a chunk that's unambiguously the best answer in its position will get cited regardless. Vague or generic chunks lose to position bias more often.

Long-context models also enable new content strategies. With a 1M-token window, an AI agent can ingest your entire docs site in one pass and answer detailed questions about the corpus as a whole. Publishers who structure their content as a coherent corpus (consistent voice, internal cross-links, clear taxonomy) extract more value from long-context AI than publishers with fragmented content. Site-wide structural coherence becomes a GEO asset.

Examples

Typical retrieval depth

A RAG system retrieves top-5 chunks of 500 tokens each, plus a 1K-token system prompt and 500-token user query. Total context: ~4K tokens — well within any model's window, leaving plenty of room for the response.

Long-document analysis

A user pastes a 200-page legal contract into Claude with a 1M-token window. The model can answer questions referencing any clause without RAG infrastructure — long context replaces retrieval for single-document tasks.

Lost-in-the-middle failure

A user provides 50 retrieved sources to ChatGPT for a research synthesis. The model accurately cites sources 1-5 and 45-50, but largely ignores sources 20-30 even though some contain the most relevant data. Fix: rank sources by relevance and place the best ones at start and end of the context.

Conversation overflow

A long ChatGPT conversation hits the context window limit. The system silently drops the earliest messages — including possibly the original system prompt or key user instructions. The model "forgets" what it was told to do, leading to drift. Mitigation: explicitly re-state key instructions periodically.

Authority Links

Context Window — Wikipedia (LLM)

How context windows define LLM memory and processing limits.

Lost in the Middle — arXiv 2307.03172

Liu et al. research on LLM performance degradation for middle-context information.

Anthropic — Long Context

Practical patterns for using long context windows effectively.

Related Terms

Core Concepts

Token

Smallest processing unit in NLP: a word, word part, or character.

Model Components

Large Language Model (LLM)

A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.

Model Components

Maximum Response Length

The upper limit on the number of tokens a model can generate in a single response.

Techniques & Methods

Attention

Core mechanism in transformers that dynamically weights the importance of different input positions.

Discriminator (in GAN)Contextual Embeddings