Model Components

Maximum Response Length

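Maximum response length (max_tokens in many APIs, max_new_tokens in libraries such as Hugging Face Transformers) caps the number of tokens a model may generate in a single request. It prevents runaway generation and bounds per-request cost. Set too low, the cap cuts responses off mid-answer; set too high, it allows long outputs that raise cost and latency.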
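One way to notice that a cap was set too low is to inspect why generation stopped. The sketch below assumes the response has already been parsed into a plain dict. The Anthropic Messages API reports a stop_reason of "max_tokens" and OpenAI chat completions report a finish_reason of "length" when the cap is reached; the helper name was_truncated is hypothetical.

```python
def was_truncated(response: dict) -> bool:
    """Return True if the parsed response says generation stopped at the token cap."""
    # Anthropic Messages API style: a top-level "stop_reason" field.
    if response.get("stop_reason") == "max_tokens":
        return True
    # OpenAI chat completions style: a "finish_reason" on each choice.
    for choice in response.get("choices", []):
        if choice.get("finish_reason") == "length":
            return True
    return False
```

A caller might use this to retry with a higher limit or to flag the answer as incomplete.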

In production, the appropriate limit depends on the task: question answering typically needs a much shorter limit than document generation. The context window size minus the input length gives the theoretical maximum; practical limits are usually set lower to control cost.
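The relationship between the theoretical maximum and a practical, task-specific cap can be sketched as follows. The task names, cap values, and the function response_budget are illustrative assumptions, not recommendations from any provider.

```python
# Illustrative per-task caps in tokens; these values are assumptions.
TASK_CAPS = {"question_answering": 256, "document_generation": 4096}


def response_budget(context_window: int, input_tokens: int, task: str) -> int:
    """Pick a max response length: the task cap, bounded by remaining context."""
    # Theoretical maximum: whatever the input leaves free in the context window.
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("input already fills the context window")
    # Practical limit: the lower of the task cap and the remaining space.
    return min(TASK_CAPS[task], remaining)


print(response_budget(8192, 1000, "question_answering"))   # 256
print(response_budget(8192, 6000, "document_generation"))  # 2192
```

In the second call the input leaves only 2192 tokens free, so the remaining context, not the task cap, sets the limit.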

Authority Links

Anthropic API — Max Tokens

How maximum response length is set in the Anthropic API.

OpenAI API — Max Tokens

max_tokens parameter documentation for OpenAI chat completions.

Related Terms

Model Components

Context Window

The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.

Core Concepts

Token

The smallest unit of text a language model processes: a word, part of a word, or a single character.

Techniques & Methods

Inference

Using a trained AI model to generate predictions or responses on new, unseen data.

Techniques & Methods

Generation

Producing new text, code, or content based on learned patterns and a given input prompt.

Model Generative Pre-trained Transformer (GPT)