Model Components
Maximum Response Length
Maximum response length (max_tokens or max_new_tokens in APIs) caps the number of tokens a model may generate per request. It prevents runaway generation and controls API costs. Setting it too low truncates responses before they are complete; setting it too high allows longer outputs, which raises worst-case cost and latency.
In production, the appropriate maximum length depends on the task: question answering needs a shorter limit than document generation. The context window size minus the input length gives the theoretical maximum; practical limits are set lower for cost control.
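The relationship between the theoretical and practical limits can be sketched in a few lines of Python. This is an illustrative sketch only: the function name max_output_budget, the TASK_CAPS table, and the specific cap values are assumptions made for the example, not values taken from any particular API or model.

```python
# Hypothetical per-task caps, chosen only for illustration.
TASK_CAPS = {
    "question_answering": 256,
    "document_generation": 4096,
}


def max_output_budget(context_window: int, input_tokens: int, task_cap: int) -> int:
    """Return a max_tokens value that fits the context window and the task cap.

    theoretical maximum = context window size - input length
    practical limit     = min(task cap, theoretical maximum)
    """
    if context_window <= 0 or task_cap <= 0 or input_tokens < 0:
        raise ValueError("context_window and task_cap must be positive; input_tokens non-negative")

    theoretical_max = context_window - input_tokens
    if theoretical_max <= 0:
        # The prompt alone fills the window, so no room is left for a response.
        raise ValueError("input leaves no room for a response")

    return min(task_cap, theoretical_max)


# Short prompt: the task cap is the binding limit.
qa_limit = max_output_budget(8192, 1000, TASK_CAPS["question_answering"])

# Long prompt: the remaining context window is the binding limit.
doc_limit = max_output_budget(8192, 7000, TASK_CAPS["document_generation"])
```

With an 8192-token window, the question-answering request is limited by its task cap, while the document-generation request is limited by the 1192 tokens left after a 7000-token input.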
Related Terms
Model Components
Context Window
The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.
Core Concepts
Token
Smallest processing unit in NLP: a word, word part, or character.
Techniques & Methods
Inference
Using a trained AI model to generate predictions or responses on new, unseen data.
Techniques & Methods
Generation
Producing new text, code, or content based on learned patterns and a given input prompt.

