Kubnal Bridge

Model Components

Maximum Response Length

Maximum response length (max_tokens or max_new_tokens in APIs) caps the number of tokens a model may generate per request. It prevents runaway generation and bounds API cost. Setting it too low truncates responses before they are complete; setting it too high raises worst-case cost and latency.
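The effect of the cap can be illustrated with a toy generation loop. This is a minimal sketch, not any particular library's implementation: the function generate, the next_token callback, the "<eos>" stop token, and the "stop"/"length" finish reasons are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple


def generate(
    next_token: Callable[[List[str]], str],
    max_new_tokens: int,
    stop_token: str = "<eos>",  # hypothetical end-of-sequence marker
) -> Tuple[List[str], str]:
    """Toy decoding loop: stop on the stop token or when the cap is reached."""
    output: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(output)
        if token == stop_token:
            # The model finished on its own before hitting the cap.
            return output, "stop"
        output.append(token)
    # The cap was reached first, so the response may be truncated.
    return output, "length"


# A "model" that never emits the stop token: only the cap ends generation.
def runaway(_: List[str]) -> str:
    return "la"


# A "model" that finishes naturally after three tokens.
def finite(out: List[str]) -> str:
    return "<eos>" if len(out) >= 3 else "word"


if __name__ == "__main__":
    print(generate(runaway, max_new_tokens=5))
    print(generate(finite, max_new_tokens=10))
    print(generate(finite, max_new_tokens=2))
```

In this sketch, a "length" finish reason signals that the limit was too low for the response, while the runaway case shows why some cap is needed at all.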

In production, the appropriate maximum length depends on the task: question answering needs a shorter limit than document generation. The context window size minus the input length gives the theoretical maximum; practical limits are set lower to control cost.
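The arithmetic above can be sketched as a small helper. The function name, the task names, and the per-task caps below are illustrative assumptions, not recommended values; token counts are assumed to come from whatever tokenizer the model uses.

```python
# Hypothetical per-task caps, chosen only to illustrate the idea.
TASK_CAPS = {
    "question_answering": 256,
    "document_generation": 4096,
}


def effective_max_tokens(context_window: int, input_tokens: int, practical_cap: int) -> int:
    """Return the output limit to request for one call.

    theoretical maximum = context window - input length
    effective limit     = min(theoretical maximum, practical cap)
    """
    if input_tokens >= context_window:
        raise ValueError("input already fills the context window")
    theoretical_max = context_window - input_tokens
    return min(theoretical_max, practical_cap)


if __name__ == "__main__":
    # Short prompt: the practical cap is the binding limit.
    print(effective_max_tokens(8192, 1000, TASK_CAPS["question_answering"]))
    # Long prompt: the remaining context is the binding limit.
    print(effective_max_tokens(8192, 7000, TASK_CAPS["document_generation"]))
```

With an assumed 8192-token context window, a 1000-token prompt leaves 7192 tokens in theory, but the question-answering cap keeps the request at 256; a 7000-token prompt leaves only 1192, which falls below the document-generation cap.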
