Kubnal Bridge

Techniques & Methods

Inference

Inference (also called prediction or forward pass) is the process of running trained model weights on new inputs to produce outputs. It is distinct from training, which updates weights. Inference latency, throughput, and cost are critical production concerns for LLM deployments.

Optimizing inference involves techniques like quantization (reducing weight precision), KV caching (storing attention computations), speculative decoding (using a smaller model to draft tokens), and batching requests. These reduce cost and latency without degrading quality.

Authority Links

Related Terms