Techniques & Methods

Inference

Inference (also called prediction) is the process of running a trained model's weights on new inputs to produce outputs, typically through one or more forward passes. It is distinct from training, which updates the weights. Inference latency, throughput, and cost are critical production concerns for LLM deployments.
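As a minimal sketch of the distinction, the snippet below runs a forward pass through a tiny logistic model. The weights, bias, and the `predict` helper are hypothetical values chosen for illustration; the point is only that inference reads the weights and never modifies them.

```python
import math

# Hypothetical weights, standing in for the result of a finished training run.
WEIGHTS = [0.8, -0.4]
BIAS = 0.1


def predict(features):
    """Forward pass: weighted sum plus bias, squashed by a sigmoid.

    The function only reads WEIGHTS and BIAS; nothing is updated.
    """
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


weights_before = list(WEIGHTS)
score = predict([1.0, 2.0])      # a new, unseen input
weights_unchanged = WEIGHTS == weights_before
print(f"score={score:.4f}, weights unchanged: {weights_unchanged}")
```

A training step would compute a loss against a known label and adjust `WEIGHTS`; inference skips that entirely.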

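Optimizing inference involves techniques such as quantization (storing weights at lower numeric precision), KV caching (reusing the attention keys and values already computed for earlier tokens), speculative decoding (using a smaller model to draft tokens that the larger model then verifies), and request batching. These reduce cost and latency, usually with little or no loss in output quality, though aggressive quantization can degrade it.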
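To make quantization concrete, here is a rough sketch of symmetric per-tensor int8 quantization in plain Python. The function names and sample weights are assumptions made for this example, and production libraries use more elaborate schemes (per-channel scales, calibration data); this only shows the core idea of trading precision for a smaller representation.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weight values for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding bounds the per-weight error by about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max round-trip error: {max_error:.5f}")
```

Each weight now fits in one byte instead of four (for float32), and the round-trip error is the precision given up in exchange.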

Authority Links

Inference in ML — Wikipedia

How ML models perform inference on new data.

Hugging Face — Inference Optimization

Techniques for optimizing LLM inference speed and cost.

Related Terms

Techniques & Methods

Training

Teaching a model to make accurate predictions by exposing it to large datasets.

Model Components

Large Language Model (LLM)

A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.

Model Components

Context Window

The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.

Techniques & Methods

Decoding Rules

Guidelines and algorithms that control how language models translate internal representations into output tokens.

Information Extraction Heuristics