Techniques & Methods
Inference
Inference (also called prediction or, for neural networks, a forward pass) is the process of running a trained model on new inputs to produce outputs. It is distinct from training: training updates the model's weights, while inference leaves them fixed. Inference latency, throughput, and cost are critical production concerns for LLM deployments.
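The difference between inference and training can be sketched with a toy linear model. This is a minimal illustration, not how an LLM is implemented: the function names `predict` and `train_step`, the example numbers, and the squared-error loss are all assumptions chosen for brevity.

```python
def predict(weights, bias, x):
    """Inference: a forward pass that reads the weights and never changes them."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_step(weights, bias, x, target, lr=0.01):
    """Training: one gradient step on squared error; returns updated parameters."""
    error = predict(weights, bias, x) - target
    new_weights = [w - lr * 2 * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * 2 * error
    return new_weights, new_bias


# Hypothetical parameters and input, chosen only for illustration.
weights, bias = [0.5, -0.25], 0.1
x = [2.0, 4.0]

before = list(weights)
y = predict(weights, bias, x)          # inference: weights stay as they were
new_weights, new_bias = train_step(weights, bias, x, target=1.0)
y_after = predict(new_weights, new_bias, x)  # prediction with updated weights
```

Calling `predict` any number of times leaves `weights` untouched, whereas `train_step` returns a different set of parameters that moves the prediction closer to the target.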
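Optimizing inference involves techniques such as quantization (reducing the numeric precision of weights), KV caching (storing the attention keys and values of earlier tokens so they are not recomputed), speculative decoding (using a smaller model to draft tokens that the larger model then verifies), and request batching. These techniques reduce cost, latency, or both. KV caching and speculative decoding are designed to leave the model's outputs unchanged, whereas aggressive quantization can degrade quality, and larger batches can raise per-request latency in exchange for higher throughput.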
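As one concrete case, quantization can be sketched in a few lines. The example below assumes symmetric per-tensor int8 quantization, which is only one of several schemes in use; the helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weight values, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error introduced by storing 8-bit integers.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as an 8-bit integer plus one shared floating-point scale, so memory use drops roughly fourfold relative to 32-bit floats. The rounding error per weight is bounded by half the scale, which is why very small weights (such as `0.003` here) can collapse to zero.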
Related Terms
Techniques & Methods
Training
The process of adjusting a model's weights so that it makes accurate predictions, typically by exposing it to large datasets and minimizing its error on them.
Model Components
Large Language Model (LLM)
A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.
Model Components
Context Window
The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.
Techniques & Methods
Decoding Rules
Guidelines and algorithms that control how language models translate internal representations into output tokens.

