Techniques & Methods
Attention
Attention in transformers computes query, key, and value projections for each token, then takes the dot product of each query with every key, scales the result, and applies a softmax so the weights for each token sum to 1. Each token's output is the correspondingly weighted sum of the value vectors. Multi-head attention runs this process in parallel with different learned projections, so each head can capture a different type of relationship.
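The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the names `matmul`, `softmax`, `attention`, `multi_head_attention`, and `random_matrix` are hypothetical helpers, the weights are random rather than learned, and the output projection that real multi-head layers apply after concatenation is omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    x is (n_tokens, d_model); each weight matrix is (d_model, d_head).
    Returns the (n_tokens, d_head) output and the (n_tokens, n_tokens) weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    # Query-key similarity, scaled by sqrt(d_head) before the softmax.
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, heads):
    """Run each head independently and concatenate the per-token outputs."""
    outputs = [attention(x, *head)[0] for head in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or token embeddings (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_head, n_heads = 4, 8, 4, 2  # illustrative sizes
x = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]

out, weights = attention(x, *heads[0])
combined = multi_head_attention(x, heads)
```

Each row of `weights` sums to 1, and `combined` has `n_heads * d_head` values per token because the head outputs are concatenated.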
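Attention is both a key innovation behind modern AI models and a major scaling bottleneck: standard attention takes O(n²) time and memory in sequence length n, so doubling the context roughly quadruples the cost. Efficient variants address this in different ways. FlashAttention computes exact attention but avoids writing the full n × n score matrix to slow memory, while sparse attention scores only a subset of token pairs.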
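The quadratic growth can be made concrete with a little arithmetic. The sketch below counts the entries in one head's n × n score matrix for one sequence; `attention_score_cost` is a hypothetical helper, and the default of 2 bytes per value assumes 16-bit storage.

```python
def attention_score_cost(n_tokens, d_head, bytes_per_value=2):
    """Rough cost of one head's score matrix for a single sequence."""
    score_entries = n_tokens * n_tokens  # one score per (query, key) pair
    return {
        "score_entries": score_entries,
        "score_bytes": score_entries * bytes_per_value,
        # Each score is a dot product over d_head dimensions.
        "qk_multiply_adds": score_entries * d_head,
    }

for n in (1_024, 2_048, 4_096):
    cost = attention_score_cost(n, d_head=64)
    print(n, cost["score_entries"], cost["score_bytes"])
```

Each doubling of `n_tokens` multiplies every figure by four, which is the O(n²) behavior described above.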
Related Terms
Techniques & Methods
Self-Attention
Mechanism allowing a model to weigh the importance of each part of an input relative to all other parts.
Techniques & Methods
Attention Mechanism
Neural network technique enabling models to focus on the most relevant parts of input when producing each output.
Model Components
Transformer
A neural-network architecture, introduced by Vaswani et al. in 2017, that uses self-attention and parallel computation across all sequence positions — the foundation under virtually every frontier language and multimodal model in production today.
Model Components
Context Window
The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.

