Techniques & Methods

Attention

In a transformer, attention projects each token into a query, a key, and a value vector. The dot product between a token's query and every key is scaled and passed through a softmax, and the resulting weights form a weighted sum of the value vectors. Multi-head attention runs this process in parallel with different learned projections, so each head can capture a different type of relationship.
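The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`project`, `attention`, `multi_head`), the toy identity projections, and the two-token input are all assumptions made for the example. The 1/sqrt(d_k) scaling and the softmax follow the standard transformer formulation, and the output projection that normally follows the concatenated heads is omitted for brevity.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating, for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def project(X, W):
    # Multiply each token vector in X (n x d_in) by the matrix W (d_in x d_out).
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]

def attention(X, Wq, Wk, Wv):
    # Single head: returns (outputs, weights) for the token vectors in X.
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    scale = math.sqrt(len(K[0]))  # assumed standard 1/sqrt(d_k) scaling
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, then normalize to weights.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

def multi_head(X, heads):
    # heads is a list of (Wq, Wk, Wv) triples; per-head outputs are concatenated.
    per_head = [attention(X, Wq, Wk, Wv)[0] for Wq, Wk, Wv in heads]
    return [[v for head in per_head for v in head[t]] for t in range(len(X))]

identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical stand-in for learned weights
X = [[1.0, 0.0], [0.0, 1.0]]          # two toy token embeddings
out, weights = attention(X, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token draws from every other token. In a trained model the projection matrices are learned rather than fixed, and each head has its own set.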

Attention is both a central innovation behind modern AI models and a primary scaling bottleneck: standard attention costs O(n²) time and memory in sequence length n, which makes long contexts expensive. Efficient variants address this in different ways: FlashAttention computes exact attention while reducing memory traffic, and sparse attention restricts each token to a subset of positions.
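To make the quadratic cost concrete, the sketch below counts query-key pairs for full attention and for one simple sparse pattern. The helper names, the 2-byte score size (roughly half precision), and the sliding-window pattern are assumptions chosen for illustration. Sparse attention methods use a variety of patterns, and this is only one of them.

```python
def full_pairs(n_tokens):
    # Full attention scores every (query, key) pair: n * n entries per head.
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    # Memory needed to materialize one head's score matrix
    # (assumes 2 bytes per score).
    return full_pairs(n_tokens) * bytes_per_score

def windowed_pairs(n_tokens, window):
    # Hypothetical sliding-window pattern: each token attends only to
    # positions within `window` steps of itself, clipped at the sequence ends.
    return sum(min(n_tokens - 1, i + window) - max(0, i - window) + 1
               for i in range(n_tokens))

for n in (1_000, 2_000, 4_000):
    print(n, score_matrix_bytes(n), windowed_pairs(n, 128))
```

Doubling the sequence length quadruples the full score matrix, while the windowed count grows roughly linearly for a fixed window.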

Authority Links

Transformer Attention — Wikipedia

How attention works within the transformer architecture.

Flash Attention — arXiv

IO-aware exact attention algorithm enabling longer context windows.

Related Terms

Techniques & Methods

Self-Attention

Mechanism allowing a model to weigh the importance of each part of an input relative to all other parts.

Techniques & Methods

Attention Mechanism

Neural network technique enabling models to focus on the most relevant parts of input when producing each output.

Model Components

Transformer

A neural-network architecture, introduced by Vaswani et al. in 2017, that uses self-attention and parallel computation across all sequence positions — the foundation under virtually every frontier language and multimodal model in production today.

Model Components

Context Window

The maximum number of tokens a language model can process in a single inference pass — everything the model "sees" at once, including system prompt, conversation history, retrieved documents, and the response being generated.
