Techniques & Methods
Self-Attention
Self-attention computes a representation of each token in a sequence by attending to every token in that sequence, itself included, and weighting each one by its relevance. This lets the model capture long-range dependencies, for example resolving that "it" refers to "the bank" mentioned several sentences earlier.
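Self-attention is the core innovation of the transformer architecture. Because every position is computed at once rather than step by step, sequences can be processed in parallel and the approach scales to thousands of tokens. Multi-head attention extends this by running several attention operations side by side, each learning a different attention pattern.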
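The weighting described above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration that assumes the standard scaled dot-product formulation (queries, keys, and values, with scores divided by the square root of the key dimension), which the entry itself does not spell out. The helper names (`softmax`, `matvec`, `self_attention`), the toy token vectors, and the identity projection matrices are hypothetical choices made for readability, not part of any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (assumed formulation).

    X is a list of token vectors; Wq, Wk, Wv are projection matrices.
    Returns (outputs, weights), where weights[i][t] is how strongly
    token i attends to token t.
    """
    Q = [matvec(x, Wq) for x in X]  # queries
    K = [matvec(x, Wk) for x in X]  # keys
    V = [matvec(x, Wv) for x in X]  # values
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Relevance of every token (including this one) to the current query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # New representation: relevance-weighted sum of the value vectors.
        out = [sum(w[t] * V[t][j] for t in range(len(V))) for j in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input: three 2-dimensional "tokens" and identity projections (illustrative only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(X, I2, I2, I2)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. A multi-head variant would run this routine several times with different projection matrices and concatenate the results. In a trained model the projections are learned rather than fixed.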
Related Terms
Techniques & Methods
Attention
Core mechanism in transformers that dynamically weights the importance of different input positions.
Techniques & Methods
Attention Mechanism
Neural network technique enabling models to focus on the most relevant parts of input when producing each output.
Model Components
Transformer
A neural-network architecture, introduced by Vaswani et al. in 2017, that uses self-attention and parallel computation across all sequence positions — the foundation under virtually every frontier language and multimodal model in production today.
Model Components
Transformer Decoder
Transformer component that generates output sequences by attending to encoded inputs and prior outputs.

