Techniques & Methods

Self-Attention

Self-attention computes a representation of each token in a sequence by attending to every token in that sequence, itself included, and weighting each one by its relevance. This lets the model capture long-range dependencies, for example resolving that "it" refers to "the bank" several sentences earlier.
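The weighting described above is commonly implemented as scaled dot-product attention. The following is a minimal NumPy sketch under stated assumptions, not a definitive implementation: the names `self_attention`, `w_q`, `w_k`, and `w_v` are illustrative, the random matrices stand in for learned parameters, and masking, batching, and bias terms are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical names)
    Returns the attended output and the attention weights.
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers for matching
    v = x @ w_v  # values: the content that gets mixed together

    # Relevance of every token to every other token, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Each row becomes a probability distribution over the sequence.
    weights = softmax(scores, axis=-1)

    # Each output row is a relevance-weighted average of the value rows.
    return weights @ v, weights


# Random stand-ins for learned parameters (assumption for illustration only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Row `i` of `weights` shows how strongly token `i` attends to each position in the sequence, and each row sums to 1.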

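Self-attention is the core mechanism of the transformer architecture. Because every position is computed at once rather than one step at a time, sequences can be processed in parallel, which lets models scale to contexts of thousands of tokens. Multi-head attention extends this by running several attention operations side by side, each with its own learned projections, so the model learns multiple attention patterns simultaneously.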
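One common way to realize multi-head attention is to split the projected queries, keys, and values into several smaller heads, attend within each head independently, and then concatenate the results. The sketch below assumes that layout; `multi_head_self_attention`, `w_o`, and `num_heads` are illustrative names, the weights are random stand-ins for learned parameters, and masking and batching are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over one sequence (illustrative sketch).

    x:                  (seq_len, d_model)
    w_q, w_k, w_v, w_o: (d_model, d_model) projections (hypothetical names)
    """
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every head computes its own attention pattern over the sequence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # (num_heads, seq_len, seq_len)
    heads = weights @ v                 # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Random stand-ins for learned parameters (assumption for illustration only).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mh_out, mh_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(mh_out.shape)      # (5, 8)
print(mh_weights.shape)  # (2, 5, 5)
```

Each slice `mh_weights[h]` is a separate attention pattern, which is what allows different heads to track different relationships between tokens.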

Authority Links

Attention Is All You Need — arXiv

Original transformer paper introducing self-attention mechanisms.

Self-Attention — Wikipedia

How attention mechanisms work in transformer models.

Related Terms

Techniques & Methods

Attention

Core mechanism in transformers that dynamically weights the importance of different input positions.

Techniques & Methods

Attention Mechanism

Neural network technique enabling models to focus on the most relevant parts of input when producing each output.

Model Components

Transformer

A neural-network architecture, introduced by Vaswani et al. in 2017, that uses self-attention and parallel computation across all sequence positions — the foundation under virtually every frontier language and multimodal model in production today.

Model Components

Transformer Decoder

Transformer component that generates output sequences by attending to encoded inputs and prior outputs.
