Kubnal Bridge

Techniques & Methods

Self-Attention

Self-attention computes a representation of each token in a sequence by attending to all other tokens, weighted by their relevance. This allows the model to capture long-range dependencies—understanding that "it" refers to "the bank" several sentences earlier.

Self-attention is the core innovation of the transformer architecture, enabling parallel processing of sequences and scaling to thousands of tokens. Multi-head attention extends this by learning multiple attention patterns simultaneously.

Authority Links

Related Terms