The attention mechanism lets neural networks focus on the most relevant parts of input data when producing each output element. In language models, it enables each word to 'attend to' other words in the sequence to understand context and relationships.

How Attention Mechanism Works

In the sentence 'The cat sat on the mat because it was tired,' attention helps the model determine that 'it' refers to 'cat' by calculating high attention scores between those words. Without attention, models struggle with these long-range relationships.

Key Concepts

  • Query, Key, Value — The three matrices in attention, typically learned projections of the input — Query asks 'what am I looking for?', Key says 'what do I contain?', Value carries the actual information to be passed along
  • Attention Score — A weight indicating how relevant each input element is to the current output — computed as the dot product of a query with each key, then normalized with softmax
  • Scaled Dot-Product — The standard attention formula: softmax(QK^T / sqrt(d_k)) * V — dividing by sqrt(d_k) keeps the dot products from growing with the key dimension and saturating the softmax
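The formula above can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration (no masking, no multiple heads); the function name and the toy 3-token example are chosen here for demonstration, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy example: 3 tokens with d_k = 4, using the input as Q, K, and V
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(X, X, X)
print(weights.sum(axis=-1))  # each row of weights sums to 1
```

Each row of `weights` says how much the corresponding token attends to every token in the sequence, and the output is the weighted mix of value vectors.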

Frequently Asked Questions

Is attention the same as self-attention?

Self-attention is a specific type where a sequence attends to itself. Cross-attention attends to a different sequence (like in translation, where the decoder attends to the encoder output).
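The distinction shows up only in where Q, K, and V come from; the computation itself is identical. A small self-contained sketch (the variable names `decoder_states` and `encoder_states` are illustrative assumptions, not from a specific model):

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
decoder_states = rng.normal(size=(2, 8))  # 2 target tokens (queries)
encoder_states = rng.normal(size=(5, 8))  # 5 source tokens (keys/values)

# Self-attention: the sequence attends to itself (Q, K, V from one source)
self_out = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: queries from the decoder, keys/values from the encoder
cross_out = attention(decoder_states, encoder_states, encoder_states)

print(self_out.shape, cross_out.shape)  # both (2, 8): one output per query
```

Note that the output always has one row per query, regardless of how long the key/value sequence is.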

Why is attention important for AI?

Attention removed the bottleneck of compressing an entire input sequence into a single fixed-length vector. It lets models dynamically focus on the relevant information at each step, enabling the breakthroughs behind GPT, BERT, and virtually all modern LLMs.