Attention is all you need
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017).
https://dl.acm.org/doi/pdf/10.5555/3295222.3295349
The Transformer model, introduced in this paper, replaces recurrence and convolution in sequence models with a structure based entirely on attention mechanisms. By doing so, it achieves faster training and better performance on machine translation tasks, setting a new standard for how neural networks handle sequences.
The New Hotness
Previous models for tasks like translation leaned heavily on recurrent neural networks (RNNs)[^1] or convolutional layers[^2] to process input and output sequences. The key innovation here is to drop both of those entirely. The Transformer uses only attention[^3]—specifically self-attention and encoder-decoder attention—alongside simple feedforward layers. This design allows the model to process sequence elements in parallel instead of sequentially.
Key Insight
The core idea is that sequence models don’t need to process data one step at a time to capture relationships between elements. Attention mechanisms can directly connect all parts of a sequence to one another, regardless of their positions. This not only simplifies the architecture but also shortens the “path” between distant positions, making it easier to learn long-range dependencies.
How It Works
The Transformer follows an encoder-decoder structure, where both the encoder and decoder stacks use multi-head attention layers combined with position-wise feedforward networks; the decoder additionally attends to the encoder’s output through encoder-decoder attention.
- Self-attention allows each token in the sequence to weigh the relevance of every other token.
- Scaled dot-product attention computes these weights efficiently using matrix multiplications.
- Multi-head attention runs several attention operations in parallel, letting the model focus on different types of relationships at once (see the first code sketch after this list).
- Since the model has no inherent sense of token order (because there’s no recurrence), positional encodings are added to the inputs to provide information about sequence order; the second sketch after this list shows the sinusoidal version used in the paper.
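To make the attention computation concrete, here is a minimal NumPy sketch of scaled dot-product attention and a multi-head wrapper. The sequence length, `num_heads`, and the random projection matrices are illustrative placeholders, not the paper’s trained weights.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V over the last two axes."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)     # (..., seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the key axis
    return weights @ V                                     # weighted sum of value vectors

def multi_head_self_attention(x, num_heads=8, seed=0):
    """Project x into per-head queries, keys, and values, attend in parallel,
    then concatenate the heads and project back to d_model."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    # Random matrices stand in for the learned projection weights.
    W_q, W_k, W_v = (rng.normal(size=(num_heads, d_model, d_head)) for _ in range(3))
    W_o = rng.normal(size=(d_model, d_model))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                    # each (num_heads, seq, d_head)
    heads = scaled_dot_product_attention(Q, K, V)          # (num_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # (seq, d_model)

# Example: 10 tokens with d_model = 512, the base model's dimension.
tokens = np.random.default_rng(1).normal(size=(10, 512))
print(multi_head_self_attention(tokens).shape)             # (10, 512)
```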
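The positional encodings in the paper are fixed sinusoids of different frequencies added to the token embeddings. Below is a small sketch of that formula; `max_len` and `d_model` here are just example values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...).
    Assumes d_model is even."""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                               # even dimensions
    pe[:, 1::2] = np.cos(angles)                               # odd dimensions
    return pe

# Added (not concatenated) to the embeddings before the first layer.
print(sinusoidal_positional_encoding(max_len=10, d_model=512).shape)   # (10, 512)
```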
Results
The Transformer achieves state-of-the-art results on machine translation benchmarks:
- A BLEU score of 28.4 on English-to-German translation, outperforming previous models, including ensembles.
- A BLEU score of 41.8 on English-to-French, again setting a new high mark.
It reaches these results with significantly less training time—about 3.5 days on 8 GPUs, much faster than RNN-based systems. The model also generalizes well to tasks beyond translation, such as English constituency parsing, where it competes with or exceeds prior methods.
Why It Matters
This architecture dramatically improves training speed by allowing full parallelization across sequence elements—something RNNs fundamentally can’t do. It also simplifies the model structure while improving the ability to model long-range dependencies. These qualities make the Transformer a general-purpose backbone for sequence processing tasks, not just translation.
The Transformer laid the groundwork for nearly every modern language model that followed, including BERT, GPT, and T5. By showing that attention alone could outperform more complex architectures, this paper fundamentally shifted how researchers approach sequence modeling, and its influence now extends well beyond NLP, into areas like computer vision and protein folding.
[^1]: A type of neural network designed for sequence data, where the output at each step depends on previous steps via hidden state connections. This sequential structure makes RNNs hard to parallelize and prone to issues like vanishing gradients.
[^2]: Neural network layers that process data using sliding filters (kernels) across the input, commonly used in image and sequence models to capture local patterns or features.
[^3]: A method for computing weighted combinations of input elements, where the weights reflect how relevant each input is to the current output. Concretely, given a query vector and a set of key-value pairs, attention computes a score for each key relative to the query (often using a dot product). These scores are passed through a softmax function to produce weights that sum to one. The final output is a weighted sum of the value vectors, using these weights. This allows the model to “attend” to different parts of the input depending on context—effectively learning which words (or features) to focus on when generating each part of the output. Example: In machine translation, when generating the French word for dog, the model can use attention to assign higher weights to the English word dog in the input sentence, rather than treating all input words equally.