Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

May 20, 2026

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Introduction

Background

Model Architecture

Why Self-Attention

Training

Results

Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
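As a rough illustration of the self-attention operation referred to above, the following is a minimal sketch, not the tensor2tensor implementation. It assumes a single head and uses the input vectors directly as queries, keys and values, omitting the learned projections, multiple heads, masking and positional information a full model would need. The names `softmax`, `self_attention` and `tokens` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention (illustrative only).

    x is a list of n vectors, each a list of d floats. Queries, keys and
    values are all taken to be x itself, a simplifying assumption.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Dot product of this position with every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Each output position depends only on the inputs, not on the outputs of other positions, so the iterations of the outer loop could in principle run in parallel. A recurrent layer, by contrast, must process positions one after another.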

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
