Attention to Transformers

Transformers are popular nowadays. They have become state of the art across domains since they first debuted in the paper “Attention Is All You Need” by Vaswani et al. [1]. Although we simply call them “transformers”, the core idea is the attention mechanism as we first saw it in the paper “Neural Machine Translation by Jointly Learning to Align and Translate” by Bahdanau et al. [2]. In this article, I’d like to explain what the attention mechanism is, why transformers are so effective at what they do, and the latest advancements in transformers.

Let’s start with the attention mechanism. In the original paper, the attention weight (alpha_t) is defined as the probability that an encoder embedding (h_t) matters for generating the expected output (y_t) and deciding the next decoder state (s_t), given the state at the previous time step (s_{t-1}). In other words: out of all the current embeddings, how much attention should we pay to a specific one while generating the expected output and moving on to the next state?
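Concretely, Bahdanau-style (additive) attention can be sketched in a few lines of NumPy. The function name, dimensions, and random parameters below are all illustrative, not from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style attention: score each encoder state h_t against the
    previous decoder state s_prev, then normalize scores into weights alpha."""
    # e_t = v^T tanh(W_s s_prev + W_h h_t), computed for every h_t in H
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v
    alpha = softmax(scores)          # attention weights over encoder states
    context = alpha @ H              # weighted sum of encoder states
    return alpha, context

# toy dimensions: 5 encoder states, 8-dimensional embeddings
rng = np.random.default_rng(0)
T, d = 5, 8
H = rng.normal(size=(T, d))                        # encoder embeddings h_1..h_T
s_prev = rng.normal(size=d)                        # previous decoder state
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

alpha, context = additive_attention(s_prev, H, W_s, W_h, v)
print(np.isclose(alpha.sum(), 1.0))  # True — alpha is a probability distribution
```

The weights `alpha` are exactly the “how much attention should we pay to each embedding” quantity from the paragraph above, and `context` is what the decoder consumes at the current step.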

Although the original attention paper was in the NLP domain, the same method applies to computer vision tasks, too. Indeed, the first paper to use the same method was Xu et al.’s “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” [3]. Some say that Mnih et al.’s famous paper “Recurrent Models of Visual Attention” [4] and Ba et al.’s “Multiple Object Recognition with Visual Attention” [5] also describe attention, but they use a different method, which we now call “hard attention”, as described in Xu et al.

Behold The Transformer!

So the answer to that question is “the transformer”. Instead of an encoder-decoder recurrent network, the transformer architecture uses only the attention mechanism (hence the name of the paper). But why and how is this architecture more efficient than the original attention-based encoder-decoder recurrent model?

Why Is It Efficient?

  1. Reducing the computational complexity
  2. Maximizing the parallelization of operations
  3. Minimizing the path length between long-range dependencies.

To understand what these three items mean, we should take a look at self-attention.


An example of the self-attention matrix.

Note that this is a set, and the order of items in a set does not matter! So the original paper brings in another technique to account for the position of each token. This technique is called positional encoding, which we will cover in a bit…
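That order-blindness is easy to verify numerically. Below is a minimal self-attention with no learned projections (purely illustrative): shuffling the input tokens just shuffles the output rows the same way, so without positional encoding the mechanism cannot tell one ordering from another.

```python
import numpy as np

def self_attention(X):
    """Minimal self-attention: weights come from dot products of the
    tokens themselves (no learned Q/K/V projections, for illustration)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))          # 4 tokens, 6 dimensions each
perm = np.array([2, 0, 3, 1])        # shuffle the token order

out = self_attention(X)
out_perm = self_attention(X[perm])

# Permuting the input only permutes the output rows — the mechanism
# itself is order-blind, hence the need for positional encoding.
print(np.allclose(out_perm, out[perm]))  # True
```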

A Big Oh!

Remember Remember The Transformer

How Does It Work?

The Transformer architecture as shown in the original paper [1]

Positional Embedding

Positional embedding for 100 input tokens w.r.t 512-dimension.
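A plot like the one above can be reproduced with a short NumPy sketch of the sinusoidal encoding from the original paper (the function name is mine; the formula is the paper’s):

```python
import numpy as np

def positional_encoding(n_tokens, d_model):
    """Sinusoidal positional encoding from the original transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_tokens)[:, None]          # token positions 0..n-1
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dims get sine
    pe[:, 1::2] = np.cos(angle)                 # odd dims get cosine
    return pe

pe = positional_encoding(100, 512)   # the 100-token, 512-dimension case
print(pe.shape)     # (100, 512)
print(pe[0, :4])    # position 0: alternating sin(0)=0, cos(0)=1 → [0. 1. 0. 1.]
```

Each position gets a unique pattern of values, and because the wavelengths vary per dimension, relative offsets between positions are easy for the model to pick up.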

Query, Key, and Value

hardmaru says this is the most important equation in deep learning since 2018.
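The equation in question is scaled dot-product attention from [1]: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch, with illustrative dimensions and random matrices standing in for the learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, from [1]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))    # 10 tokens, d_model = 64
# In self-attention, Q, K, and V are linear projections of the same input X.
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (10, 64)
```

The sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.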

Multi-head Attention

Multi-head attention parallelizes self-attention computation.
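A rough NumPy sketch of the idea, with random matrices standing in for the learned per-head projections (everything here is illustrative, not the paper’s exact parameterization):

```python
import numpy as np

def multi_head_attention(X, n_heads=8):
    """Multi-head self-attention sketch: split the model dimension across
    n_heads smaller heads, attend in each head independently, then
    concatenate and project back to d_model."""
    n, d_model = X.shape
    d_k = d_model // n_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_heads):
        # per-head Q/K/V projections of the same input (learned in a real model)
        Q = X @ rng.normal(size=(d_model, d_k))
        K = X @ rng.normal(size=(d_model, d_k))
        V = X @ rng.normal(size=(d_model, d_k))
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
        heads.append(w @ V)                          # each head: (n, d_k)
    concat = np.concatenate(heads, axis=-1)          # back to (n, d_model)
    return concat @ rng.normal(size=(d_model, d_model))  # output projection

X = np.random.default_rng(1).normal(size=(10, 64))
out = multi_head_attention(X)
print(out.shape)  # (10, 64)
```

The loop makes the structure explicit, but note that the heads do not depend on each other, so in practice they are computed in parallel as one batched matrix multiplication.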

Wrapping Up

Transformers Roll Out!

  • In the NLP domain, GPT-2 [6], BERT [7], and, lately, GPT-3 [8] became state of the art. No need to mention how popular they are.
  • In the computer vision domain, Image Transformer [9] and recently Vision Transformer [10] are worth mentioning.

Note: this is far from a comprehensive list, and I skipped a ton of great papers.



[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.

[2] Bahdanau, D., Cho, K. H., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv E-Prints, arXiv:1409.0473.

[3] Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. CoRR, abs/1502.0, 2048–2057.

[4] Mnih, V., Heess, N., Graves, A., & kavukcuoglu, K. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems, 3(January), 2204–2212.

[5] Ba, J. L., Mnih, V., & Kavukcuoglu, K. (2015). Multiple object recognition with visual attention. 3rd International Conference on Learning Representations, ICLR 2015 — Conference Track Proceedings, 1–10.

[6] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

[7] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.

[8] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint.

[9] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. 35th International Conference on Machine Learning, ICML 2018, 9, 6453–6462.

[10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

