What is “Attention”? | NLP From the Basics, Part 2/4

Rajdeep Borgohain
4 min read · Feb 24, 2022

The problem with the encoder-decoder model arises when we start working with lengthy sentences or long paragraphs. A single fixed-length context vector cannot capture the meaning of every input word, so the essence of the sentence is lost when translating from one language to another.

Table of Contents

  1. How humans translate language
  2. Basic Intuition of Attention
  3. Working of Attention
  4. How to compute alpha_ji’s?
  5. Conclusion

1. How humans translate language

Humans read a few words of the sentence, translate them into the other language, and keep repeating this process until the whole sentence is done. They pay attention to a subset of the words in the sentence, translate it, and then move their attention to the following subset of words.

2. Basic Intuition of Attention

We use a bi-directional RNN (without the softmax part) in the encoder network and a unidirectional RNN in the decoder network. The encoder is bi-directional because each output word can depend on input words that appear both before and after the current position, while the decoder is unidirectional because it generates one word after another. The blue cells in the encoder network are the forward units, and the green cells are the backward units. At each position the forward hidden vector h_t(→) and the backward hidden vector h_t(←) are concatenated into a single annotation, h_t = [h_t(→) ; h_t(←)].
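A minimal sketch of this idea, using a toy vanilla-RNN cell and made-up dimensions rather than the author's actual model: the forward and backward passes each produce one hidden state per input word, and the two states for the same position are concatenated into the annotation h_t.

```python
# Sketch: bi-directional encoder annotations (toy vanilla RNN, random weights).
import numpy as np

def rnn_step(x, h_prev, W_x, W_h):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return np.tanh(W_x @ x + W_h @ h_prev)

rng = np.random.default_rng(0)
Tx, d_in, d_hid = 4, 8, 16                      # 4 input words, toy dimensions
xs = [rng.normal(size=d_in) for _ in range(Tx)]

Wx_f, Wh_f = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))
Wx_b, Wh_b = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))

# Forward units: left-to-right over the sentence
h_fwd, h = [], np.zeros(d_hid)
for x in xs:
    h = rnn_step(x, h, Wx_f, Wh_f)
    h_fwd.append(h)

# Backward units: right-to-left over the same sentence
h_bwd, h = [None] * Tx, np.zeros(d_hid)
for t in reversed(range(Tx)):
    h_bwd[t] = h = rnn_step(xs[t], h, Wx_b, Wh_b)

# Annotation for position t: concatenation [h_t(->) ; h_t(<-)]
annotations = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(annotations[0].shape)                     # (32,) = 2 * d_hid
```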

3. Working of Attention

Let us consider y1, y2, y3, y4 to be the outputs of the decoder model, which are the words in the Assamese language. Now, the output y2 should not depend only on the words x1 and x2 but also on the words after x2, which are x3 and x4. To build that understanding, we use a bi-directional RNN, which helps us capture information from the last word back to the first.

Now the context vector c_j is an input to the decoder network: every output that gets generated has its own input context vector. Each output of the encoder network is connected to every context vector used to generate the outputs of the decoder network. For example:

The context vector c_1 is a weighted sum of the outputs we get from the encoder model: c_1 = alpha_11·h_1 + alpha_12·h_2 + … + alpha_1Tx·h_Tx. Here Tx is the length of the input sentence, which in this example is 4. As a constraint, we want every alpha_1i to be non-negative and the alpha_1i's to sum to 1. These constraints keep the weights bounded instead of letting them drift toward -infinity or +infinity.
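A minimal sketch of that weighted sum, assuming Tx = 4 encoder annotations and some hypothetical raw alignment scores e_1i: passing the scores through a softmax is what enforces the two constraints (non-negative weights that sum to 1).

```python
# Sketch: context vector c_1 as a softmax-weighted sum of encoder annotations.
import numpy as np

def softmax(scores):
    scores = scores - scores.max()              # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

rng = np.random.default_rng(1)
Tx, d = 4, 32
h = rng.normal(size=(Tx, d))                    # encoder annotations h_1 .. h_4
e_1 = rng.normal(size=Tx)                       # hypothetical raw scores e_11 .. e_14

alpha_1 = softmax(e_1)                          # alpha_1i >= 0 and sum(alpha_1) == 1
c_1 = (alpha_1[:, None] * h).sum(axis=0)        # c_1 = sum_i alpha_1i * h_i
print(alpha_1.sum(), c_1.shape)                 # ~1.0, (32,)
```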

4. How to compute alpha_ji’s?

The big question here is how to design the alpha_ji's so that they satisfy the constraints mentioned above. We define alpha_ji as a softmax over alignment scores: alpha_ji = exp(e_ji) / Σ_k exp(e_jk), which makes each alpha_ji non-negative and makes them sum to 1 over the input positions.

Now, e_ji is a function of s_j-1 and h_i: alpha_ji depends on e_ji, and e_ji depends on the output of the encoder model at the i-th time step, h_i, and the previous decoder state s_j-1. This function is just a small feed-forward neural network, and with the help of backpropagation we learn its weights along with the rest of the model.
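A minimal sketch of such an alignment network (additive attention), with hypothetical weight names Wa, Ua, va and toy dimensions; in a real model these weights would be learned jointly with the encoder and decoder via backpropagation rather than drawn at random.

```python
# Sketch: e_ji = va . tanh(Wa s_{j-1} + Ua h_i), then softmax -> alpha_ji -> c_j.
import numpy as np

rng = np.random.default_rng(2)
d_dec, d_enc, d_att, Tx = 16, 32, 20, 4

Wa = rng.normal(size=(d_att, d_dec))            # projects the decoder state s_{j-1}
Ua = rng.normal(size=(d_att, d_enc))            # projects an encoder annotation h_i
va = rng.normal(size=d_att)                     # combines them into a scalar score

def score(s_prev, h_i):
    """Alignment score e_ji for one encoder position."""
    return va @ np.tanh(Wa @ s_prev + Ua @ h_i)

s_prev = rng.normal(size=d_dec)                 # previous decoder state s_{j-1}
H = rng.normal(size=(Tx, d_enc))                # encoder annotations h_1 .. h_Tx

e_j = np.array([score(s_prev, h_i) for h_i in H])
alpha_j = np.exp(e_j - e_j.max())
alpha_j = alpha_j / alpha_j.sum()               # softmax -> attention weights
c_j = alpha_j @ H                               # context vector for decoder step j
print(alpha_j.round(3), c_j.shape)
```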

5. Conclusion

As the sentence length increases, the BLEU score (a measure of translation quality) of RNNsearch-50, the attention model trained on sentences of up to 50 words, remains consistent even for long sentences.

The major drawback is the time complexity, which is O(length of the input sentence × length of the output sentence), since every decoder step computes an attention weight for every encoder output.
