Attention

In RNN-based sequence-to-sequence models, a single hidden state was passed between the encoder and the decoder. This context vector was a bottleneck: it made it challenging for models to handle long sentences.

In attention models the encoder passes much more data to the decoder. Instead of passing only the last hidden state of the encoding stage, the encoder passes all of its hidden states to the decoder.
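
A minimal sketch of such an encoder, in plain numpy with a vanilla RNN cell standing in for an LSTM/GRU (all names here are illustrative, not from the original post):

```python
import numpy as np

def encode(inputs, W_x, W_h, h0):
    """Run a toy RNN encoder and keep EVERY hidden state.

    inputs: (seq_len, input_dim) embedded source tokens
    W_x:    (hidden_dim, input_dim) input weights
    W_h:    (hidden_dim, hidden_dim) recurrent weights
    h0:     (hidden_dim,) initial hidden state
    """
    h = h0
    hidden_states = []
    for x in inputs:
        # Vanilla RNN update; a real model would use an LSTM or GRU.
        h = np.tanh(W_x @ x + W_h @ h)
        hidden_states.append(h)
    # Return the full (seq_len, hidden_dim) matrix, not just the last row.
    return np.stack(hidden_states)
```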

At each of its time steps, the decoder looks at the set of encoder hidden states it received and gives each hidden state a score. The scores are softmaxed, each hidden state is multiplied by its softmaxed score, and the weighted hidden states are summed to produce the context vector for that time step.
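
The scoring itself can be sketched in a few lines. This version assumes simple dot-product scoring (other scoring functions, such as additive attention, are also used in practice):

```python
import numpy as np

def attention_context(decoder_hidden, encoder_states):
    """Compute one decoder time step's context vector.

    decoder_hidden: (hidden_dim,) current decoder hidden state
    encoder_states: (seq_len, hidden_dim) all encoder hidden states
    """
    # Score each encoder hidden state against the decoder state.
    scores = encoder_states @ decoder_hidden            # (seq_len,)
    # Softmax the scores into attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum: high-scoring states dominate, low-scoring ones fade.
    context = weights @ encoder_states                  # (hidden_dim,)
    return context, weights
```

The returned `weights`, one per source position, are also what attention visualizations plot to show which input words the decoder is focusing on.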

This is how the attention decoder works (a code sketch follows the list):

  1. The attention decoder RNN takes in the embedding of the `<END>` token and an initial decoder hidden state.
  2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
  3. Attention step: we use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
  4. We concatenate h4 and C4 into one vector.
  5. We pass this vector through a feedforward neural network (one trained jointly with the model).
  6. The output of the feedforward neural network indicates the output word of this time step.
  7. Repeat for the next time steps.
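
To tie the steps together, here is a sketch of one full decoder time step, reusing `attention_context()` from above; `rnn_cell` and `W_out` are hypothetical stand-ins for the decoder RNN and the jointly trained feedforward output layer:

```python
import numpy as np

def decoder_step(prev_embedding, prev_hidden, encoder_states,
                 rnn_cell, W_out):
    # Steps 1-2: the RNN consumes the previous token's embedding and
    # hidden state, producing a new hidden state (h4); the RNN output
    # itself is discarded.
    hidden = rnn_cell(prev_embedding, prev_hidden)
    # Step 3: attention step, computing the context vector (C4).
    context, _ = attention_context(hidden, encoder_states)
    # Step 4: concatenate h4 and C4 into one vector.
    combined = np.concatenate([hidden, context])
    # Steps 5-6: the feedforward layer scores the vocabulary; the
    # highest-scoring entry indicates this time step's output word.
    logits = W_out @ combined
    return int(np.argmax(logits)), hidden
```

Step 7 is then just a loop: feed the predicted word's embedding and the new hidden state back into `decoder_step` until an end token is produced.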

Reference

Jay Alammar, "Visualizing a Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)", https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/