Description
Hi,
I believe the code/notebook example of a sequence-to-sequence model with attention should be changed.
Attention (as described in Bahdanau 2014) is a mechanism for looking at the encoder's hidden states (now often referred to as cross-attention when talking about Transformers). Its theoretical justification is that it gives the decoder a representation whose size varies with the (variable) input sequence length, so the encoder's final hidden state does not become a representation bottleneck.
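Concretely, at decoder step $t$, Bahdanau attention scores each encoder hidden state $h_i$ against the decoder's *previous* hidden state $s_{t-1}$, and mixes the encoder states with the resulting weights:

$$e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i), \qquad \alpha_{t,i} = \operatorname{softmax}_i(e_{t,i}), \qquad c_t = \sum_i \alpha_{t,i} h_i$$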
The tutorial/code currently computes attention using the decoder's hidden state after the first input and the embedding of the decoder's first output. I believe this is incorrect.
I think it should instead use the decoder's prior hidden state and the encoder's hidden states, so the model knows which parts of the input sequence to focus on for the current timestep of the decoder's output.
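For reference, here is a minimal sketch of the kind of additive (Bahdanau-style) attention I mean, where the scores come from the decoder's previous hidden state and the encoder's hidden states. The module and parameter names (`BahdanauAttention`, `W_query`, `W_keys`, `v`) are just illustrative, not taken from the tutorial:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Additive attention: scores each encoder hidden state against the
    decoder's previous hidden state (illustrative sketch, not the tutorial's code)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_query = nn.Linear(hidden_size, hidden_size)  # projects decoder state
        self.W_keys = nn.Linear(hidden_size, hidden_size)   # projects encoder states
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, decoder_prev_hidden, encoder_outputs):
        # decoder_prev_hidden: (batch, hidden_size) -- decoder state from the previous step
        # encoder_outputs:     (batch, src_len, hidden_size) -- one state per source token
        scores = self.v(torch.tanh(
            self.W_query(decoder_prev_hidden).unsqueeze(1) + self.W_keys(encoder_outputs)
        ))                                                     # (batch, src_len, 1)
        weights = F.softmax(scores, dim=1)                     # attention over source positions
        context = torch.sum(weights * encoder_outputs, dim=1)  # (batch, hidden_size)
        return context, weights.squeeze(-1)

# Example usage with random tensors:
attn = BahdanauAttention(hidden_size=256)
enc_out = torch.randn(2, 10, 256)            # 2 sentences, 10 source tokens each
dec_hidden = torch.randn(2, 256)             # decoder state from the previous timestep
context, weights = attn(dec_hidden, enc_out) # weights sum to 1 over the 10 source tokens
```

The context vector would then be fed into the decoder at the current timestep, as in the Bahdanau paper.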
I can submit a PR to change this in the tutorial if that's correct; thanks.
cc @pytorch/team-text-core @Nayef211