Description
Hi,
I believe the code/notebook example of a sequence-to-sequence model with attention should be changed.
Attention (as described in Bahdanau 2014) is a mechanism for looking at the encoder's hidden states (now often referred to as cross-attention when talking about Transformers). Its theoretical justification is that it gives the decoder a representation whose size varies with the (variable) input sequence length, so the encoder's final hidden state does not become a representation bottleneck.
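Concretely, at decoder step $t$, Bahdanau attention scores each encoder hidden state $h_i$ against the decoder's *previous* hidden state $s_{t-1}$, and mixes the encoder states with the resulting weights:

$$e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i), \qquad \alpha_{t,i} = \operatorname{softmax}_i(e_{t,i}), \qquad c_t = \sum_i \alpha_{t,i} h_i$$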
The tutorial/code currently computes attention using the decoder's hidden state after the first input and the embedding of the decoder's first output. I believe this is incorrect.
I think it should instead use the decoder's prior hidden state and the encoder's hidden states, so the model knows which parts of the input sequence to focus on for the current timestep of the decoder's output.
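For reference, here is a minimal sketch of the kind of additive (Bahdanau-style) attention I mean, where the scores come from the decoder's previous hidden state and the encoder's hidden states. The module and parameter names (`BahdanauAttention`, `W_query`, `W_keys`, `v`) are just illustrative, not taken from the tutorial:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Additive attention: scores each encoder hidden state against the
    decoder's previous hidden state (illustrative sketch, not the tutorial's code)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_query = nn.Linear(hidden_size, hidden_size)  # projects decoder state
        self.W_keys = nn.Linear(hidden_size, hidden_size)   # projects encoder states
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, decoder_prev_hidden, encoder_outputs):
        # decoder_prev_hidden: (batch, hidden_size) -- decoder state from the previous step
        # encoder_outputs:     (batch, src_len, hidden_size) -- one state per source token
        scores = self.v(torch.tanh(
            self.W_query(decoder_prev_hidden).unsqueeze(1) + self.W_keys(encoder_outputs)
        ))                                                     # (batch, src_len, 1)
        weights = F.softmax(scores, dim=1)                     # attention over source positions
        context = torch.sum(weights * encoder_outputs, dim=1)  # (batch, hidden_size)
        return context, weights.squeeze(-1)

# Example usage with random tensors:
attn = BahdanauAttention(hidden_size=256)
enc_out = torch.randn(2, 10, 256)            # 2 sentences, 10 source tokens each
dec_hidden = torch.randn(2, 256)             # decoder state from the previous timestep
context, weights = attn(dec_hidden, enc_out) # weights sum to 1 over the 10 source tokens
```

The context vector would then be fed into the decoder at the current timestep, as in the Bahdanau paper.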
I can submit a PR to change this in the tutorial if that's correct; thanks.
cc @pytorch/team-text-core @Nayef211