Description
Add Link
Link to the tutorial:
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
Describe the bug
The tutorial was substantially changed in June 2023; see commit 6c03bb3, which aimed at fixing the implementation of attention, among other things (#2468). In doing so, several other things were changed:
- adding a DataLoader which returns batches of zero-padded sequences to train the network (a padding sketch is shown below)
- the forward() function of the decoder now processes the input one word at a time, in parallel for all sentences in the batch, until MAX_LENGTH is reached
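For readers unfamiliar with zero-padding, here is a minimal sketch of what such a batch looks like; the sequences and the pad_sequence call are my own illustration, not necessarily how the tutorial actually builds its batches:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths; 0 is assumed to be the padding index.
sequences = [torch.tensor([5, 12, 9, 3]),   # 4 tokens
             torch.tensor([7, 2]),          # 2 tokens
             torch.tensor([4, 8, 6])]       # 3 tokens

# Right-pad with zeros so the batch becomes a single (batch, max_len) tensor.
batch = pad_sequence(sequences, batch_first=True, padding_value=0)
# batch is now:
# tensor([[ 5, 12,  9,  3],
#         [ 7,  2,  0,  0],
#         [ 4,  8,  6,  0]])
```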
I am not a torch expert but I think that the embedding layers in the encoder and decoder should have been modified to recognize padding (padding_idx=0 is missing). Using zero-padded sequence as input might also have other implications during learning but I am not sure. Can you confirm that the implementation is correct?
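For example, passing padding_idx when constructing the embedding layers would keep the padding token's embedding fixed at zero and exclude it from gradient updates; a minimal sketch, with a placeholder vocabulary size:

```python
import torch.nn as nn

hidden_size = 128
n_words = 4345  # placeholder vocabulary size; the tutorial computes the real value from the data

# With padding_idx=0, the embedding vector for token 0 is all zeros and is never
# updated during training, so padded positions contribute nothing through the embedding.
embedding = nn.Embedding(n_words, hidden_size, padding_idx=0)
```

If the targets are also zero-padded, the loss might need a matching ignore_index (e.g. nn.NLLLoss(ignore_index=0)), but I am not certain how much this matters in practice here.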
As a result of these changes, the text no longer describes the code well. I think it would be nice to include in the tutorial a discussion of zero-padding and of the implications of using batches for the code. I am also curious whether there is really a gain from using a batch, since most sentences are short.
Finally, I found a mention in the text of using teacher_forcing_ratio, but it is not included in the code. Either the tutorial text or the code needs to be adjusted.
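For reference, the pre-2023 version of the tutorial used teacher_forcing_ratio roughly along these lines; this is a simplified sketch, where decoder_step is a hypothetical callable, not the tutorial's actual API:

```python
import random
import torch

teacher_forcing_ratio = 0.5  # probability of feeding the ground-truth token at each step

def decode_with_teacher_forcing(decoder_step, decoder_input, decoder_hidden, target_tensor):
    """Run the decoder one token at a time, optionally forcing the ground-truth token.

    decoder_step is assumed to be a callable (input, hidden) -> (log_probs, hidden);
    this illustrates the idea only.
    """
    outputs = []
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    for t in range(target_tensor.size(0)):
        log_probs, decoder_hidden = decoder_step(decoder_input, decoder_hidden)
        outputs.append(log_probs)
        if use_teacher_forcing:
            decoder_input = target_tensor[t]                    # feed the ground-truth token
        else:
            decoder_input = log_probs.argmax(dim=-1).detach()   # feed the model's own prediction
    return torch.stack(outputs), decoder_hidden
```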
If this is useful, I found another implementation of the same tutorial which seems to be a fork of a previous version (it was archived in 2021):
- It does not use batches
- It includes teacher_forcing_ratio to select the amount of teacher forcing
- It implements both the Luong et al. and Bahdanau et al. models of attention (sketched briefly below)
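For context, the two attention variants differ mainly in how the alignment score between the decoder query and the encoder outputs is computed; a minimal sketch of the score functions (my own simplification, not that fork's code):

```python
import torch
import torch.nn as nn

class BahdanauScore(nn.Module):
    """Additive attention: score(q, k) = v^T tanh(W_q q + W_k k)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.Wq = nn.Linear(hidden_size, hidden_size)
        self.Wk = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        # query: (batch, 1, hidden), keys: (batch, seq_len, hidden) -> (batch, seq_len)
        return self.v(torch.tanh(self.Wq(query) + self.Wk(keys))).squeeze(-1)

class LuongDotScore(nn.Module):
    """Multiplicative (dot-product) attention: score(q, k) = q . k."""
    def forward(self, query, keys):
        # (batch, 1, hidden) x (batch, hidden, seq_len) -> (batch, seq_len)
        return torch.bmm(query, keys.transpose(1, 2)).squeeze(1)
```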
Describe your environment
I appreciate this tutorial as it provides a simple introduction to Seq2Seq models with a small dataset. I am actually trying to port this tutorial to R with the torch package.
cc @albanD