This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Unable to reproduce WMT En2De results #317

Open
@edunov

I tried to reproduce the results from the paper on WMT En2De with the base model. In my experiments I tried both a BPE model and a wordpiece model. Here are the steps I took to train the models:

# For BPE model I used this setup
PROBLEM=translate_ende_wmt_bpe32k
# For word piece model I used this setup
PROBLEM=translate_ende_wmt32k

MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=$HOME/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

I trained both models until the trainer finished (~1 day). The last update for the BPE model was:

INFO:tensorflow:Validation (step 250000): loss = 1.73099, metrics-translate_ende_wmt_bpe32k/accuracy = 0.633053, metrics-translate_ende_wmt_bpe32k/accuracy_per_sequence = 0.0, metrics-translate_ende_wmt_bpe32k/accuracy_top5 = 0.819939, metrics-translate_ende_wmt_bpe32k/approx_bleu_score = 0.306306, metrics-translate_ende_wmt_bpe32k/neg_log_perplexity = -1.98039, metrics-translate_ende_wmt_bpe32k/rouge_2_fscore = 0.38708, metrics-translate_ende_wmt_bpe32k/rouge_L_fscore = 0.589309, global_step = 249006

INFO:tensorflow:Saving dict for global step 250000: global_step = 250000, loss = 1.7457, metrics-translate_ende_wmt_bpe32k/accuracy = 0.638563, metrics-translate_ende_wmt_bpe32k/accuracy_per_sequence = 0.0, metrics-translate_ende_wmt_bpe32k/accuracy_top5 = 0.823388, metrics-translate_ende_wmt_bpe32k/approx_bleu_score = 0.290224, metrics-translate_ende_wmt_bpe32k/neg_log_perplexity = -1.93242, metrics-translate_ende_wmt_bpe32k/rouge_2_fscore = 0.373072, metrics-translate_ende_wmt_bpe32k/rouge_L_fscore = 0.574759


For the wordpiece model, the last update was:

INFO:tensorflow:Validation (step 250000): loss = 1.56711, metrics-translate_ende_wmt32k/accuracy = 0.655595, metrics-translate_ende_wmt32k/accuracy_per_sequence = 0.0360065, metrics-translate_ende_wmt32k/accuracy_top5 = 0.836071, metrics-translate_ende_wmt32k/approx_bleu_score = 0.358524, metrics-translate_ende_wmt32k/neg_log_perplexity = -1.84754, metrics-translate_ende_wmt32k/rouge_2_fscore = 0.440053, metrics-translate_ende_wmt32k/rouge_L_fscore = 0.628949, global_step = 248578

INFO:tensorflow:Saving dict for global step 250000: global_step = 250000, loss = 1.57279, metrics-translate_ende_wmt32k/accuracy = 0.65992, metrics-translate_ende_wmt32k/accuracy_per_sequence = 0.00284091, metrics-translate_ende_wmt32k/accuracy_top5 = 0.841923, metrics-translate_ende_wmt32k/approx_bleu_score = 0.368791, metrics-translate_ende_wmt32k/neg_log_perplexity = -1.80413, metrics-translate_ende_wmt32k/rouge_2_fscore = 0.445689, metrics-translate_ende_wmt32k/rouge_L_fscore = 0.636854
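
To compare a single metric such as approx_bleu_score across the two runs without reading the full log entries, a small filter along these lines might help. This is only a sketch: `extract_metric` is a hypothetical helper name, and it assumes each log entry sits on one line with metrics written as `<prefix>/<name> = <value>`, as in the trainer output quoted here.

```shell
# Hypothetical helper: print the value of one metric for every log line on stdin.
# Assumes one log entry per line and metrics of the form "<prefix>/<name> = <value>".
extract_metric() {
  metric="$1"
  sed -n "s|.*/${metric} = \([-0-9.eE]*\).*|\1|p"
}

# Abbreviated copy of the BPE "Saving dict" entry, used as sample input.
printf '%s\n' 'INFO:tensorflow:Saving dict for global step 250000: global_step = 250000, loss = 1.7457, metrics-translate_ende_wmt_bpe32k/accuracy = 0.638563, metrics-translate_ende_wmt_bpe32k/approx_bleu_score = 0.290224' \
  | extract_metric approx_bleu_score
```

The trailing `" = "` in the pattern keeps `accuracy` from also matching `accuracy_top5` or `accuracy_per_sequence`.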

Then I evaluated both models on newstest2013, newstest2014, and newstest2015. Here are the commands I used, mostly following the steps in https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh:

For the BPE model:

YEAR=2013
#YEAR=2014
#YEAR=2015
BEAM_SIZE=5
ALPHA=0.6
t2t-decoder   --data_dir=$DATA_DIR   \
    --problems=$PROBLEM   --model=$MODEL   \
    --hparams_set=$HPARAMS   --output_dir=$TRAIN_DIR   \
    --decode_beam_size=$BEAM_SIZE   --decode_alpha=$ALPHA   \
    --decode_from_file=/tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en

#Tokenize reference
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < /tmp/t2t_datagen/newstest${YEAR}.de > /tmp/t2t_datagen/newstest${YEAR}.de.tok
#Do compound splitting on the reference
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < /tmp/t2t_datagen/newstest${YEAR}.de.tok > /tmp/t2t_datagen/newstest${YEAR}.de.atat

#Remove BPE tokenization
cat /tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en.transformer.transformer_base.beam5.alpha0.6.decodes | sed 's/@@ //g' > /tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en.transformer.transformer_base.beam5.alpha0.6.words
#Do compound splitting on the translation
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < /tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en.transformer.transformer_base.beam5.alpha0.6.words > /tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en.transformer.transformer_base.beam5.alpha0.6.atat
#Score
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl /tmp/t2t_datagen/newstest${YEAR}.de.atat < /tmp/t2t_datagen/newstest${YEAR}.tok.bpe.32000.en.transformer.transformer_base.beam5.alpha0.6.atat

For the wordpiece model:

YEAR=2013
#YEAR=2014
#YEAR=2015 
BEAM_SIZE=5
ALPHA=0.6
t2t-decoder   --data_dir=$DATA_DIR   --problems=$PROBLEM   \
    --model=$MODEL   --hparams_set=$HPARAMS   \
    --output_dir=$TRAIN_DIR   --decode_beam_size=$BEAM_SIZE   \
    --decode_alpha=$ALPHA   --decode_from_file=/tmp/t2t_datagen/newstest${YEAR}.en

#Tokenize the reference
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < /tmp/t2t_datagen/newstest${YEAR}.de > /tmp/t2t_datagen/newstest${YEAR}.de.tok
#Do compound splitting on the reference
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < /tmp/t2t_datagen/newstest${YEAR}.de.tok > /tmp/t2t_datagen/newstest${YEAR}.de.atat

#Tokenize the translation
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < /tmp/t2t_datagen/newstest${YEAR}.en.transformer.transformer_base.beam5.alpha0.6.decodes > /tmp/t2t_datagen/newstest${YEAR}.en.transformer.transformer_base.beam5.alpha0.6.tok
#Do compound splitting on the translation
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < /tmp/t2t_datagen/newstest${YEAR}.en.transformer.transformer_base.beam5.alpha0.6.tok > /tmp/t2t_datagen/newstest${YEAR}.en.transformer.transformer_base.beam5.alpha0.6.atat
#Score the translation
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl /tmp/t2t_datagen/newstest${YEAR}.de.atat < /tmp/t2t_datagen/newstest${YEAR}.en.transformer.transformer_base.beam5.alpha0.6.atat
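
Since the same long output file name is repeated for every year, a small helper could build it once. The sketch below assumes the naming pattern visible in the file names above (`<input>.<model>.<hparams>.beam<size>.alpha<alpha>.decodes`); that pattern is inferred from this post rather than from the t2t-decoder documentation, and `decodes_path` is a hypothetical name.

```shell
# Hypothetical helper: build the decode output file name from its parts.
# Arguments: input file, model, hparams set, beam size, alpha.
decodes_path() {
  printf '%s.%s.%s.beam%s.alpha%s.decodes\n' "$1" "$2" "$3" "$4" "$5"
}

MODEL=transformer
HPARAMS=transformer_base
BEAM_SIZE=5
ALPHA=0.6

# Print the expected decode file for each test set.
for YEAR in 2013 2014 2015; do
  decodes_path "/tmp/t2t_datagen/newstest${YEAR}.en" "$MODEL" "$HPARAMS" "$BEAM_SIZE" "$ALPHA"
done
```

The tokenizing, compound splitting, and scoring commands could then read from the path this helper returns, so all three years go through exactly the same steps.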

Here are the BLEU scores I've got:

| Model | newstest2013 | newstest2014 | newstest2015 |
|---|---|---|---|
| BPE | 10.81 | 11.31 | 12.75 |
| wordpiece | 22.41 | 22.75 | 25.46 |

These scores are far from the results reported in the paper, so there must be something wrong with the way I ran these experiments. Could you please give me some guidance on how to run this properly so that I can reproduce the paper's results?
