Option to remove EOS at source in backtranslation dataset
Summary:
If we want our parallel data to have EOS at the end of source, we keep the EOS at the end of the generated source dialect backtranslation. If we don't want EOS at the end of source, we **remove** it from the end of the generated source dialect backtranslation.

Note: we always want EOS at the end of the target / reference in parallel data so the model can learn where a sentence ends and generate sentences of arbitrary length. So we make sure that the original target has an EOS before returning a batch of {generated src, original tgt}. If the original targets in the tgt dataset don't have an EOS, we append EOS to each tgt sample before collating.

We only do this for the purpose of collating a {generated src, original tgt} batch AFTER generating the backtranslations. We don't enforce any EOS before passing tgt to the tgt->src model for generating the backtranslation. Users of this dataset are expected to format tgt dataset examples in the format that the tgt->src model expects.

Reviewed By: jmp84

Differential Revision: D10157725

fbshipit-source-id: eb6a15f13c651f7c435b8db28103c9a8189845fb
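
For illustration only, the EOS handling described in the summary can be sketched roughly as below. This is not the code from this diff: `prepare_pair`, `remove_eos_at_src`, and the `EOS` index value are hypothetical names chosen for the example, and plain Python lists stand in for token tensors.

```python
from typing import List, Tuple

# Hypothetical EOS index; in practice it would come from the dictionary in use.
EOS = 2


def prepare_pair(
    generated_src: List[int],
    original_tgt: List[int],
    remove_eos_at_src: bool,
    eos: int = EOS,
) -> Tuple[List[int], List[int]]:
    """Format one {generated src, original tgt} pair before collating."""
    src = list(generated_src)
    # Optionally strip the trailing EOS from the generated backtranslation.
    if remove_eos_at_src and src and src[-1] == eos:
        src = src[:-1]

    tgt = list(original_tgt)
    # The target must always end with EOS; append it only if it is missing.
    if not tgt or tgt[-1] != eos:
        tgt.append(eos)

    return src, tgt
```

Note that this step would run only after the backtranslation has been generated; the tgt sample passed to the tgt->src model is left untouched.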