Support BPE end of word marker suffix in fairseq noising module
Summary: There are 2 ways to implement BPE: 1. use a continuation marker suffix to indicate that there is at least one more subtoken left in the word 2. use a end of word marker suffix to indicate that there is no more subtokens left in the word This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data. Reviewed By: xianxl Differential Revision: D12919428 fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a
Loading
Please register or sign in to comment