Refactor BERTDataset to the more general MaskedLMDataset
Summary: The current BERTDataset contains most of the components needed for generic MaskedLM training, but it is too restrictive in the assumptions it makes: exactly two blocks are masked, and specific special tokens are hard-coded for the sentence embedding and the separator. This diff refactors the dataset and, at the same time, makes some of its parameters configurable, including the probabilities associated with masking.

Reviewed By: rutyrinott

Differential Revision: D14222467

fbshipit-source-id: e9f78788dfe7f56646ba09c62967c4c0bd30aed8
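The configurable masking probabilities mentioned above can be sketched as follows. This is an illustrative standalone function, not the actual dataset code from this diff; the function name, parameter names, and defaults are assumptions, with the defaults chosen to match the standard BERT 15% / 80-10-10 masking scheme.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None,
                mask_prob=0.15, random_replace_prob=0.1, keep_prob=0.1,
                rng=None):
    """Select tokens for MaskedLM prediction, with configurable probabilities.

    Each token is chosen as a prediction target with probability
    ``mask_prob``. A chosen token is replaced by a random vocabulary token
    with probability ``random_replace_prob``, kept unchanged with
    probability ``keep_prob``, and replaced by ``mask_token`` otherwise.
    """
    rng = rng or random.Random()
    vocab = vocab or tokens
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must predict the original token here
            r = rng.random()
            if r < random_replace_prob:
                masked.append(rng.choice(vocab))  # random replacement
            elif r < random_replace_prob + keep_prob:
                masked.append(tok)  # keep the original token
            else:
                masked.append(mask_token)  # standard [MASK] substitution
        else:
            labels.append(None)  # not a prediction target
            masked.append(tok)
    return masked, labels
```

Making `mask_prob`, `random_replace_prob`, and `keep_prob` arguments (rather than hard-coded constants) is the kind of configurability the summary describes.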