An option to raise an exception if OOM happens during fairseq.trainer.train_step (#2)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/2
Pull Request resolved: https://github.com/pytorch/fairseq/pull/689

We found that not raising OOMs during trainer.train_step causes various issues, including NCCL hangs and Gloo sync errors, because gradients are not synced properly across workers. Until we find the root cause, this gives users an option to raise OOMs instead of silently skipping the batch.

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4
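For context, a minimal sketch of the pattern this change describes: catching a CUDA OOM inside the training step and either skipping the batch (the old behavior) or re-raising when the new option is set. The `raise_oom` flag and the `_forward_backward` helper are illustrative names, not the exact fairseq implementation.

```python
import torch

def train_step(self, samples, raise_oom=False):
    """Run one training step; optionally re-raise CUDA OOMs.

    Sketch only: when raise_oom is False, an out-of-memory error is
    logged and the batch is skipped; when True, it propagates to the
    caller so the job fails fast instead of hanging.
    """
    try:
        # hypothetical helper standing in for the real forward/backward pass
        loss = self._forward_backward(samples)
    except RuntimeError as e:
        if "out of memory" in str(e) and not raise_oom:
            # Skipping the batch keeps training alive, but if only some
            # workers skip, the gradient all-reduce desynchronizes --
            # the NCCL hangs / Gloo sync errors mentioned above. That is
            # why raising is offered as an option.
            print("| WARNING: ran out of memory, skipping batch")
            torch.cuda.empty_cache()
            return None
        raise  # propagate the OOM (or any other error) to the caller
    return loss
```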