Commit a2901f98, authored 5 years ago by Yongqiang Wang, committed by Facebook Github Bot 5 years ago

Add an option to raise an exception if OOM happens during fairseq.trainer.train_step (#2)

Summary:
Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/2

Pull Request resolved: https://github.com/pytorch/fairseq/pull/689

We found that not raising OOM errors during trainer.train_step causes various
issues, including NCCL hangs and gloo sync errors, because gradients are not
synced properly. Until we find the root cause, let's give users an option to
raise OOMs.
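
For illustration only, the behavior described above could look roughly like
the sketch below. The diff itself is not reproduced here, so the flag name
`raise_oom`, the simplified `train_step` signature, and the helper
`fake_forward_backward` are all assumptions rather than the actual fairseq
code. The sketch uses no PyTorch and simulates the OOM with a plain
RuntimeError.

```python
def train_step(batches, forward_backward, raise_oom=False):
    """Run forward/backward over batches, handling OOM according to the flag.

    Hypothetical, simplified stand-in for a trainer step.
    Returns (number of batches processed, number of batches skipped for OOM).
    """
    processed, skipped = 0, 0
    for batch in batches:
        try:
            forward_backward(batch)
            processed += 1
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: always propagate
            if raise_oom:
                # Fail loudly instead of continuing with a skipped batch,
                # which may leave workers with gradients that were not synced.
                raise
            # Default behavior: swallow the OOM and skip this batch.
            skipped += 1
    return processed, skipped


def fake_forward_backward(batch):
    # Hypothetical model step: simulates an OOM for "large" batches.
    if batch > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
```

With the flag left at its default, the oversized batch is skipped silently.
With `raise_oom=True`, the same batch aborts the step, so the caller sees the
error instead of a later hang.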

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4

parent f5fbcaaf

Showing 14 additions and 2 deletions
