An option to raise an exception if OOM happens during fairseq.trainer.train_step (#2)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/2
Pull Request resolved: https://github.com/pytorch/fairseq/pull/689

We found that not raising OOMs during trainer.train_step causes various issues, including NCCL hangs and Gloo sync errors, because gradients are not synced properly across workers. Until we find the root cause, this gives users an option to raise OOMs instead of silently skipping the batch.

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4
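For context, a minimal sketch of the pattern this change describes: catching a CUDA OOM inside the training step and either skipping the batch (the old behavior) or re-raising when the new option is set. The `raise_oom` flag and the `_forward_backward` helper are illustrative names, not the exact fairseq implementation.

```python
import torch

def train_step(self, samples, raise_oom=False):
    """Run one training step; optionally re-raise CUDA OOMs.

    Sketch only: when raise_oom is False, an out-of-memory error is
    logged and the batch is skipped; when True, it propagates to the
    caller so the job fails fast instead of hanging.
    """
    try:
        # hypothetical helper standing in for the real forward/backward pass
        loss = self._forward_backward(samples)
    except RuntimeError as e:
        if "out of memory" in str(e) and not raise_oom:
            # Skipping the batch keeps training alive, but if only some
            # workers skip, the gradient all-reduce desynchronizes --
            # the NCCL hangs / Gloo sync errors mentioned above. That is
            # why raising is offered as an option.
            print("| WARNING: ran out of memory, skipping batch")
            torch.cuda.empty_cache()
            return None
        raise  # propagate the OOM (or any other error) to the caller
    return loss
```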