Commit a2901f98 authored by Yongqiang Wang, committed by Facebook Github Bot

Add an option to raise an exception if OOM happens during fairseq.trainer.train_step (#2)

Summary:
Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/2

Pull Request resolved: https://github.com/pytorch/fairseq/pull/689

We found that not raising OOM during trainer.train_step causes various
issues, including NCCL hangs and gloo sync errors, because gradients are not
synced properly across workers. Until we find the root cause, let's give users
an option to raise OOMs.
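
A minimal sketch of the behavior described above, not the actual fairseq
change. It assumes an OOM surfaces as a RuntimeError whose message contains
"out of memory", and the names train_step, forward_backward, raise_oom and
fake_forward_backward are hypothetical stand-ins chosen for illustration.

```python
def train_step(samples, forward_backward, raise_oom=False):
    """Run forward/backward per sample; on OOM either skip or re-raise.

    Returns (samples processed, samples skipped because of OOM).
    All names here are illustrative assumptions, not the fairseq API.
    """
    processed, ooms = 0, 0
    for sample in samples:
        try:
            forward_backward(sample)
            processed += 1
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: always propagate
            if raise_oom:
                raise  # opt-in: surface the OOM to the caller
            ooms += 1  # default: skip this sample and keep going
    return processed, ooms


def fake_forward_backward(sample):
    """Stand-in for a real forward/backward pass that simulates an OOM."""
    if sample == "too_big":
        raise RuntimeError("CUDA out of memory (simulated)")


# Default: the OOM batch is silently skipped on this worker.
print(train_step(["a", "too_big", "b"], fake_forward_backward))

# With raise_oom=True the same input fails loudly instead.
try:
    train_step(["a", "too_big", "b"], fake_forward_backward, raise_oom=True)
except RuntimeError as e:
    print("raised:", e)
```

In a multi-worker run, silently skipping on one worker means that worker takes
part in gradient synchronization with different work than its peers, which is
the situation the option is meant to make visible by failing fast.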

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4
parent f5fbcaaf