Take a dummy train step under OOM to keep multiprocessing in sync (6c006a34) · Commits · Simon Will / fairseq

Commit 6c006a34, authored Dec 06, 2018 by Halil Akin; committed by Facebook Github Bot on Dec 06, 2018

Take a dummy train step under OOM to keep multiprocessing in sync

Summary: This is not a guaranteed fix, since processes can still get out of sync if the OOM happens after an all_gather/all_reduce has already been done. It should still make multiprocessing training more robust in practice, because OOMs usually seem to happen early enough, before any collective call in the step.
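
The idea in the summary can be illustrated with a small, torch-free simulation. This is only a sketch under stated assumptions, not the code from this commit: `Worker`, `forward_backward`, `all_reduce`, `dummy_batch`, and `collective_calls` are hypothetical names, and the "OOM" is simulated by a batch-size limit. It relies on the fact that PyTorch reports CUDA OOM as a `RuntimeError` whose message contains "out of memory". The worker that hits OOM runs a small dummy batch and contributes zero gradient and zero weight, so it still makes the same number of collective calls as its peers.

```python
# Torch-free sketch; all names here are hypothetical, not fairseq's API.

class Worker:
    def __init__(self, max_batch_size, dummy_batch):
        self.max_batch_size = max_batch_size  # stand-in for the GPU memory limit
        self.dummy_batch = dummy_batch        # small batch assumed to always fit
        self.collective_calls = 0             # collectives this worker has joined

    def forward_backward(self, batch):
        # Simulate a CUDA OOM: a RuntimeError mentioning "out of memory".
        if len(batch) > self.max_batch_size:
            raise RuntimeError("CUDA out of memory (simulated)")
        return float(sum(batch))  # stand-in for a gradient

    def all_reduce(self, value):
        # Stand-in for a collective call; here we only count participation.
        self.collective_calls += 1
        return value

    def train_step(self, batch):
        try:
            grad, weight = self.forward_backward(batch), 1.0
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated errors are not swallowed
            # OOM happened before any collective: take a dummy step and
            # contribute nothing, so peers are not left waiting.
            self.forward_backward(self.dummy_batch)
            grad, weight = 0.0, 0.0
        return self.all_reduce(grad), weight


# Usage: one worker fits the batch, the other "OOMs" on it.
w_ok = Worker(max_batch_size=4, dummy_batch=[1])
w_oom = Worker(max_batch_size=2, dummy_batch=[1])
result_ok = w_ok.train_step([1, 2, 3])
result_oom = w_oom.train_step([1, 2, 3])
```

Both workers end the step having joined exactly one collective. As the summary notes, this does not help if the OOM is raised after a collective has already been entered in the same step.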

Reviewed By: myleott

Differential Revision: D13086018

fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee

parent ccd22212
