Skip to content
Commit 6c006a34 authored by Halil Akin's avatar Halil Akin Committed by Facebook Github Bot
Browse files

Take a dummy train step under OOM to keep multiprocessing in sync

Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.

Reviewed By: myleott

Differential Revision: D13086018

fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
parent ccd22212
Loading
Loading
Loading
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment