Fix another distributed syncing issue
Summary:
This is another failure caused by distributed GPUs getting out of sync. We currently run save_and_eval (which contains the inter-GPU communication calls) based on the number of updates. But "number of updates" means weight updates: whenever training hits a problem and the weights cannot be updated, the counters diverge across nodes, the nodes go out of sync, and they start failing. We should check the number of iterations instead. I am, again, making a small change to save the day, but we should decouple/refactor the save_and_eval logic from the training loop to have fewer headaches in the future; I plan to work on that later. This should resolve some of these issues for now.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
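A minimal sketch of the idea, using hypothetical names (should_save_and_eval, trainer.train_step, trainer.save_and_eval, save_interval) that are not taken from the actual codebase; it only illustrates why gating on iterations keeps ranks aligned when some updates are skipped:

```python
def should_save_and_eval(num_iterations, save_interval):
    """Gate on the iteration count, which advances identically on every rank,
    instead of the update (weight-change) count, which can stall on some ranks
    when an update is skipped and then desynchronize the collective calls."""
    return num_iterations > 0 and num_iterations % save_interval == 0


def train_loop(trainer, data_loader, save_interval):
    num_iterations = 0
    for batch in data_loader:
        # A train step may skip the weight update (e.g. bad gradients), so an
        # update counter could lag behind on some ranks.
        trainer.train_step(batch)
        num_iterations += 1  # advances on every rank, updated or not

        if should_save_and_eval(num_iterations, save_interval):
            # save_and_eval performs inter-GPU communication, so every rank
            # must reach this point the same number of times.
            trainer.save_and_eval()
```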