Skip to content
Commit 23e9dc2e authored by Halil Akin's avatar Halil Akin Committed by Facebook Github Bot
Browse files

Fix another distributed syncing issue

Summary:
This is another failure due to distributed GPU's getting out of sync.
We are running save_and_eval (which has the inter-gpu communication calls) by
looking at number of updates. But number of updates means weight updates. Whenever
there is an issue in the training and weights can't be updated, nodes go
out of sync and nodes start failing. So we should check number of iterations instead.

I am, again, making a small change to save the day, but we should decouple/refactor
save_and_eval logic from the training, to have less headache in future.
Planning, working on that in future. But this should solve some of the
issues for now.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
parent 8441cbf3
Loading
Loading
Loading
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment