Fix another distributed syncing issue
Summary:
This is another failure caused by distributed GPUs getting out of sync. We currently run save_and_eval (which contains the inter-GPU communication calls) based on the number of updates. But "number of updates" means weight updates: whenever training hits a problem and the weights cannot be updated, the counters diverge across nodes, the nodes go out of sync, and they start failing. We should check the number of iterations instead. I am, again, making a small change to save the day, but we should decouple/refactor the save_and_eval logic from the training loop to have fewer headaches in the future; I plan to work on that later. This should resolve some of these issues for now.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
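A minimal sketch of the idea, using hypothetical names (should_save_and_eval, trainer.train_step, trainer.save_and_eval, save_interval) that are not taken from the actual codebase; it only illustrates why gating on iterations keeps ranks aligned when some updates are skipped:

```python
def should_save_and_eval(num_iterations, save_interval):
    """Gate on the iteration count, which advances identically on every rank,
    instead of the update (weight-change) count, which can stall on some ranks
    when an update is skipped and then desynchronize the collective calls."""
    return num_iterations > 0 and num_iterations % save_interval == 0


def train_loop(trainer, data_loader, save_interval):
    num_iterations = 0
    for batch in data_loader:
        # A train step may skip the weight update (e.g. bad gradients), so an
        # update counter could lag behind on some ranks.
        trainer.train_step(batch)
        num_iterations += 1  # advances on every rank, updated or not

        if should_save_and_eval(num_iterations, save_interval):
            # save_and_eval performs inter-GPU communication, so every rank
            # must reach this point the same number of times.
            trainer.save_and_eval()
```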