Fix all-reduce for new versions of PyTorch
We previously assumed that once a model parameter's gradient buffer was allocated, it stayed fixed for the rest of training. Recent versions of PyTorch violate this assumption: the gradient buffer may be reallocated during training, so it is no longer safe to hold onto the old buffer. This matters primarily for the all-reduce, since we all-reduce a flattened (i.e., contiguous) copy of the gradients rather than the buffers themselves. We can make this robust by copying the result of the all-reduce back into each parameter's current gradient buffer after each update. Intra-device copies are cheap, so this doesn't affect performance.
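A minimal sketch of the idea, not the repo's actual code (the function name `all_reduce_grads` and averaging by `world_size` are assumptions for illustration): the gradients are re-read from `.grad` each step, all-reduced as one flattened copy, and the result is copied back into whatever buffer each parameter currently holds.

```python
import torch
import torch.distributed as dist

def all_reduce_grads(model: torch.nn.Module, world_size: int) -> None:
    # Re-read the current .grad tensors each time; the buffers may have
    # been reallocated since the previous step.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if not grads:
        return

    # All-reduce a single flattened (contiguous) copy of the gradients.
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat.div_(world_size)  # average across workers (assumed convention)

    # Copy the reduced values back into each parameter's current gradient
    # buffer instead of assuming the flattened copy aliases it.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n
```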