Improve memory efficiency of FP16 optimization (#404)
Summary: Previously when training with --fp16 we stored a full FP32 copy of the model parameters for the optimizer step, which consumed a lot of memory. The alternative is to do the FP16 -> FP32 conversions on the fly, which lets the caching allocator reuse that memory between steps. This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on WMT En-De on 8 GPUs with --update-freq=16. It does not affect convergence: models train exactly as they did before.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/404

Differential Revision: D13394376

Pulled By: myleott

fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf
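Below is a minimal sketch, not the fairseq implementation, contrasting the two approaches for a single parameter with a plain SGD-style update (the function names and learning rate are hypothetical). The point is that the on-the-fly variant only allocates FP32 tensors for the duration of the step, so the caching allocator can recycle that memory, instead of keeping a persistent FP32 master copy alive for the whole run.

```python
import torch


def sgd_step_persistent_copy(p16, p32_master, lr=0.1):
    """Old approach (sketch): p32_master is a long-lived FP32 copy of the param."""
    p32_master.grad = p16.grad.float()            # FP16 grad -> FP32
    p32_master.data.add_(p32_master.grad, alpha=-lr)
    p16.data.copy_(p32_master.data)               # write updated value back to FP16 model


def sgd_step_on_the_fly(p16, lr=0.1):
    """New approach (sketch): FP32 tensors are temporaries, freed after the step."""
    p32 = p16.data.float()                        # temporary FP32 copy of the param
    g32 = p16.grad.float()                        # temporary FP32 copy of the grad
    p32.add_(g32, alpha=-lr)
    p16.data.copy_(p32)                           # p32/g32 are released to the caching allocator


if __name__ == "__main__":
    p16 = torch.nn.Parameter(torch.randn(4).half())
    p16.grad = torch.randn(4).half()
    sgd_step_on_the_fly(p16)
    print(p16.data)
```

In the real optimizer the FP32 optimizer state (e.g. Adam moments) is still kept across steps; only the FP32 copies of the parameters and gradients become temporaries, which is where the ~20% peak-memory saving comes from at the cost of the extra per-step conversions.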