Replace implementation of atomic addition.
The old version was based on atomicExch(); this version uses CUDA's built-in atomicAdd(), available since SM 2.0. When tested in isolation (test code not included in this commit), the built-in atomicAdd() is two times faster than the old atomic_add() on a K10 (Kepler), and three times faster on a 950M (Maxwell). The speed-up to forward-backward, however, is marginal for an nnet3-chain-train call on the TEDLIUM version 1 dataset; the times below were measured on a K10. Note the speed-up in BetaDashGeneralFrame(), which is the only code calling the atomic add function. (A sketch of the two variants follows the profiles below.)

New code:

[cudevice profile]
AddRows                                              0.468516s
AddVecVec                                            0.553152s
MulRowsVec                                           0.614542s
CuMatrix::SetZero                                    0.649105s
CopyRows                                             0.748831s
TraceMatMat                                          0.777907s
AddVecToRows                                         0.780592s
CuMatrix::Resize                                     0.850884s
AddMat                                               1.23867s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)   2.04559s
AddDiagMatMat                                        2.18652s
AddMatVec                                            3.67839s
AlphaGeneralFrame                                    6.42574s
BetaDashGeneralFrame                                 8.69981s
AddMatMat                                            29.9714s
Total GPU time: 63.8273s (may involve some double-counting)
-----

Old code:

[cudevice profile]
AddRows                                              0.469031s
AddVecVec                                            0.553298s
MulRowsVec                                           0.615624s
CuMatrix::SetZero                                    0.658105s
CopyRows                                             0.750856s
AddVecToRows                                         0.782937s
TraceMatMat                                          0.786361s
CuMatrix::Resize                                     0.91639s
AddMat                                               1.23964s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)   2.05253s
AddDiagMatMat                                        2.18863s
AddMatVec                                            3.68707s
AlphaGeneralFrame                                    6.42885s
BetaDashGeneralFrame                                 9.03617s
AddMatMat                                            29.9942s
Total GPU time: 64.3928s (may involve some double-counting)
-----
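For context, a minimal sketch of the two approaches. The actual kernels in this commit are not reproduced here; the atomicExch()-based emulation below is the standard pattern for emulating a float atomic add on pre-SM-2.0 hardware and is assumed to be only approximately what the old atomic_add() did. Names such as atomic_add_exch, atomic_add_builtin, and the accumulate kernel are illustrative, not from the source.

// Old approach (assumed): emulate float atomic addition with atomicExch().
// The inner atomicExch drains the accumulator (leaving 0 behind); the outer
// one tries to deposit the sum. If another thread slipped a nonzero value in
// between, the outer exchange returns it and the loop retries with that value.
__device__ inline void atomic_add_exch(float *address, float value) {
  float old = value;
  while ((old = atomicExch(address, atomicExch(address, 0.0f) + old)) != 0.0f)
    ;
}

// New approach: the hardware float atomicAdd(), available since compute
// capability 2.0, needs no emulation loop.
__device__ inline void atomic_add_builtin(float *address, float value) {
  atomicAdd(address, value);
}

// Hypothetical usage: many threads accumulating into a single float.
__global__ void accumulate(float *sum, const float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomic_add_builtin(sum, data[i]);
}

The emulation loop serializes contending threads through repeated exchanges, which is why the built-in instruction measures roughly 2x faster on Kepler and 3x faster on Maxwell in the isolated test mentioned above.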