Replace implementation of atomic addition.
The old version was based on atomicExch(); this version uses CUDA's built-in atomicAdd(), available since SM 2.0. When tested in isolation (test code not included in this commit), the built-in atomicAdd() is two times faster than the old atomic_add() on a K10 (Kepler), and three times faster on a 950M (Maxwell). The speed-up to forward-backward, however, is marginal for an nnet3-chain-train call on the TEDLIUM version 1 dataset; the times below were measured on a K10. Note the speed-up in BetaDashGeneralFrame(), which is the only code calling the atomic add function. (A sketch of the two variants follows the profiles below.)

New code:

[cudevice profile]
AddRows                                              0.468516s
AddVecVec                                            0.553152s
MulRowsVec                                           0.614542s
CuMatrix::SetZero                                    0.649105s
CopyRows                                             0.748831s
TraceMatMat                                          0.777907s
AddVecToRows                                         0.780592s
CuMatrix::Resize                                     0.850884s
AddMat                                               1.23867s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)   2.04559s
AddDiagMatMat                                        2.18652s
AddMatVec                                            3.67839s
AlphaGeneralFrame                                    6.42574s
BetaDashGeneralFrame                                 8.69981s
AddMatMat                                            29.9714s
Total GPU time: 63.8273s (may involve some double-counting)
-----

Old code:

[cudevice profile]
AddRows                                              0.469031s
AddVecVec                                            0.553298s
MulRowsVec                                           0.615624s
CuMatrix::SetZero                                    0.658105s
CopyRows                                             0.750856s
AddVecToRows                                         0.782937s
TraceMatMat                                          0.786361s
CuMatrix::Resize                                     0.91639s
AddMat                                               1.23964s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)   2.05253s
AddDiagMatMat                                        2.18863s
AddMatVec                                            3.68707s
AlphaGeneralFrame                                    6.42885s
BetaDashGeneralFrame                                 9.03617s
AddMatMat                                            29.9942s
Total GPU time: 64.3928s (may involve some double-counting)
-----
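For context, a minimal sketch of the two approaches. The actual kernels in this commit are not reproduced here; the atomicExch()-based emulation below is the standard pattern for emulating a float atomic add on pre-SM-2.0 hardware and is assumed to be only approximately what the old atomic_add() did. Names such as atomic_add_exch, atomic_add_builtin, and the accumulate kernel are illustrative, not from the source.

// Old approach (assumed): emulate float atomic addition with atomicExch().
// The inner atomicExch drains the accumulator (leaving 0 behind); the outer
// one tries to deposit the sum. If another thread slipped a nonzero value in
// between, the outer exchange returns it and the loop retries with that value.
__device__ inline void atomic_add_exch(float *address, float value) {
  float old = value;
  while ((old = atomicExch(address, atomicExch(address, 0.0f) + old)) != 0.0f)
    ;
}

// New approach: the hardware float atomicAdd(), available since compute
// capability 2.0, needs no emulation loop.
__device__ inline void atomic_add_builtin(float *address, float value) {
  atomicAdd(address, value);
}

// Hypothetical usage: many threads accumulating into a single float.
__global__ void accumulate(float *sum, const float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomic_add_builtin(sum, data[i]);
}

The emulation loop serializes contending threads through repeated exchanges, which is why the built-in instruction measures roughly 2x faster on Kepler and 3x faster on Maxwell in the isolated test mentioned above.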