Commit 6f20b397 authored by Daniel Galvez

Replace implementation of atomic addition.

The old version was based on atomicExch(), while this version uses CUDA's
built-in atomicAdd(), which is available for float since SM 2.0. When
tested in isolation (test code not provided in this commit), the built-in
atomicAdd() is two times faster than the old version of atomic_add() here
on a K10 (Kepler), and three times faster on a 950M (Maxwell).

The speedup in forward-backward, however, is marginal for an
nnet3-chain-train call on the TEDLIUM version 1 dataset.

Times below were measured on a K10. Note the speedup in
BetaDashGeneralFrame() (9.04s to 8.70s, about 3.7%), which is the only
code calling the atomic add function. Total GPU time drops by about 0.9%.

New code:

[cudevice profile]
AddRows	0.468516s
AddVecVec	0.553152s
MulRowsVec	0.614542s
CuMatrix::SetZero	0.649105s
CopyRows	0.748831s
TraceMatMat	0.777907s
AddVecToRows	0.780592s
CuMatrix::Resize	0.850884s
AddMat	1.23867s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.04559s
AddDiagMatMat	2.18652s
AddMatVec	3.67839s
AlphaGeneralFrame	6.42574s
BetaDashGeneralFrame	8.69981s
AddMatMat	29.9714s
Total GPU time:	63.8273s (may involve some double-counting)
-----

Old code:

[cudevice profile]
AddRows	0.469031s
AddVecVec	0.553298s
MulRowsVec	0.615624s
CuMatrix::SetZero	0.658105s
CopyRows	0.750856s
AddVecToRows	0.782937s
TraceMatMat	0.786361s
CuMatrix::Resize	0.91639s
AddMat	1.23964s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.05253s
AddDiagMatMat	2.18863s
AddMatVec	3.68707s
AlphaGeneralFrame	6.42885s
BetaDashGeneralFrame	9.03617s
AddMatMat	29.9942s
Total GPU time:	64.3928s (may involve some double-counting)
-----
parent eb49517c
@@ -20,17 +20,9 @@
 #include <cfloat>
 #include "chain/chain-kernels-ansi.h"
 template <typename Real>
 __device__ inline void atomic_add(Real* address, Real value) {
-  Real old = value;
-  Real ret = atomicExch(address, 0.0f);
-  Real new_old = ret + old;
-  while ((old = atomicExch(address, new_old)) != 0.0f) {
-    new_old = atomicExch(address, 0.0f);
-    new_old += old;
-  }
+  atomicAdd(address, value);
 }
 template<>
......
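The commit message notes that the isolated test code was not included. As a rough illustration only (this is not the author's benchmark), the sketch below runs the old atomicExch()-based loop and the built-in atomicAdd() on a single float accumulator and checks that both produce the same sum. The names atomic_add_exch, SumOnes and RunSum are hypothetical, only the float case is covered, and the timings it prints will depend on the GPU and the contention pattern.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Old approach from the hunk above: emulate an atomic add with atomicExch().
// (Hypothetical name; the original function is called atomic_add.)
__device__ inline void atomic_add_exch(float* address, float value) {
  float old = value;
  float ret = atomicExch(address, 0.0f);   // take the current total out
  float new_old = ret + old;               // add our contribution
  // Put the sum back; if we displaced a non-zero value, another thread
  // wrote in the meantime, so take the total out again and fold it in.
  while ((old = atomicExch(address, new_old)) != 0.0f) {
    new_old = atomicExch(address, 0.0f);
    new_old += old;
  }
}

// Every thread adds 1.0f to one shared accumulator (worst-case contention).
template <bool kUseBuiltin>
__global__ void SumOnes(float* total, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    if (kUseBuiltin) {
      atomicAdd(total, 1.0f);
    } else {
      atomic_add_exch(total, 1.0f);
    }
  }
}

// Launches the kernel, returns the accumulated sum, and reports kernel time.
template <bool kUseBuiltin>
float RunSum(int n, float* elapsed_ms) {
  float* d_total = nullptr;
  cudaError_t err = cudaMalloc(&d_total, sizeof(float));
  assert(err == cudaSuccess);
  cudaMemset(d_total, 0, sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  cudaEventRecord(start);
  SumOnes<kUseBuiltin><<<blocks, threads>>>(d_total, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  assert(cudaGetLastError() == cudaSuccess);
  cudaEventElapsedTime(elapsed_ms, start, stop);

  float h_total = 0.0f;
  cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_total);
  return h_total;
}

int main() {
  const int n = 1 << 20;  // below 2^24, so the float sum of ones is exact
  float ms_builtin = 0.0f, ms_exch = 0.0f;
  float sum_builtin = RunSum<true>(n, &ms_builtin);
  float sum_exch = RunSum<false>(n, &ms_exch);
  assert(sum_builtin == static_cast<float>(n));
  assert(sum_exch == static_cast<float>(n));
  printf("atomicAdd: %.3f ms, atomicExch loop: %.3f ms\n", ms_builtin, ms_exch);
  return 0;
}
```

Assuming the file is saved as atomic_add_test.cu (a hypothetical name), it can be built with `nvcc -o atomic_add_test atomic_add_test.cu`. A single shared accumulator is the most contended case, so the ratio it reports need not match the K10 and 950M figures quoted in the commit message.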