    Replace implementation of atomic addition. · 6f20b397
    Daniel Galvez authored
    The old version was based on atomicExch(); this version uses CUDA's
    built-in atomicAdd(), available since SM 2.0. When tested in isolation
    (test code not included in this commit), the built-in atomicAdd() is
    twice as fast as the old atomic_add() here on a K10 (Kepler), and
    three times faster on a 950M (Maxwell).

    The speedup to forward-backward, however, is marginal for an
    nnet3-chain-train call on the TEDLIUM version 1 dataset:
    
    Times reported on a K10. Note the speedup in BetaDashGeneralFrame(),
    which is the only code that calls the atomic add function.
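    For reference, the exchange-based emulation being replaced works by
    swapping the target slot with 0, adding locally, and swapping the sum
    back, retrying if another thread deposited a value in between. The
    commit does not include the old kernel source, so the sketch below is a
    hedged CPU analogue using std::atomic<float>::exchange in place of
    CUDA's atomicExch(); all names here are illustrative.

    ```cpp
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Exchange-based atomic add: the slot's value plus every thread's
    // in-flight local value always sums to the true total, so no addend
    // is ever lost. 0.0f acts as the "slot is free" sentinel.
    static void exchange_based_add(std::atomic<float>* address, float value) {
      float old = value;
      // Take whatever is in the slot (leaving 0) and fold in our value.
      float new_old = address->exchange(0.0f) + old;
      // Deposit the sum; if another thread slipped a nonzero value into
      // the slot meanwhile, absorb it and retry the deposit.
      while ((old = address->exchange(new_old)) != 0.0f) {
        new_old = address->exchange(0.0f) + old;
      }
    }

    // Hammer one accumulator from several threads; 8000 additions of 1.0f
    // are exactly representable in float, so the result is deterministic.
    float run_demo() {
      std::atomic<float> sum{0.0f};
      const int kThreads = 8, kAddsPerThread = 1000;
      std::vector<std::thread> workers;
      for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&sum] {
          for (int i = 0; i < kAddsPerThread; ++i)
            exchange_based_add(&sum, 1.0f);
        });
      }
      for (auto& w : workers) w.join();
      return sum.load();
    }

    int main() {
      std::printf("%g\n", run_demo());
      return 0;
    }
    ```

    The built-in atomicAdd() replaces this whole retry loop with a single
    hardware instruction, which is where the isolated 2-3x win comes from.
    
    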
    
    New code:
    
    [cudevice profile]
    AddRows	0.468516s
    AddVecVec	0.553152s
    MulRowsVec	0.614542s
    CuMatrix::SetZero	0.649105s
    CopyRows	0.748831s
    TraceMatMat	0.777907s
    AddVecToRows	0.780592s
    CuMatrix::Resize	0.850884s
    AddMat	1.23867s
    CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.04559s
    AddDiagMatMat	2.18652s
    AddMatVec	3.67839s
    AlphaGeneralFrame	6.42574s
    BetaDashGeneralFrame	8.69981s
    AddMatMat	29.9714s
    Total GPU time:	63.8273s (may involve some double-counting)
    -----
    
    Old code:
    
    [cudevice profile]
    AddRows	0.469031s
    AddVecVec	0.553298s
    MulRowsVec	0.615624s
    CuMatrix::SetZero	0.658105s
    CopyRows	0.750856s
    AddVecToRows	0.782937s
    TraceMatMat	0.786361s
    CuMatrix::Resize	0.91639s
    AddMat	1.23964s
    CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.05253s
    AddDiagMatMat	2.18863s
    AddMatVec	3.68707s
    AlphaGeneralFrame	6.42885s
    BetaDashGeneralFrame	9.03617s
    AddMatMat	29.9942s
    Total GPU time:	64.3928s (may involve some double-counting)
    -----