    Replace implementation of atomic addition. · 6f20b397
    Daniel Galvez authored
    The old version was based on atomicExch(); this version uses CUDA's
    built-in atomicAdd(), available since SM 2.0. When tested in isolation
    (test code not included in this commit), the built-in atomicAdd() is
    twice as fast as the old atomic_add() here on a K10 (Kepler), and
    three times faster on a 950M (Maxwell).

    The speedup to forward-backward, however, is marginal for an
    nnet3-chain-train call on the TEDLIUM version 1 dataset:
    
    Times reported on a K10. Note the speedup in BetaDashGeneralFrame(),
    which is the only code that calls the atomic add function.
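    For reference, the exchange-based emulation being replaced works by
    swapping the target slot with 0, adding locally, and swapping the sum
    back, retrying if another thread deposited a value in between. The
    commit does not include the old kernel source, so the sketch below is a
    hedged CPU analogue using std::atomic<float>::exchange in place of
    CUDA's atomicExch(); all names here are illustrative.

    ```cpp
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Exchange-based atomic add: the slot's value plus every thread's
    // in-flight local value always sums to the true total, so no addend
    // is ever lost. 0.0f acts as the "slot is free" sentinel.
    static void exchange_based_add(std::atomic<float>* address, float value) {
      float old = value;
      // Take whatever is in the slot (leaving 0) and fold in our value.
      float new_old = address->exchange(0.0f) + old;
      // Deposit the sum; if another thread slipped a nonzero value into
      // the slot meanwhile, absorb it and retry the deposit.
      while ((old = address->exchange(new_old)) != 0.0f) {
        new_old = address->exchange(0.0f) + old;
      }
    }

    // Hammer one accumulator from several threads; 8000 additions of 1.0f
    // are exactly representable in float, so the result is deterministic.
    float run_demo() {
      std::atomic<float> sum{0.0f};
      const int kThreads = 8, kAddsPerThread = 1000;
      std::vector<std::thread> workers;
      for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&sum] {
          for (int i = 0; i < kAddsPerThread; ++i)
            exchange_based_add(&sum, 1.0f);
        });
      }
      for (auto& w : workers) w.join();
      return sum.load();
    }

    int main() {
      std::printf("%g\n", run_demo());
      return 0;
    }
    ```

    The built-in atomicAdd() replaces this whole retry loop with a single
    hardware instruction, which is where the isolated 2-3x win comes from.
    
    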
    
    New code:
    
    [cudevice profile]
    AddRows	0.468516s
    AddVecVec	0.553152s
    MulRowsVec	0.614542s
    CuMatrix::SetZero	0.649105s
    CopyRows	0.748831s
    TraceMatMat	0.777907s
    AddVecToRows	0.780592s
    CuMatrix::Resize	0.850884s
    AddMat	1.23867s
    CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.04559s
    AddDiagMatMat	2.18652s
    AddMatVec	3.67839s
    AlphaGeneralFrame	6.42574s
    BetaDashGeneralFrame	8.69981s
    AddMatMat	29.9714s
    Total GPU time:	63.8273s (may involve some double-counting)
    -----
    
    Old code:
    
    [cudevice profile]
    AddRows	0.469031s
    AddVecVec	0.553298s
    MulRowsVec	0.615624s
    CuMatrix::SetZero	0.658105s
    CopyRows	0.750856s
    AddVecToRows	0.782937s
    TraceMatMat	0.786361s
    CuMatrix::Resize	0.91639s
    AddMat	1.23964s
    CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.05253s
    AddDiagMatMat	2.18863s
    AddMatVec	3.68707s
    AlphaGeneralFrame	6.42885s
    BetaDashGeneralFrame	9.03617s
    AddMatMat	29.9942s
    Total GPU time:	64.3928s (may involve some double-counting)
    -----