  1. Sep 14, 2016
    • Shiyin Kang · b885535e
    • mv diff log softmax code to CuMatrix · 7a525668
      Shiyin Kang authored
    • Merge pull request #1025 from galv/atomic-add · 13e5cc8f
      Daniel Povey authored
      Replace implementation of atomic addition.
    • Replace implementation of atomic addition. · 6f20b397
      Daniel Galvez authored
      The old version was based on atomicExch(), while this version uses CUDA's
      built-in atomicAdd(), added in SM 2.0. When tested in isolation (test
      code not provided in this commit), the built-in atomicAdd() is two times
      faster than the old atomic_add() here on a K10 (Kepler), and three times
      faster on a 950M (Maxwell).

      The speedup to forward-backward, however, is marginal for an
      nnet3-chain-train call on the TEDLIUM version 1 dataset.

      Times reported on a K10. Note the speedup in BetaDashGeneralFrame(),
      which is the only code calling the atomic add function.
      
      New code:
      
      [cudevice profile]
      AddRows	0.468516s
      AddVecVec	0.553152s
      MulRowsVec	0.614542s
      CuMatrix::SetZero	0.649105s
      CopyRows	0.748831s
      TraceMatMat	0.777907s
      AddVecToRows	0.780592s
      CuMatrix::Resize	0.850884s
      AddMat	1.23867s
      CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.04559s
      AddDiagMatMat	2.18652s
      AddMatVec	3.67839s
      AlphaGeneralFrame	6.42574s
      BetaDashGeneralFrame	8.69981s
      AddMatMat	29.9714s
      Total GPU time:	63.8273s (may involve some double-counting)
      -----
      
      Old code:
      
      [cudevice profile]
      AddRows	0.469031s
      AddVecVec	0.553298s
      MulRowsVec	0.615624s
      CuMatrix::SetZero	0.658105s
      CopyRows	0.750856s
      AddVecToRows	0.782937s
      TraceMatMat	0.786361s
      CuMatrix::Resize	0.91639s
      AddMat	1.23964s
      CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.05253s
      AddDiagMatMat	2.18863s
      AddMatVec	3.68707s
      AlphaGeneralFrame	6.42885s
      BetaDashGeneralFrame	9.03617s
      AddMatMat	29.9942s
      Total GPU time:	64.3928s (may involve some double-counting)
      -----
  2. Sep 13, 2016
  3. Sep 11, 2016
  4. Sep 08, 2016
  5. Sep 07, 2016
  6. Sep 06, 2016
  7. Sep 05, 2016
  8. Sep 02, 2016
  9. Sep 01, 2016
  10. Aug 31, 2016
  11. Aug 30, 2016
    • Merge pull request #1013 from kangshiyin/log-softmax · b2c8497b
      Daniel Povey authored
      Speed up log softmax
    • comment about aliasing in AddMatMatDivMat. · 81e20c4c
      Shiyin Kang authored
    • reimpl log softmax · 4c1a86d8
      Shiyin Kang authored
      Benchmark of CuMatrix::LogSoftmax, new vs. old implementation
      (speeds in gigaflops):

      dim	float (new)	float (old)	double (new)	double (old)
      16	0.0138019	0.0133804	0.0133584	0.011373
      32	0.056202	0.052121	0.0533796	0.0528196
      64	0.227829	0.186255	0.202721	0.170107
      128	0.65638	0.65072	0.627234	0.722198
      256	2.15268	1.64888	1.89987	1.44478
      512	5.1179	3.85136	4.14807	3.37973
      1024	10.8209	6.76963	6.70849	4.96657
    • Merge pull request #1012 from vijayaditya/lstm_config_bugfix · 38d4c2af
      Daniel Povey authored
      nnet3: lstm/make_configs.py : Removed a bug where label_delay was not…
    • nnet3: lstm/make_configs.py : Removed a bug where label_delay was not being added to the xentropy branch in chain models. · f4b4e250
      Vijayaditya Peddinti authored
  12. Aug 29, 2016
  13. Aug 27, 2016
  14. Aug 26, 2016