  1. Sep 20, 2016
  2. Sep 14, 2016
    • single-kernel impl for diff log softmax · bc79ed49
      Shiyin Kang authored
      bench result:
      CuMatrix::DiffLogSoftmaxPerRow speed in gigaflops, old vs. new:

      float:
        dim   old         new
        16    0.00217375  0.0152883
        32    0.00867094  0.0577221
        64    0.035306    0.267811
        128   0.134737    0.878541
        256   0.491975    2.8799
        512   1.34159     6.20522
        1024  2.4438      10.4197
        2048  2.97796     10.5138
        4096  3.25972     10.3679

      double:
        dim   old         new
        16    0.00193458  0.0139596
        32    0.0073193   0.0573372
        64    0.0282332   0.197072
        128   0.111315    0.751801
        256   0.394491    2.43203
        512   0.930698    4.53031
        1024  1.52317     5.43358
        2048  1.84648     5.47013
        4096  1.87967     5.23873
      
      Conflicts:
      	src/cudamatrix/cu-kernels-ansi.h
      	src/cudamatrix/cu-kernels.h
      
      naming of diff log softmax
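The per-row computation the single kernel fuses can be sketched on the CPU as follows (an illustrative analogue, not the Kaldi kernel itself; names and types are made up): for y = LogSoftmax(x) and output derivative dy, the input derivative is dx_i = dy_i - exp(y_i) * sum_j dy_j, so each row needs only one sum reduction plus an elementwise pass, which is why a single fused kernel pays off.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Backprop through log-softmax for one row:
//   in_deriv[i] = out_deriv[i] - exp(out[i]) * sum_j(out_deriv[j])
// where `out` is the log-softmax output and `out_deriv` is dL/d(out).
std::vector<float> diff_log_softmax_row(const std::vector<float>& out,
                                        const std::vector<float>& out_deriv) {
  float sum = 0.0f;
  for (float d : out_deriv) sum += d;  // single reduction per row
  std::vector<float> in_deriv(out.size());
  for (std::size_t i = 0; i < out.size(); ++i)
    in_deriv[i] = out_deriv[i] - std::exp(out[i]) * sum;
  return in_deriv;
}
```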
    • b885535e
    • mv diff log softmax code to CuMatrix · 7a525668
      Shiyin Kang authored
    • Replace implementation of atomic addition. · 6f20b397
      Daniel Galvez authored
      The old version was based on atomicExch(), while this version uses
      CUDA's built-in atomicAdd(), added in SM 2.0. When tested in
      isolation (test code not provided in this commit), the built-in
      atomicAdd() is two times faster than the old atomic_add() on a K10
      (Kepler), and three times faster on a 950M (Maxwell).
      
      The speedup to forward-backward, however, is marginal for an
      nnet3-chain-train call on the TEDLIUM version 1 dataset:
      
      Times were reported on a K10. Note the speedup in
      BetaDashGeneralFrame(), the only code that calls the atomic add
      function.
      
      New code:
      
      [cudevice profile]
      AddRows	0.468516s
      AddVecVec	0.553152s
      MulRowsVec	0.614542s
      CuMatrix::SetZero	0.649105s
      CopyRows	0.748831s
      TraceMatMat	0.777907s
      AddVecToRows	0.780592s
      CuMatrix::Resize	0.850884s
      AddMat	1.23867s
      CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.04559s
      AddDiagMatMat	2.18652s
      AddMatVec	3.67839s
      AlphaGeneralFrame	6.42574s
      BetaDashGeneralFrame	8.69981s
      AddMatMat	29.9714s
      Total GPU time:	63.8273s (may involve some double-counting)
      -----
      
      Old code:
      
      [cudevice profile]
      AddRows	0.469031s
      AddVecVec	0.553298s
      MulRowsVec	0.615624s
      CuMatrix::SetZero	0.658105s
      CopyRows	0.750856s
      AddVecToRows	0.782937s
      TraceMatMat	0.786361s
      CuMatrix::Resize	0.91639s
      AddMat	1.23964s
      CuMatrixBase::CopyFromMat(from other CuMatrixBase)	2.05253s
      AddDiagMatMat	2.18863s
      AddMatVec	3.68707s
      AlphaGeneralFrame	6.42885s
      BetaDashGeneralFrame	9.03617s
      AddMatMat	29.9942s
      Total GPU time:	64.3928s (may involve some double-counting)
      -----
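For context, the exchange-based style of float atomic add that the built-in atomicAdd() replaces can be sketched on the host with a compare-and-swap loop over the value's bit pattern. This is an illustrative analogue of the pattern, not the removed Kaldi code; the extra loop and double memory traffic are what make it slower than a hardware atomic add.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Emulated float atomic add: retry a CAS on the 32-bit pattern until no
// other thread has changed the word between our read and our write.
float cas_atomic_add(std::atomic<uint32_t>* addr, float val) {
  uint32_t old_bits = addr->load();
  for (;;) {
    float old_f;
    std::memcpy(&old_f, &old_bits, sizeof(float));
    float new_f = old_f + val;
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_f, sizeof(float));
    // On failure, compare_exchange_weak reloads old_bits and we retry.
    if (addr->compare_exchange_weak(old_bits, new_bits))
      return old_f;  // like CUDA's atomicAdd(), return the prior value
  }
}
```

A hardware atomicAdd() performs the read-modify-write in one step, so there is no retry loop at all, which is where the 2-3x win in isolation comes from.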
  3. Sep 13, 2016
  4. Sep 08, 2016
  5. Sep 06, 2016
  6. Sep 05, 2016
  7. Sep 01, 2016
  8. Aug 31, 2016
    • Make wav-copy accept both xspecifiers and xfilenames · 278fcbe8
      Peter Smit authored
      In scripts such as perturb-speed and perturb-volume, scp lines are
      transformed into piped commands with the appropriate sox command.
      The case where the scp file has file offsets was not handled. This
      commit both generalizes the wav-copy command to also work on
      xfilenames and fixes the two perturb scripts to use this command in
      the case of file offsets.
  9. Aug 30, 2016
    • comment about aliasing in AddMatMatDivMat. · 81e20c4c
      Shiyin Kang authored
    • reimpl log softmax · 4c1a86d8
      Shiyin Kang authored
      CuMatrix::LogSoftmax speed in gigaflops, old vs. new:

      float:
        dim   old        new
        16    0.0133804  0.0138019
        32    0.052121   0.056202
        64    0.186255   0.227829
        128   0.65072    0.65638
        256   1.64888    2.15268
        512   3.85136    5.1179
        1024  6.76963    10.8209

      double:
        dim   old        new
        16    0.011373   0.0133584
        32    0.0528196  0.0533796
        64    0.170107   0.202721
        128   0.722198   0.627234
        256   1.44478    1.89987
        512   3.37973    4.14807
        1024  4.96657    6.70849
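The quantity being benchmarked is the numerically stable per-row log-softmax, y_i = x_i - m - log(sum_j exp(x_j - m)) with m = max_j x_j, which requires a max reduction and a sum reduction per row. A small CPU sketch of this computation (illustrative only, not the CUDA kernel):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-softmax of one row: subtracting the row max
// before exponentiating avoids overflow without changing the result.
std::vector<float> log_softmax_row(const std::vector<float>& x) {
  float m = *std::max_element(x.begin(), x.end());  // max reduction
  float sum = 0.0f;
  for (float v : x) sum += std::exp(v - m);         // sum reduction
  float lse = m + std::log(sum);                    // log-sum-exp
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] - lse;
  return y;
}
```

By construction the exponentials of a row of the output sum to one, which is a convenient invariant to check.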
  10. Aug 27, 2016
  11. Aug 26, 2016
  12. Aug 24, 2016
  13. Aug 23, 2016
  14. Aug 21, 2016
  15. Aug 17, 2016
  16. Aug 12, 2016
  17. Aug 11, 2016
  18. Aug 10, 2016
  19. Aug 09, 2016
  20. Aug 08, 2016
  21. Aug 07, 2016
  22. Aug 05, 2016
    • Daniel Povey
    • nnet1: redesigning LSTM, BLSTM code · e2247f32
      vesis84 authored
      - introducing the interface 'MultistreamComponent', which handles
        stream lengths and stream resets,
      - rewriting most of the training tools 'nnet-train-lstm-streams'
        and 'nnet-train-blstm-streams',
      - introducing 'RecurrentComponent' with simple forward recurrence,
      - the LSTM/BLSTM components have clipping presets we recently
        found helpful for the BLSTM-CTC system,
      - renaming tools and components (removing 'streams' from the names),
      - updating the scripts for generating lstm/blstm prototypes,
      - updating the 'rm' lstm/blstm examples.