  1. Aug 30, 2016
    • comment about aliasing in AddMatMatDivMat. · 81e20c4c
      Shiyin Kang authored
    • reimpl log softmax · 4c1a86d8
      Shiyin Kang authored
      New: For CuMatrix::LogSoftmax<float>, for dim = 16, speed was 0.0138019 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 16, speed was 0.0133804 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 32, speed was 0.056202 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 32, speed was 0.052121 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 64, speed was 0.227829 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 64, speed was 0.186255 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 128, speed was 0.65638 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 128, speed was 0.65072 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 256, speed was 2.15268 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 256, speed was 1.64888 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 512, speed was 5.1179 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 512, speed was 3.85136 gigaflops.
      New: For CuMatrix::LogSoftmax<float>, for dim = 1024, speed was 10.8209 gigaflops.
      Old: For CuMatrix::LogSoftmax<float>, for dim = 1024, speed was 6.76963 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 16, speed was 0.0133584 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 16, speed was 0.011373 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 32, speed was 0.0533796 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 32, speed was 0.0528196 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 64, speed was 0.202721 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 64, speed was 0.170107 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 128, speed was 0.627234 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 128, speed was 0.722198 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 256, speed was 1.89987 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 256, speed was 1.44478 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 512, speed was 4.14807 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 512, speed was 3.37973 gigaflops.
      New: For CuMatrix::LogSoftmax<double>, for dim = 1024, speed was 6.70849 gigaflops.
      Old: For CuMatrix::LogSoftmax<double>, for dim = 1024, speed was 4.96657 gigaflops.
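The reimplemented kernel computes, per row, the numerically stable log-softmax: subtract the row max, then a log-sum-exp. A host-side C++ sketch of that per-row math (LogSoftmaxRow is an illustrative helper, not the Kaldi kernel itself):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable log-softmax over one row:
// y_j = x_j - max(x) - log(sum_k exp(x_k - max(x)))
std::vector<double> LogSoftmaxRow(const std::vector<double>& x) {
  const double m = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  for (double v : x) sum += std::exp(v - m);  // one pass for the normalizer
  const double log_sum = m + std::log(sum);
  std::vector<double> y(x.size());
  for (size_t j = 0; j < x.size(); ++j) y[j] = x[j] - log_sum;
  return y;
}
```

Subtracting the max keeps every exp() argument non-positive, so nothing overflows even for large logits.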
  2. Aug 27, 2016
  3. Aug 26, 2016
  4. Aug 24, 2016
  5. Aug 23, 2016
  6. Aug 21, 2016
  7. Aug 17, 2016
  8. Aug 04, 2016
  9. Jul 30, 2016
    • Kernel for AddDiagMatMat M*N and M^T*N^T. Both need matrix transpose. · 833db2f1
      Shiyin Kang authored
      (columns: dim, old GFLOPS, new GFLOPS, speedup)
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],   16  0.0138  0.0172 1.24x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],   32  0.0581  0.0646 1.11x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],   64  0.2201  0.2271 1.03x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],  128  0.7907  0.7302 0.92x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],  256  1.9197  2.0379 1.06x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans],  512  3.8760  3.9739 1.03x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans], 1024  5.3297  7.2730 1.36x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans], 2048  4.7379  7.2775 1.54x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans], 4096  4.1652  8.7746 2.11x
      CuVector::AddDiagMatMat<double>[no-trans],[no-trans], 8192  2.7393  9.6129 3.51x
      CuVector::AddDiagMatMat<double>[trans],[trans],   16  0.0137  0.0175 1.28x
      CuVector::AddDiagMatMat<double>[trans],[trans],   32  0.0576  0.0639 1.11x
      CuVector::AddDiagMatMat<double>[trans],[trans],   64  0.2209  0.2254 1.02x
      CuVector::AddDiagMatMat<double>[trans],[trans],  128  0.8055  0.7418 0.92x
      CuVector::AddDiagMatMat<double>[trans],[trans],  256  1.9017  2.0358 1.07x
      CuVector::AddDiagMatMat<double>[trans],[trans],  512  3.8703  3.9644 1.02x
      CuVector::AddDiagMatMat<double>[trans],[trans], 1024  5.2985  7.3149 1.38x
      CuVector::AddDiagMatMat<double>[trans],[trans], 2048  4.9325  7.2759 1.48x
      CuVector::AddDiagMatMat<double>[trans],[trans], 4096  4.1638  8.7515 2.10x
      CuVector::AddDiagMatMat<double>[trans],[trans], 8192  2.6703  9.6149 3.60x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],   16  0.0137  0.0174 1.28x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],   32  0.0576  0.0614 1.07x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],   64  0.2150  0.2367 1.10x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],  128  0.8098  0.7457 0.92x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],  256  1.9851  2.1878 1.10x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans],  512  4.1400  4.3129 1.04x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans], 1024  6.2485  8.0504 1.29x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans], 2048  6.7869 12.2660 1.81x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans], 4096  5.8144 12.1037 2.08x
      CuVector::AddDiagMatMat<float>[no-trans],[no-trans], 8192  3.2519 15.0645 4.63x
      CuVector::AddDiagMatMat<float>[trans],[trans],   16  0.0137  0.0180 1.31x
      CuVector::AddDiagMatMat<float>[trans],[trans],   32  0.0568  0.0672 1.18x
      CuVector::AddDiagMatMat<float>[trans],[trans],   64  0.2193  0.2263 1.03x
      CuVector::AddDiagMatMat<float>[trans],[trans],  128  0.8132  0.7751 0.95x
      CuVector::AddDiagMatMat<float>[trans],[trans],  256  1.9621  2.1918 1.12x
      CuVector::AddDiagMatMat<float>[trans],[trans],  512  4.2527  4.3181 1.02x
      CuVector::AddDiagMatMat<float>[trans],[trans], 1024  6.3149  8.0543 1.28x
      CuVector::AddDiagMatMat<float>[trans],[trans], 2048  6.7934 12.3520 1.82x
      CuVector::AddDiagMatMat<float>[trans],[trans], 4096  5.8246 12.0940 2.08x
      CuVector::AddDiagMatMat<float>[trans],[trans], 8192  3.2314 15.0555 4.66x
      
      reformat code
    • Kernel for AddDiagMatMat M^T * N · c458314e
      Shiyin Kang authored
      (columns: dim, old GFLOPS, new GFLOPS, speedup)
      CuVector::AddDiagMatMat<float>[trans],[no-trans],   16  0.0150  0.0172 1.15x
      CuVector::AddDiagMatMat<float>[trans],[no-trans],   32  0.0593  0.0666 1.12x
      CuVector::AddDiagMatMat<float>[trans],[no-trans],   64  0.2161  0.2533 1.17x
      CuVector::AddDiagMatMat<float>[trans],[no-trans],  128  0.6925  0.9069 1.31x
      CuVector::AddDiagMatMat<float>[trans],[no-trans],  256  1.7409  2.9110 1.67x
      CuVector::AddDiagMatMat<float>[trans],[no-trans],  512  3.5518  6.7235 1.89x
      CuVector::AddDiagMatMat<float>[trans],[no-trans], 1024  5.5328 13.3136 2.41x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],   16  0.0157  0.0179 1.14x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],   32  0.0578  0.0693 1.20x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],   64  0.2088  0.2620 1.25x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],  128  0.7430  0.9503 1.28x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],  256  1.7494  3.0979 1.77x
      CuVector::AddDiagMatMat<double>[trans],[no-trans],  512  3.0646  6.1060 1.99x
      CuVector::AddDiagMatMat<double>[trans],[no-trans], 1024  5.0206  9.4023 1.87x
    • Kernel for AddDiagMatMat M * N^T · 71fe8889
      Shiyin Kang authored
      (columns: dim, old GFLOPS, new GFLOPS, speedup)
      CuVector::AddDiagMatMat<float>[no-trans],[trans],   16  0.0132  0.0188 1.42x
      CuVector::AddDiagMatMat<float>[no-trans],[trans],   32  0.0563  0.0738 1.31x
      CuVector::AddDiagMatMat<float>[no-trans],[trans],   64  0.2220  0.2846 1.28x
      CuVector::AddDiagMatMat<float>[no-trans],[trans],  128  0.8277  0.9890 1.19x
      CuVector::AddDiagMatMat<float>[no-trans],[trans],  256  3.4564  3.3012 0.96x
      CuVector::AddDiagMatMat<float>[no-trans],[trans],  512  7.8546  8.6339 1.10x
      CuVector::AddDiagMatMat<float>[no-trans],[trans], 1024 14.4238 16.4371 1.14x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],   16  0.0138  0.0175 1.27x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],   32  0.0561  0.0715 1.27x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],   64  0.2280  0.2765 1.21x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],  128  0.9059  0.9130 1.01x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],  256  3.2346  2.9633 0.92x
      CuVector::AddDiagMatMat<double>[no-trans],[trans],  512  5.7313  6.6734 1.16x
      CuVector::AddDiagMatMat<double>[no-trans],[trans], 1024  9.2105 10.1042 1.10x
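All three AddDiagMatMat kernels above compute the same quantity, v = beta*v + alpha*diag(op(M)*op(N)), differing only in which operand is read transposed (hence the different memory-access patterns and speedups). A host-side reference sketch, not Kaldi's API (Mat and AddDiagMatMatRef are illustrative names):

```cpp
#include <vector>

// Row-major matrix wrapper for the sketch below.
struct Mat {
  int rows, cols;
  std::vector<double> a;  // row-major storage
  double operator()(int r, int c) const { return a[r * cols + c]; }
};

// v = beta * v + alpha * diag(op(M) * op(N)), where op() optionally
// transposes. Element i of the diagonal is the dot product of row i of
// op(M) with column i of op(N).
void AddDiagMatMatRef(double alpha, const Mat& M, bool transM,
                      const Mat& N, bool transN, double beta,
                      std::vector<double>* v) {
  const int dim = static_cast<int>(v->size());
  const int inner = transM ? M.rows : M.cols;  // summation length
  for (int i = 0; i < dim; ++i) {
    double dot = 0.0;
    for (int k = 0; k < inner; ++k) {
      const double m = transM ? M(k, i) : M(i, k);  // row i of op(M)
      const double n = transN ? N(i, k) : N(k, i);  // column i of op(N)
      dot += m * n;
    }
    (*v)[i] = beta * (*v)[i] + alpha * dot;
  }
}
```

Only the diagonal of the product is needed, so this is O(dim * inner) rather than a full matrix multiply; the kernel's job is to make the two access patterns (one stride-1, one strided) both coalesce.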
  10. Jul 24, 2016
  11. Jul 16, 2016
    • speed test and unit test for diff group pnorm · 1fc49b12
      Shiyin Kang authored
      bench result (columns: dim, new GFLOPS, old GFLOPS, speedup):
      CuMatrix::DiffGroupPnorm<float>,    16   0.019   0.009  2.11x
      CuMatrix::DiffGroupPnorm<float>,    32   0.074   0.036  2.06x
      CuMatrix::DiffGroupPnorm<float>,    64   0.297   0.142  2.10x
      CuMatrix::DiffGroupPnorm<float>,   128   1.142   0.520  2.20x
      CuMatrix::DiffGroupPnorm<float>,   256   3.442   1.553  2.22x
      CuMatrix::DiffGroupPnorm<float>,   512   6.856   2.943  2.33x
      CuMatrix::DiffGroupPnorm<float>,  1024  11.653   3.915  2.98x
      CuMatrix::DiffGroupPnorm<float>,  2048  13.812   4.263  3.24x
      CuMatrix::DiffGroupPnorm<float>,  4096  14.431   4.381  3.29x
      CuMatrix::DiffGroupPnorm<double>,    16   0.019   0.009  2.17x
      CuMatrix::DiffGroupPnorm<double>,    32   0.073   0.033  2.20x
      CuMatrix::DiffGroupPnorm<double>,    64   0.296   0.133  2.22x
      CuMatrix::DiffGroupPnorm<double>,   128   1.068   0.457  2.34x
      CuMatrix::DiffGroupPnorm<double>,   256   2.999   1.159  2.59x
      CuMatrix::DiffGroupPnorm<double>,   512   4.921   1.705  2.89x
      CuMatrix::DiffGroupPnorm<double>,  1024   6.932   1.993  3.48x
      CuMatrix::DiffGroupPnorm<double>,  2048   7.499   2.087  3.59x
      CuMatrix::DiffGroupPnorm<double>,  4096   7.684   2.104  3.65x
      
      fix bug
      
      unit test for diff group pnorm
      
      easy test for now
      
      back to full test
      
      fix p=inf for MatrixBase::GroupPnormDeriv
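DiffGroupPnorm backpropagates through the group p-norm y = ||x_group||_p. For finite p the derivative is dy/dx_j = x_j |x_j|^(p-2) y^(1-p); for p = inf the gradient flows only to the element(s) attaining the max magnitude, which is the case the "fix p=inf" note above addresses. A hedged one-group sketch (GroupPnormDerivRef is an illustrative name, not Kaldi's function):

```cpp
#include <cmath>
#include <vector>

// dy/dx_j for y = (sum_j |x_j|^p)^(1/p) over one group.
std::vector<double> GroupPnormDerivRef(const std::vector<double>& x,
                                       double y, double p) {
  std::vector<double> d(x.size(), 0.0);
  if (y == 0.0) return d;  // zero norm: derivative taken as 0
  for (size_t j = 0; j < x.size(); ++j) {
    if (x[j] == 0.0) continue;  // avoids 0 * inf for p < 2
    if (std::isinf(p)) {
      // p = inf: subgradient is sign(x_j) at the max-magnitude element.
      if (std::fabs(x[j]) == y) d[j] = (x[j] > 0 ? 1.0 : -1.0);
    } else {
      d[j] = x[j] * std::pow(std::fabs(x[j]), p - 2.0) * std::pow(y, 1.0 - p);
    }
  }
  return d;
}
```

For p = 2 this reduces to the familiar x_j / y; for p = 1 it reduces to sign(x_j).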
    • new kernel: _diff_group_pnorm · 58c8f0f4
      Shiyin Kang authored
      standard inf
      
      del TODO
    • move pnorm back prop to cumatrix · a338e533
      Shiyin Kang authored
      fix bug
    • Add script to automatically put the Kaldi libraries we link with in the right order; use it to modify the Makefiles. Minor top-level Makefile fix. · dbae7fa1
      Daniel Povey authored
  12. Jul 15, 2016
  13. Jul 08, 2016
    • re-impl softmax: fewer __syncthreads() calls, arithmetic ops, and global memory accesses · 42352b63
      Shiyin Kang authored
      New: For CuMatrix::Softmax<float>, for dim = 16, speed was 0.0153621 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 16, speed was 0.0138999 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 32, speed was 0.0614275 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 32, speed was 0.0507328 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 64, speed was 0.235765 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 64, speed was 0.203548 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 128, speed was 0.729239 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 128, speed was 0.725481 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 256, speed was 2.30126 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 256, speed was 1.71863 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 512, speed was 5.0565 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 512, speed was 3.69659 gigaflops.
      New: For CuMatrix::Softmax<float>, for dim = 1024, speed was 10.2482 gigaflops.
      Old: For CuMatrix::Softmax<float>, for dim = 1024, speed was 6.38335 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 16, speed was 0.0143354 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 16, speed was 0.013143 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 32, speed was 0.0590478 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 32, speed was 0.0495458 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 64, speed was 0.228611 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 64, speed was 0.193465 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 128, speed was 0.668961 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 128, speed was 0.676449 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 256, speed was 2.1013 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 256, speed was 1.51862 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 512, speed was 4.13055 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 512, speed was 3.1547 gigaflops.
      New: For CuMatrix::Softmax<double>, for dim = 1024, speed was 6.43429 gigaflops.
      Old: For CuMatrix::Softmax<double>, for dim = 1024, speed was 5.02974 gigaflops.
      
      minor changes
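The per-row arithmetic the softmax kernel performs is the standard stable form: subtract the row max, exponentiate once, normalize by the sum. This host-side sketch mirrors only the math; the CUDA version does the max and sum passes as shared-memory reductions, which is where the __syncthreads() savings come from (SoftmaxRow is an illustrative name):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable per-row softmax: y_j = exp(x_j - max(x)) / sum.
std::vector<double> SoftmaxRow(const std::vector<double>& x) {
  const double m = *std::max_element(x.begin(), x.end());
  std::vector<double> y(x.size());
  double sum = 0.0;
  for (size_t j = 0; j < x.size(); ++j) {
    y[j] = std::exp(x[j] - m);  // compute exp once, reuse for the sum
    sum += y[j];
  }
  for (double& v : y) v /= sum;  // single normalization pass
  return y;
}
```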
  14. Jun 26, 2016
  15. Jun 25, 2016
    • _diff_softmax kernel: 4 reads and 1 write. · 6b8eefbb
      Shiyin Kang authored
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 16, speed was 0.0165568 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 16, speed was 0.00355242 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 32, speed was 0.0678791 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 32, speed was 0.0145515 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 64, speed was 0.24739 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 64, speed was 0.0583246 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 128, speed was 0.898427 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 128, speed was 0.225076 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 256, speed was 2.89009 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 256, speed was 0.834096 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 512, speed was 6.72164 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 512, speed was 1.92722 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 1024, speed was 10.4916 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<float>, for dim = 1024, speed was 2.78281 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 16, speed was 0.0148584 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 16, speed was 0.00260567 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 32, speed was 0.0586865 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 32, speed was 0.0121077 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 64, speed was 0.22893 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 64, speed was 0.0527767 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 128, speed was 0.763462 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 128, speed was 0.175736 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 256, speed was 2.40457 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 256, speed was 0.58351 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 512, speed was 4.55165 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 512, speed was 1.42464 gigaflops.
      New: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 1024, speed was 4.36421 gigaflops.
      Old: For CuMatrix::DiffSoftmaxPerRow<double>, for dim = 1024, speed was 1.94971 gigaflops.
    • add speed test and unit test · 619889a1
      Shiyin Kang authored
    • mv diffsoftmax to cumatrix · 69ccd5ce
      Shiyin Kang authored
  16. Jun 23, 2016
  17. Jun 08, 2016
    • full unit test for group_spec_norm with special p · 1bad1143
      Shiyin Kang authored
          stronger unit test
    • fast GroupPnorm for p=0,1,2,inf with group transform reduce kernel template · af843b6e
      Shiyin Kang authored
      New: For CuMatrix::GroupPnorm<float>, for dim = 16, speed was 0.014416 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 16, speed was 0.0138561 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 32, speed was 0.0616648 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 32, speed was 0.0542906 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 64, speed was 0.241291 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 64, speed was 0.213442 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 128, speed was 0.869675 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 128, speed was 0.821949 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 256, speed was 3.07193 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 256, speed was 2.90466 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 512, speed was 8.8404 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 512, speed was 6.48644 gigaflops.
      New: For CuMatrix::GroupPnorm<float>, for dim = 1024, speed was 16.7489 gigaflops.
      Old: For CuMatrix::GroupPnorm<float>, for dim = 1024, speed was 9.3791 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 16, speed was 0.0159731 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 16, speed was 0.0101083 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 32, speed was 0.0605624 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 32, speed was 0.0393037 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 64, speed was 0.249944 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 64, speed was 0.153672 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 128, speed was 0.840825 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 128, speed was 0.598191 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 256, speed was 3.13722 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 256, speed was 1.78274 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 512, speed was 6.86864 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 512, speed was 2.96384 gigaflops.
      New: For CuMatrix::GroupPnorm<double>, for dim = 1024, speed was 12.5614 gigaflops.
      Old: For CuMatrix::GroupPnorm<double>, for dim = 1024, speed was 3.79237 gigaflops.
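GroupPnorm maps each group of G consecutive elements to its p-norm, and the commit's kernel specializes p = 0, 1, 2 and inf so the common cases avoid pow(). A host-side sketch of the p = 1, 2, general and inf cases (GroupPnormRef is an illustrative name; the p = 0 specialization is omitted here, so consult the Kaldi source for its exact definition):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// out[g] = (sum_{j in group g} |x_j|^p)^(1/p); p = inf gives max_j |x_j|.
std::vector<double> GroupPnormRef(const std::vector<double>& x, int G,
                                  double p) {
  std::vector<double> out(x.size() / G, 0.0);
  for (size_t g = 0; g < out.size(); ++g) {
    double acc = 0.0;
    for (int j = 0; j < G; ++j) {
      const double a = std::fabs(x[g * G + j]);
      if (std::isinf(p))    acc = std::max(acc, a);
      else if (p == 1.0)    acc += a;      // specialization: no pow()
      else if (p == 2.0)    acc += a * a;  // specialization: no pow()
      else                  acc += std::pow(a, p);
    }
    if (std::isinf(p))      out[g] = acc;
    else if (p == 1.0)      out[g] = acc;
    else if (p == 2.0)      out[g] = std::sqrt(acc);
    else                    out[g] = std::pow(acc, 1.0 / p);
  }
  return out;
}
```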
    • generalize group_max vec_reduce mat_col_reduce to *_transform_reduce · 0f9625f2
      Shiyin Kang authored
          loop unroll by template
      
          generalize to group transform reduce
      
          _transform_reduce for vec, mat-col and group
      
          fix min bug
      
          fix bug
      
          fix template param bug
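The generalization can be pictured as one template parameterized by a per-element transform and a binary reduction, which group-max (identity + max), vector reductions and matrix-column reductions all instantiate. A minimal host-side sketch of the shape, not Kaldi's kernel template:

```cpp
#include <cstddef>

// Generic transform-reduce: apply Transform to each element, combine the
// results with Reduce. Different (Transform, Reduce) pairs give max,
// sum-of-squares, p-norm accumulation, etc.
template <typename T, typename Transform, typename Reduce>
T TransformReduce(const T* x, std::size_t n, T init,
                  Transform transform, Reduce reduce) {
  T acc = init;
  for (std::size_t i = 0; i < n; ++i)
    acc = reduce(acc, transform(x[i]));
  return acc;
}
```

On the GPU the same idea is unrolled with a compile-time block size (the "loop unroll by template" note above) and the combine step runs as a shared-memory tree.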
  18. Jun 04, 2016
    • Parallel group max using multiple threads per group. · a1b2f2bd
      Shiyin Kang authored
      Good performance on large group sizes (>10).
      New: For CuMatrix::GroupMax<float>, for dim = 16, speed was 0.0190836 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 16, speed was 0.0193129 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 32, speed was 0.0791846 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 32, speed was 0.0768508 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 64, speed was 0.311131 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 64, speed was 0.299519 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 128, speed was 1.13589 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 128, speed was 1.14847 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 256, speed was 4.22264 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 256, speed was 3.92072 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 512, speed was 12.2629 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 512, speed was 10.0812 gigaflops.
      New: For CuMatrix::GroupMax<float>, for dim = 1024, speed was 21.6979 gigaflops.
      Old: For CuMatrix::GroupMax<float>, for dim = 1024, speed was 16.5123 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 16, speed was 0.0188551 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 16, speed was 0.0163827 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 32, speed was 0.0701613 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 32, speed was 0.0620238 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 64, speed was 0.271106 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 64, speed was 0.215268 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 128, speed was 0.931745 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 128, speed was 0.723582 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 256, speed was 3.53189 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 256, speed was 1.9751 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 512, speed was 9.95109 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 512, speed was 3.91183 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 1024, speed was 17.2099 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<float>, for dim = 1024, speed was 4.92671 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 16, speed was 0.0199497 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 16, speed was 0.0148693 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 32, speed was 0.079538 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 32, speed was 0.0718237 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 64, speed was 0.314509 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 64, speed was 0.237838 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 128, speed was 1.08104 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 128, speed was 0.788395 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 256, speed was 3.7741 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 256, speed was 2.87856 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 512, speed was 8.65988 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 512, speed was 5.87111 gigaflops.
      New: For CuMatrix::GroupMax<double>, for dim = 1024, speed was 14.0373 gigaflops.
      Old: For CuMatrix::GroupMax<double>, for dim = 1024, speed was 8.88655 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 16, speed was 0.0174585 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 16, speed was 0.0136057 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 32, speed was 0.0694617 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 32, speed was 0.0500527 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 64, speed was 0.265809 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 64, speed was 0.177945 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 128, speed was 0.973417 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 128, speed was 0.588654 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 256, speed was 3.43166 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 256, speed was 1.57864 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 512, speed was 8.26032 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 512, speed was 3.14173 gigaflops.
      New: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 1024, speed was 12.1338 gigaflops.
      Old: For CuMatrix::GroupMax (all group sizes)<double>, for dim = 1024, speed was 3.05406 gigaflops.
      
      fix typo; rename and comment
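Assigning multiple threads to one group means each thread scans a strided slice of the group, after which the per-thread partials are combined by a tree reduction. A host-side emulation of that schedule (GroupMaxParallelRef is an illustrative name; it assumes threads is a power of two):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

double GroupMaxParallelRef(const std::vector<double>& group, int threads) {
  std::vector<double> partial(threads,
                              -std::numeric_limits<double>::infinity());
  for (int t = 0; t < threads; ++t)  // "thread" t takes elements t, t+T, ...
    for (std::size_t j = t; j < group.size(); j += threads)
      partial[t] = std::max(partial[t], group[j]);
  for (int step = threads / 2; step > 0; step /= 2)  // tree combine
    for (int t = 0; t < step; ++t)
      partial[t] = std::max(partial[t], partial[t + step]);
  return partial[0];
}
```

This is why the commit notes good gains for large group sizes (>10): with one thread per group, big groups serialize, while the strided-plus-tree schedule keeps all threads busy.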
    • Speed test on all possible group sizes given the dim. · 763b27a5
      Shiyin Kang authored
      fix space
  19. May 30, 2016
    • fix bug on _div_rows_vec; faster DivRowsVec on large float matrix · bb589475
      Shiyin Kang authored
      New: For CuMatrix::DivRowsVec<float>, for dim = 16, speed was 0.0180391 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 16, speed was 0.017677 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 32, speed was 0.0686798 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 32, speed was 0.0682798 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 64, speed was 0.290613 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 64, speed was 0.273113 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 128, speed was 1.12576 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 128, speed was 1.08792 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 256, speed was 3.79354 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 256, speed was 3.48151 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 512, speed was 9.247 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 512, speed was 8.70703 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 1024, speed was 16.535 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 1024, speed was 12.8467 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 2048, speed was 21.0912 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 2048, speed was 14.6946 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 4096, speed was 21.8187 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 4096, speed was 15.1197 gigaflops.
      New: For CuMatrix::DivRowsVec<float>, for dim = 8192, speed was 20.9238 gigaflops.
      Old: For CuMatrix::DivRowsVec<float>, for dim = 8192, speed was 15.2273 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 16, speed was 0.0171395 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 16, speed was 0.0173988 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 32, speed was 0.0708914 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 32, speed was 0.0745867 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 64, speed was 0.302615 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 64, speed was 0.279866 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 128, speed was 1.12123 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 128, speed was 1.15183 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 256, speed was 3.73959 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 256, speed was 3.61588 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 512, speed was 6.75394 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 512, speed was 6.86088 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 1024, speed was 10.2967 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 1024, speed was 9.63553 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 2048, speed was 11.3301 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 2048, speed was 10.9322 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 4096, speed was 11.063 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 4096, speed was 10.7829 gigaflops.
      New: For CuMatrix::DivRowsVec<double>, for dim = 8192, speed was 10.6967 gigaflops.
      Old: For CuMatrix::DivRowsVec<double>, for dim = 8192, speed was 10.6246 gigaflops.
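DivRowsVec(v) divides every element of row i by v(i), e.g. to normalize exponentiated rows by their row sums in softmax. A minimal row-major sketch of those semantics (DivRowsVecRef is an illustrative name, not the kernel):

```cpp
#include <vector>

// m is rows x cols, row-major; out(i, j) = m(i, j) / v[i].
void DivRowsVecRef(std::vector<double>* m, int rows, int cols,
                   const std::vector<double>& v) {
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      (*m)[i * cols + j] /= v[i];
}
```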
    • Shiyin Kang · 282d228c
  20. May 29, 2016
    • CuMatrixBase::Max, Min and Sum by reduce_mat_cols · 8816d95d
      Shiyin Kang authored
          New sum:
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 16, speed was 0.00954969 gigaflops, result = 26.2034
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 32, speed was 0.0400381 gigaflops, result = 17.9455
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 64, speed was 0.152595 gigaflops, result = 31.6159
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 128, speed was 0.546459 gigaflops, result = 81.3117
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 256, speed was 1.94432 gigaflops, result = 572.224
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 512, speed was 6.23377 gigaflops, result = 39.6669
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 1024, speed was 16.0119 gigaflops, result = 518.841
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 16, speed was 0.00916145 gigaflops, result = -2.71724
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 32, speed was 0.0366853 gigaflops, result = 43.261
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 64, speed was 0.144912 gigaflops, result = 30.3323
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 128, speed was 0.501765 gigaflops, result = -152.665
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 256, speed was 1.83353 gigaflops, result = -355.256
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 512, speed was 5.609 gigaflops, result = 744.185
          LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 1024, speed was 12.6693 gigaflops, result = -857.049
      8816d95d
    • Shiyin Kang's avatar
      Speed tests and unit tests for cu matrix max/min/sum · a831e9af
      Shiyin Kang authored
      Old sum:
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 16, speed was 0.00340611 gigaflops, result = 26.2034
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 32, speed was 0.0141018 gigaflops, result = 17.9455
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 64, speed was 0.0575425 gigaflops, result = 31.6159
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 128, speed was 0.229418 gigaflops, result = 81.3117
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 256, speed was 0.778943 gigaflops, result = 572.224
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 512, speed was 3.11055 gigaflops, result = 39.6668
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<float>, for dim = 1024, speed was 7.50506 gigaflops, result = 518.842
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 16, speed was 0.00216499 gigaflops, result = -2.71724
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 32, speed was 0.00863257 gigaflops, result = 43.261
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 64, speed was 0.0513208 gigaflops, result = 30.3323
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 128, speed was 0.20313 gigaflops, result = -152.665
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 256, speed was 0.759338 gigaflops, result = -355.256
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 512, speed was 2.60159 gigaflops, result = 744.185
      LOG (TestCuMatrixSum():cu-matrix-speed-test.cc:57) For CuMatrix::TestCuMatrixSum<double>, for dim = 1024, speed was 7.2258 gigaflops, result = -857.049
      a831e9af
  21. May 28, 2016
  22. May 27, 2016
    • Shiyin Kang's avatar
      _vector_reduce kernel template for CuVector Sum, Max and Min. · b19ab6b3
      Shiyin Kang authored
      Add test to choose min length of vectors to be reduced on GPU.
      
      New:
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 16, speed was 0.000886179 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 32, speed was 0.00119834 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 64, speed was 0.00182674 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 128, speed was 0.00721178 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 256, speed was 0.0166563 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 1024, speed was 0.0626621 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 2048, speed was 0.108495 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 4096, speed was 0.162914 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 8192, speed was 0.248687 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 16384, speed was 0.491677 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 32768, speed was 0.931507 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 65536, speed was 1.75797 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 16, speed was 0.00116685 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 32, speed was 0.00229885 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 64, speed was 0.00430313 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 128, speed was 0.00840191 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 256, speed was 0.0156417 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 1024, speed was 0.051799 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 2048, speed was 0.09064 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 4096, speed was 0.122844 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 8192, speed was 0.241084 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 16384, speed was 0.468114 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 32768, speed was 0.859946 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 65536, speed was 1.53817 gigaflops.
      
      Old:
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 16, speed was 0.000461866 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 32, speed was 0.000936284 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 64, speed was 0.00180461 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 128, speed was 0.00350883 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 256, speed was 0.00700597 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 1024, speed was 0.0273135 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 2048, speed was 0.0529984 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 4096, speed was 0.0930953 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 8192, speed was 0.149376 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 16384, speed was 0.197131 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 32768, speed was 0.492249 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<float>, for dim = 65536, speed was 0.657485 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 16, speed was 0.000406633 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 32, speed was 0.000836551 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 64, speed was 0.00167463 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 128, speed was 0.00338708 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 256, speed was 0.00668978 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 1024, speed was 0.0253556 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 2048, speed was 0.0510465 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 4096, speed was 0.081494 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 8192, speed was 0.156451 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 16384, speed was 0.311666 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 32768, speed was 0.545834 gigaflops.
      LOG (TestCuVectorSum():cu-vector-speed-test.cc:72) For CuVector::Sum<double>, for dim = 65536, speed was 0.914985 gigaflops.
      
fix vector sum bug.

delete old kernels.

use the correct way to inline.

only do this when we have CUDA.
      b19ab6b3
  23. May 19, 2016