[src] Optimizations for batch nnet3 (#3351)

The issue fixed here is that small CUDA memory copies are inefficient: each copy can add multiple microseconds of latency. The code as written would copy the small matrices or vectors to and from the tasks one after another. To avoid this I've implemented a batched matrix copy routine. It takes arrays of matrix descriptions for the input and output and performs the copies in a single kernel call. This is used in both FormatInputs and FormatOutputs to reduce launch-latency overhead.

The kernel for the batched copy uses a trick to avoid a memory copy of the host parameters: the parameters are placed in a struct containing a statically sized array, and that struct is then marshalled like normal CUDA kernel parameters. This avoids additional launch-latency overhead.

There is still more work to do at the beginning and end of nnet3. In particular we may want to batch the clamped memory copies and the large number of D2D copies at the end. I haven't fully tracked those down and may return to them in the future.
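As a rough sketch of the parameter-passing trick (not the actual code from this patch; the names `CopyDesc`, `BatchedCopyDesc`, `batched_copy_mats`, and `kMaxBatch` are all hypothetical), the copy descriptors can be embedded directly in the kernel's parameter struct, so they are marshalled with the launch itself and need no separate host-to-device copy:

```cuda
#include <cstdint>

// Hypothetical descriptor for one small matrix copy.
struct CopyDesc {
  const float *src;    // source base pointer (device)
  float *dst;          // destination base pointer (device)
  int32_t rows, cols;  // dimensions of the region to copy
  int32_t src_stride;  // row stride of the source matrix
  int32_t dst_stride;  // row stride of the destination matrix
};

// The descriptor array lives inside the kernel-parameter struct, so it is
// marshalled like any other launch parameter.  kMaxBatch is sized so the
// struct stays under CUDA's 4 KB kernel-parameter limit (64 * 32 B + 4 B).
constexpr int kMaxBatch = 64;
struct BatchedCopyDesc {
  CopyDesc batch[kMaxBatch];
  int32_t num_mats;
};

__global__ void batched_copy_mats(BatchedCopyDesc p) {
  int b = blockIdx.z;  // one z-slice of the grid per matrix in the batch
  if (b >= p.num_mats) return;
  const CopyDesc d = p.batch[b];
  // Grid-stride loops over the rows and columns of this matrix.
  for (int r = blockIdx.y * blockDim.y + threadIdx.y; r < d.rows;
       r += gridDim.y * blockDim.y)
    for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < d.cols;
         c += gridDim.x * blockDim.x)
      d.dst[r * d.dst_stride + c] = d.src[r * d.src_stride + c];
}
```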
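Building on that sketch, a host-side driver might gather the per-task copies into descriptor arrays and launch them in chunks (again hypothetical; the patch describes a single kernel call, and the chunking here only kicks in when more than kMaxBatch descriptors are queued):

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Launch the queued copies in chunks of kMaxBatch.  Each chunk is one
// kernel call, and its descriptors ride along as kernel parameters, so
// no cudaMemcpy of the descriptor array is needed.
void BatchedCopyMats(const std::vector<CopyDesc> &copies,
                     cudaStream_t stream) {
  BatchedCopyDesc params;
  for (size_t i = 0; i < copies.size(); i += kMaxBatch) {
    int n = static_cast<int>(
        std::min<size_t>(kMaxBatch, copies.size() - i));
    std::copy_n(copies.begin() + i, n, params.batch);
    params.num_mats = n;
    dim3 threads(32, 8, 1);  // 256 threads tile each matrix
    dim3 blocks(1, 4, n);    // z-dimension indexes matrices in the batch
    batched_copy_mats<<<blocks, threads, 0, stream>>>(params);
  }
}
```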