[src] Add interfaces to nnet-batch-compute that expect device input. (#3311)
This avoids a ping-pong of memory between device and host. The implementation now assumes device memory. If the data starts on the host, the interfaces allocate device memory and copy to it. Also add a CUDA matrix copy function that clamps rows. This is much faster than copying one row at a time, and the kernel can handle the clamping for free.
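The row clamping can be illustrated with a small CPU reference. This is a sketch under assumptions, not the code from this change: `CopyRowsClamped` and its parameters are hypothetical names, the matrix is assumed to be row-major, and the clamp bounds are assumed to be inclusive. Each destination row picks a source row whose index is clamped into a valid range, so requests that run past either edge repeat the first or last row. In a CUDA kernel the same clamp would presumably be a min/max on the row index each thread already computes, which is why it adds essentially no cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not Kaldi's API).
// Copies rows [start, end) of a row-major matrix `src` with `num_cols`
// columns into a new buffer. Each source row index is clamped into
// [clamp_low, clamp_high] (inclusive, an assumption of this sketch), so
// out-of-range rows repeat the nearest valid row.
std::vector<float> CopyRowsClamped(const std::vector<float> &src,
                                   int num_cols, int start, int end,
                                   int clamp_low, int clamp_high) {
  assert(num_cols > 0 && start <= end && clamp_low <= clamp_high);
  assert(clamp_low >= 0 &&
         static_cast<std::size_t>(clamp_high + 1) * num_cols <= src.size());
  std::vector<float> dst(static_cast<std::size_t>(end - start) * num_cols);
  for (int t = start; t < end; ++t) {
    // Clamp the requested row index to the valid source range.
    int s = std::min(std::max(t, clamp_low), clamp_high);
    std::copy(src.begin() + s * num_cols,
              src.begin() + (s + 1) * num_cols,
              dst.begin() + (t - start) * num_cols);
  }
  return dst;
}
```

For a 3x2 source, requesting rows -2 through 4 with clamp bounds 0 and 2 yields seven rows: the first row three times, the middle row once, and the last row three times.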