gather takes really long after using PTX file / CUDA

I am trying to implement a matrix multiplication with CUDA via a PTX file to gain an advantage over MATLAB's built-in functions. My .cu code calculates the matrix multiplication faster than MATLAB's built-in operator, but the gather command after running the kernel takes much longer than it does after using MATLAB commands and gpuArray:
Here is my MATLAB script comparing both approaches:
clear all
g=gpuDevice
reset(g)
N=1024;
A=rand(N,N);
B=rand(N,N);
%gpuDevice using Matlab
A_gpu=gpuArray(A);%Create array on GPU
B_gpu=gpuArray(B);%Create array on GPU
tic
C_gpu=A_gpu*B_gpu;
toc
tic
C=gather(C_gpu);%copy array from GPU to CPU
toc
%now using CUDA
A=A';
a_gpu=gpuArray(A(:)');%Create array on GPU make vector
b_gpu=gpuArray(B(:)');%Create array on GPU make vector
c_gpu=gpuArray(zeros(N*N,1));
k = parallel.gpu.CUDAKernel('matrixmul.ptx', 'matrixmul.cu');
k.ThreadBlockSize = [N,1,1];%N threads per block: one per term of the dot product
k.GridSize=[N,N];%one block per element of the result matrix
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
toc
tic
c2=gather(o);%back to host
toc
%check
max(max(abs(C-c2)))
My .cu file looks like this:
__global__ void matrixmul( double *c, double *a, double *b)
{
    __shared__ double cache[1024];
    int cacheIndex = threadIdx.x;
    int Aind = threadIdx.x + blockIdx.x * gridDim.y;
    int Bind = threadIdx.x + blockIdx.y * gridDim.x;
    // each thread multiplies one pair of elements
    cache[cacheIndex] = a[Aind] * b[Bind];
    __syncthreads();
    // parallel reduction in shared memory to sum the products
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    // thread 0 writes the dot product to the output element
    if (cacheIndex == 0)
        c[blockIdx.y * gridDim.y + blockIdx.x] = cache[0];
}
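(The PTX file is generated from the .cu source beforehand; assuming the standard CUDA toolchain, something like the following command produces matrixmul.ptx:)
nvcc -ptx matrixmul.cu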
In my version I use vectors directly instead of matrices, and I transpose the first matrix before starting the calculation to take advantage of the better ordering inside the vector for the memory accesses.
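To illustrate the ordering, here is a small sketch of the resulting layout (the 2-by-2 example is my addition, only to show the ordering; it is not part of the timing runs below):
A = [1 2; 3 4];
At = A';
At(:)'  %gives 1 2 3 4: the rows of A are now contiguous
B = [5 6; 7 8];
B(:)'   %gives 5 7 6 8: the columns of B are contiguous (MATLAB stores column-major)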
That's what I get back:
Elapsed time is 0.110911 seconds.
Elapsed time is 0.007010 seconds.
Elapsed time is 0.001937 seconds.
Elapsed time is 3.651635 seconds.
ans =
1.0800e-12
As you can see, the first gather command takes only 0.007 seconds, while the second one needs more than 3 seconds. Also, if I put all the calling code into a function, the call to that function also takes a long time (without even reading the gpuArray back).
Any suggestions on what is going wrong here?
Thanks
Robert

Answers (1)

James Lebak on 24 Jan 2013
In MATLAB R2012a and later, GPU functions execute asynchronously in MATLAB. To get accurate timings, you need to call the wait function to make sure that GPU execution is finished. To accurately measure the time taken by MATLAB's multiply or by your kernel, rewrite your code as follows:
tic
C_gpu=A_gpu*B_gpu;
wait(g); % g is the value returned by gpuDevice, above
toc
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
wait(g); % g is the value returned by gpuDevice, above
toc
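With the kernel's execution accounted for inside its own tic/toc block, the subsequent gather should then only measure the device-to-host copy. A minimal sketch (my addition, reusing o from the code above):
tic
c2 = gather(o); % with the kernel already finished, this times only the GPU-to-host transfer
toc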
