CUDAKernel Object crashes GPU

Hi,
I am running a calculation in MATLAB on the GPU using a CUDAKernel object.
It was working fine with a grid size of 41x41, but with other grid sizes the GPU crashes. It does not seem to be a memory problem, since 61x61 works but 55x55 crashes. The calculations themselves are fine (I compared them to a CPU calculation).
When I loaded all my data onto the GPU, I saw before the kernel execution that around 1.5 GB of its 2 GB of 'Dedicated GPU Memory' was still free.
Does the size of the "Result" vector that I send to the GPU change during the calculation? I send all zeros, and then each thread calculates a value for a different cell of the vector.
The error message I get when I close MATLAB is:
NVIDIA OpenGL Driver Unable to recover from a kernel exception. The application must close.
Error code: 3 (subcode 2)
I tried changing the NVIDIA Control Panel global settings to 3D App - Visual Simulation.
This trick worked for the 55x55 grid, but did not solve the problem for other sizes such as 71x71, which makes me think it is a step in the right direction but not quite sufficient.
Thank you very much, I am looking forward to your help.

8 Comments

The most likely explanation is a bug in your CUDAKernel C++ code. You should post that code.
Hi, I added my code below. It has many inputs, which set up the matrix-vector multiplication I aim to calculate. I have a large matrix (M) made of replicas of a smaller matrix (A). Since A is sparse, I use the find() function to send the rows, columns and values of the nonzero elements. The size of A is mHeight-by-mWidth, and each replica is shifted by colShift columns. Each thread calculates the multiplication for a different replica, and the threads do not share rows, so no two threads write to the same element of the array.
#include <stdio.h>
#include <math.h>

__global__ void Non_Transpose_matrix(const double *b,
                                     const unsigned int mHeight,
                                     const unsigned int mWidth,
                                     const unsigned int colShift,
                                     const double *mVal,
                                     const double *mRows,
                                     const double *mCols,
                                     const unsigned int SizeSparse,
                                     const unsigned int numReplica,
                                     double *Result)
{
    // On each block, threadIdx.x ranges from 0 to blockDim.x-1.
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    // Result must be passed in as all zeros!
    // Preconditions: max(mRows) <= mHeight, max(mCols) <= mWidth.
    while (idx < numReplica) { // grid-stride loop over replicas
        for (int i = 0; i < SizeSparse; i = i + 1) {
            int r = mRows[i] - 1 + idx * mHeight;  // mRows holds 1-based MATLAB indices
            int c = mCols[i] - 1 + idx * colShift; // mCols holds 1-based MATLAB indices
            double val = mVal[i];
            Result[r] += val * b[c];
        }
        idx += blockDim.x * gridDim.x;
    }
}
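For reference, the same accumulation can be written as a plain C check on the CPU, with every index bounds-checked before use; an out-of-range r or c then fails an assert on the host instead of crashing the GPU. This is only a sketch: the function name spmv_reference and the extra length arguments bLen/resultLen are mine, not part of the kernel's interface.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for Non_Transpose_matrix: one loop iteration per replica
   instead of one thread, with every index bounds-checked. bLen and
   resultLen are the allocated lengths of b and Result. */
void spmv_reference(const double *b, size_t bLen,
                    unsigned mHeight, unsigned colShift,
                    const double *mVal, const double *mRows,
                    const double *mCols, unsigned SizeSparse,
                    unsigned numReplica,
                    double *Result, size_t resultLen)
{
    for (unsigned idx = 0; idx < numReplica; ++idx) {
        for (unsigned i = 0; i < SizeSparse; ++i) {
            /* mRows/mCols hold 1-based MATLAB indices, hence the -1 */
            size_t r = (size_t)mRows[i] - 1 + (size_t)idx * mHeight;
            size_t c = (size_t)mCols[i] - 1 + (size_t)idx * colShift;
            assert(r < resultLen); /* Result too short -> GPU would crash here */
            assert(c < bLen);      /* b too short -> GPU would crash here */
            Result[r] += mVal[i] * b[c];
        }
    }
}
```

Running this against the gathered inputs and comparing with the gpuArray result is a quick way to localize an illegal access.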
Thank you in advance,
Omer
There are many things that could be wrong with this code if you are not calling it with the right inputs, for instance if b has fewer than mWidth + (numReplica-1)*colShift elements, or if Result is not long enough. So you'd better post the code you use to define the inputs and the launch parameters of the CUDAKernel.
Hi Joss, thank you for your reply. The code where I call this kernel is more complicated and not relevant. I double-checked the sizes of vectors b and Result, and before the call to the kernel all the sizes and initialization are correct. The error occurs only when I call this kernel. Could it be that the Result vector, which occupies 200 MB on the GPU, grows too large for GPU memory during the calculation? It says I have more than 1.3 GB free before I call the kernel.
Omer
Joss Knight on 2 Oct 2018
Edited: Joss Knight on 2 Oct 2018
Of course the calling code is relevant! It's essential that you define your CUDAKernel block and grid dimensions correctly and that you're passing the right data of the right size and shape to the kernel. This is especially important for your kernel because CUDAKernel will not automatically detect the correct block and grid dimensions from your data, because you are processing a chunk of the input with each thread. All I need to know is the properties of your CUDAKernel and the size and shape of all the arguments to feval.
Result cannot grow during your calculation; writing past its end causes an illegal memory access. So Result must be initialized with mHeight*numReplica elements.
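To make that concrete, the minimum lengths implied by the kernel's index arithmetic (r = mRows[i]-1+idx*mHeight and c = mCols[i]-1+idx*colShift, with idx at most numReplica-1) can be written down directly. These two helpers are my own illustration, not part of the thread's code:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest legal length of Result: the largest r is
   (mHeight - 1) + (numReplica - 1) * mHeight = mHeight*numReplica - 1. */
size_t min_result_len(unsigned mHeight, unsigned numReplica) {
    return (size_t)mHeight * numReplica;
}

/* Smallest legal length of b: the largest c is
   (mWidth - 1) + (numReplica - 1) * colShift. */
size_t min_b_len(unsigned mWidth, unsigned colShift, unsigned numReplica) {
    return (size_t)mWidth + (size_t)(numReplica - 1) * colShift;
}
```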
Hi, so the call to the CUDAKernel has the following input logic: the grid size is Nx-by-Ny. In addition I have two inputs, Np and Nt, standing for the number of sampling points and time samples. I have the size Nk=Nx+Np. I send the CUDAKernel a vector b with length(b)=Ny*Nk, and a matrix A_mat with size(A_mat)=[Np*Nt,Ny*Nx]. A_mat is sparse, so I decompose it into three vectors mRows, mCols and mVal, all of length SizeSparse. Note that max(mRows)<=Np*Nt. Therefore mHeight=Np*Nt, mWidth=Ny*Nx, colShift=Ny and numReplica=Np, so in total I go over Np*Nt*Np rows of my matrix, and length(Result)=Np^2*Nt.
Thank you,
Omer
Sorry, I can't interpret all that. Please just display the CUDAKernel object so I can see all its properties, show the line of code where you call feval, call size on all the array input arguments and show me the results, and give me the value of all the scalar input arguments (mWidth, mHeight, colShift, SizeSparse, numReplica).
Hi Joss, Please find the attached code that calls the kernel:
kernel1 = parallel.gpu.CUDAKernel('SpMV_Omer.ptx','SpMV_Omer.cu','Non_Transpose_matrix');
kernel1.ThreadBlockSize = [1024 1];
numReplica = N_p;          % Np = 4*Ny+1
Nk = N_range;              % Nk = Nx+Np-1
Ny = N_x;
b = gpuArray(rand(Ny*Nk,1));
mHeight = size(A_mat,1);   % Np*Nt
mWidth = size(A_mat,2);    % Ny*Nx
colShift = Ny;
[mRows,mCols,mVal] = find(A_mat);
mRows = gpuArray(mRows); mCols = gpuArray(mCols); mVal = gpuArray(mVal);
SizeSparse = size(mVal,1);
Result = zeros(N_p*N_p*N_t,1,'gpuArray');
Result = feval(kernel1,b,mHeight,mWidth,colShift,mVal,mRows,mCols,SizeSparse,numReplica,Result);
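As a cross-check on the sizes above (using the relations stated in the comments: Nk = Nx+Np-1, mHeight = Np*Nt, mWidth = Ny*Nx, colShift = Ny, numReplica = Np), the allocated lengths Ny*Nk and Np*Np*Nt work out to exactly the minimum the kernel's indexing requires. A small C sanity check of that arithmetic (check_sizes is my own illustration, not part of the MATLAB code):

```c
#include <assert.h>

/* Sanity check: with Nk = Nx + Np - 1, mHeight = Np*Nt, mWidth = Ny*Nx,
   colShift = Ny and numReplica = Np, the allocated lengths Ny*Nk and
   Np*Np*Nt match the largest indices the kernel can touch. */
void check_sizes(unsigned Nx, unsigned Ny, unsigned Np, unsigned Nt) {
    unsigned Nk = Nx + Np - 1;
    unsigned long bLen      = (unsigned long)Ny * Nk;      /* length of b */
    unsigned long resultLen = (unsigned long)Np * Np * Nt; /* length of Result */
    /* Largest c: (mWidth - 1) + (numReplica - 1) * colShift */
    unsigned long maxC = ((unsigned long)Ny * Nx - 1) + (unsigned long)(Np - 1) * Ny;
    /* Largest r: (mHeight - 1) + (numReplica - 1) * mHeight */
    unsigned long maxR = ((unsigned long)Np * Nt - 1) + (unsigned long)(Np - 1) * Np * Nt;
    assert(maxC == bLen - 1);      /* b is exactly the minimum length */
    assert(maxR == resultLen - 1); /* Result is exactly the minimum length */
}
```

If these asserts hold for your actual values, the crash is unlikely to be a simple length mismatch in b or Result.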
Thank you for your help,
Omer


Answers (0)

Release: R2017b

Asked: on 27 Sep 2018
Commented: on 8 Oct 2018
