CUDA kernel does not work when data is transferred to the GPU. Any problem with my gateway MEX code?
Moein Mozaffarzadeh
on 19 Mar 2021
Commented: Moein Mozaffarzadeh
on 24 May 2021
Hi,
I'm trying to transfer data to a CUDA kernel, do some processing, and get the output back into MATLAB. I have already evaluated the kernel in Visual Studio and it works as it should. However, when I make a MEX file out of the code, it does not produce the output I expect. Here is my MEX gateway code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda.h"
using namespace std;
#include <mex.h>
__global__ void kernel_Reconstruction2(int* Dev_RfData, int* ReconstructedImage_GPU, int transmit, int NStart_Transmit) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
}
void mexFunction(int nlhs, mxArray* plhs[],
int nrhs, const mxArray* prhs[]) {
int* RfData; // RF data; a pinned memory was dedicated to this
int* ReconstructedImage_GPU;
RfData = (int*)mxGetPr(prhs[0]);
plhs[0] = mxCreateNumericMatrix(1, 64 * 64, mxINT32_CLASS, mxREAL);
ReconstructedImage_GPU = (int*)mxGetData(plhs[0]);
int ArrayByteSize_RfData = sizeof(int) * (96 * 96 * 4096);
int BYTES_PER_STREAM = ArrayByteSize_RfData / 96;
//Memory allocation: RfData ; we send the RF data to the device with streaming
int* Device_RfData; // device pointer to the RF data.
(cudaMalloc((int**)&Device_RfData, ArrayByteSize_RfData));
int* Device_ReconstructedImage_GPU; // device pointer to the reconstructed image
int ArrayByteSize_ReconstructedImage_GPU = sizeof(int) * (96*96);
(cudaMalloc((int**)&Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU));
printf("The CUDA reconstruction started... \n");
dim3 block(1024, 1);
dim3 grid(64 * 64, 96);//SystemSetup.NumberOfTransmitter
cudaStream_t* streams = new cudaStream_t[96]; //SystemSetup.NumberOfTransmitter
int NStart_Transmit{};
for (int transmit = 0; transmit < 96; transmit++) {
cudaStreamCreate(&streams[transmit]);
NStart_Transmit = transmit * (96 * 4096);
cudaMemcpyAsync(&Device_RfData[NStart_Transmit], &RfData[NStart_Transmit], BYTES_PER_STREAM, cudaMemcpyHostToDevice, streams[transmit]);
kernel_Reconstruction2<<<grid, block, 0, streams[transmit]>>>(&Device_RfData[NStart_Transmit], Device_ReconstructedImage_GPU, transmit, NStart_Transmit);
(cudaPeekAtLastError());
}
for (int transmit = 0; transmit < 96; transmit++) { cudaStreamDestroy(streams[transmit]); } // destroy the streams
delete[] streams;
cudaDeviceSynchronize();
(cudaMemcpy(ReconstructedImage_GPU, Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU, cudaMemcpyDeviceToHost));
(cudaFree(Device_RfData));
(cudaFree(Device_ReconstructedImage_GPU));
}
I have also checked whether "RfData" contains the actual values on the host; all is good with that as well. So I think there is something wrong with the "Device_RfData" values. Is there anything that I'm missing?
Regards,
Moein.
2 Comments
Richard
on 22 Mar 2021
Hi Moein, could you add a Matlab code example to show how you are calling this function, and what output you expect for that example? (I can't see where ReconstructedImage_GPU is ever altered on the device - are you just expecting a zeros matrix as output and not getting it?)
Accepted Answer
Richard
on 24 Mar 2021
Edited: Richard
on 24 Mar 2021
Hi Moein,
There is no way of allocating MATLAB arrays directly as pinned CUDA memory. Inputs to a mex function will always be allocated by MATLAB's memory manager.
You should be able to allocate your own pinned host memory and copy data from the mxArray into it before then using streams to do the async copy to device. I don't know whether this additional host-side copy will be more expensive than the savings from using streams though.
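A rough sketch of that staging step, assuming the buffer sizes, Device_RfData allocation and stream setup from your posted gateway code (the pinned-buffer name is just illustrative):
// inside mexFunction, after Device_RfData and the streams have been created
int* RfData = (int*)mxGetData(prhs[0]);                      // MATLAB-owned, pageable memory
size_t ArrayByteSize_RfData = sizeof(int) * (96 * 96 * 4096);
size_t BYTES_PER_STREAM = ArrayByteSize_RfData / 96;
int* PinnedRfData = nullptr;
cudaMallocHost((void**)&PinnedRfData, ArrayByteSize_RfData); // page-locked host buffer
memcpy(PinnedRfData, RfData, ArrayByteSize_RfData);          // extra host-side copy (needs <cstring>)
for (int transmit = 0; transmit < 96; transmit++) {
    int NStart_Transmit = transmit * (96 * 4096);
    // async copies from pinned memory can genuinely overlap with kernel execution
    cudaMemcpyAsync(&Device_RfData[NStart_Transmit], &PinnedRfData[NStart_Transmit],
                    BYTES_PER_STREAM, cudaMemcpyHostToDevice, streams[transmit]);
    // ... kernel launch on streams[transmit] as before ...
}
cudaDeviceSynchronize();
cudaFreeHost(PinnedRfData);                                  // release the pinned buffer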
In my tests I was using a basic call to mexcuda:
mexcuda TUI_CUDA.cu MexFunctions.cu
You might also want to call mex -setup to ensure that you have the same compiler selected in MATLAB as you are using in VS.
I do not think it is likely that the differences you are seeing are due to compiler flags. I think it is better to concentrate on validating that the inputs, then the processing steps, and then the outputs match between your VS version and your MATLAB version.
As seen with the output visualization issue, the memory order of arrays is different between the two. There is clearly a 2D (or even 3D) structure to your input data, and you are processing it with kernels that are designed to work on a slice along one of those dimensions. But the input arrays are 1D. Is the input data that you pass in from MATLAB definitely imported correctly? If 2D matrices were involved at some stage then there may be a transpose missing there as well. Have you tried resizing the inputs to an (N-by-ntransmitters) image to check that both of the 1D streams look to have the expected structure?
When you have verified that the input data streams are identical, then you will need to look at exactly where in the kernel processing a deviation occurs. It will likely be very helpful if you can formulate an example input that is much smaller so that you have fewer data items to track and compare between the two systems while you do this. Progressively (temporarily) cutting code out of the kernel until outputs match is another approach to try - you could even go as far as having the kernel simply output thread indices as a basic test to prove each approach is executing the same kernel.
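For example, an index-echo kernel along these lines (purely illustrative, reusing the indexing scheme from your posted kernel) would let you confirm that the launch configuration and the copy back to MATLAB behave identically in both builds:
// illustrative sanity-check kernel: every thread writes its own global index
__global__ void kernel_IndexEcho(int* out, int n) {
    int TID = threadIdx.y * blockDim.x + threadIdx.x;
    int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
    int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
    int GID = RowOFFset + BlockOFFset + TID;
    if (GID < n) {
        out[GID] = GID;
    }
}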
Richard
9 Comments
Richard
on 24 May 2021
Hi Moein,
The direct problem here is that variables in C that you want to reference in future calls to a function need to be declared as static, and their declaration needs to be in an appropriate scope, i.e. not in the if() branch. Only the code that creates the contents of the variables needs to be guarded and run just once. They also should not be freed at the end of the mexfunction, and you should not use cudaResetDevice, because both of those will make the cuda pointers invalid in the next call. You will need to register a function to free the pointers when the mex file is cleared using mexAtExit - see https://www.mathworks.com/help/matlab/apiref/mexatexit.html.
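A minimal sketch of that pattern (hypothetical names; it assumes the same int RF-data buffer as in your code):
#include "mex.h"
#include "cuda_runtime.h"
// device pointer that must survive between calls: static at file scope, not inside if()
static int* Device_RfData = nullptr;
// teardown that runs when the mex file is cleared ("clear mex") or MATLAB exits
static void cleanupDevice(void) {
    if (Device_RfData != nullptr) {
        cudaFree(Device_RfData);
        Device_RfData = nullptr;
    }
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    if (Device_RfData == nullptr) {                       // first call only
        cudaMalloc((void**)&Device_RfData, sizeof(int) * (96 * 96 * 4096));
        // ... one-time host-to-device transfer of the RF data ...
        mexAtExit(cleanupDevice);                         // register the cleanup hook
    }
    // ... reuse Device_RfData on every call; no cudaFree, no cudaDeviceReset here ...
}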
However, I don't think that this approach is the best way to approach the problem. It does not mean that the mex function will work without any user intervention: you will need to clearly document to users that they must set the "TransferToHost" flag whenever certain other input variables have different data, and also that they should not set that flag if they want to see the best performance at other times. This interface is confusing, and it is likely that a user will make a mistake. It also means that the user has no real control over the lifetime of the persistent GPU data - there is no obvious way for them to say "ok, I'm done with your code, now I want to use the whole GPU for something else".
If instead you accept a gpuArray as input to the mex function, you will just need to document that you support gpuArrays as inputs, and that the user should type "otherData = gpuArray(otherData)" before calling your mex function. You can point to MathWorks documentation for how to create a gpuArray, and the lifetime of the persistent data will be controlled by standard Matlab rules for the life of data - it will persist for as long as the user has a variable that contains it, and it will be freed when the variable is cleared.
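A minimal sketch of that interface, using the mxGPUArray API from the Parallel Computing Toolbox and compiled with mexcuda (variable names are placeholders):
#include "mex.h"
#include "gpu/mxGPUArray.h"
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    mxInitGPU();                                           // initialize MATLAB's GPU API
    // wrap the incoming gpuArray; no host-to-device copy happens here
    const mxGPUArray* rf = mxGPUCreateFromMxArray(prhs[0]);
    const int* d_rf = (const int*)mxGPUGetDataReadOnly(rf);
    // ... launch kernels that read d_rf directly ...
    mxGPUDestroyGPUArray(rf);                              // drop the wrapper; MATLAB still owns the device data
}
On the MATLAB side the user calls, for example, otherData = gpuArray(int32(otherData)) once, and the data then stays on the device for as long as that variable exists.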
Looking at your StackOverflow question, it looks like you are making progress on the latter option. The easiest way of compiling these GPU mex functions for MATLAB is to use the mexcuda command inside MATLAB, but I do understand that this is part of a bigger project and you want VS to produce it - I would suggest using mexcuda as a starting point so that you can test and confirm that your code works as expected, and then work on configuring VS (I suspect that you may just need to point VS at the extern/lib/win64/microsoft directory so that it finds the gpu.lib file).
Richard
More Answers (1)
Richard
on 22 Mar 2021
Edited: Richard
on 22 Mar 2021
Moein, there are a couple of issues that I can see in the code.
The first issue is that it is not initializing the GPU device in any way. The easiest way to do this is to add a call to mxInitGPU as documented in https://www.mathworks.com/help/parallel-computing/mxinitgpu.html . This will use MATLAB's normal rules to select the default GPU. I added a simple assignment line in the kernel to produce dummy output values which I also observed were not returned, and then confirmed that adding a call to mxInitGPU at the start of mexFunction() caused the expected values to be returned (after restarting MATLAB to fix the GPU state).
The second issue is that the code sample looks to be over-running array bounds when copying back from Device_ReconstructedImage_GPU to ReconstructedImage_GPU. The output mxArray is created as a 1-by-(64x64) matrix but the array size used on the device is 1-by-(96x96). This is likely to cause MATLAB to crash - the output mxArray needs to match the GPU device array size.
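Putting both fixes together, the start of mexFunction would look roughly like this (a sketch; it assumes gpu/mxGPUArray.h is included and the file is built with mexcuda, and the rest of the gateway code stays as posted):
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    mxInitGPU();   // initialize/select the default GPU before any CUDA calls
    int* RfData = (int*)mxGetData(prhs[0]);
    // output mxArray must match the device buffer size: 1-by-(96*96), not 1-by-(64*64)
    plhs[0] = mxCreateNumericMatrix(1, 96 * 96, mxINT32_CLASS, mxREAL);
    int* ReconstructedImage_GPU = (int*)mxGetData(plhs[0]);
    // ... allocations, streamed copies, kernel launches and the copy back as before ...
}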
7 Comments