CUDA kernel does not work when data is transferred to the GPU. Any problem with my gateway MEX code?
Moein Mozaffarzadeh
on 19 Mar 2021
Commented: Moein Mozaffarzadeh
on 24 May 2021
Hi,
I'm trying to transfer data to a CUDA kernel, do some processing, and get the output back into MATLAB. I have already evaluated the kernel in Visual Studio and it works as it should. However, when I make a MEX file out of the code, it does not produce the output I expect. Here is my MEX gateway code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda.h"
using namespace std;
#include <mex.h>
__global__ void kernel_Reconstruction2(int* Dev_RfData, int* ReconstructedImage_GPU, int transmit, int NStart_Transmit) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
}
void mexFunction(int nlhs, mxArray* plhs[],
int nrhs, const mxArray* prhs[]) {
int* RfData; // RF data; a pinned memory was dedicated to this
int* ReconstructedImage_GPU;
RfData = (int*)mxGetPr(prhs[0]);
plhs[0] = mxCreateNumericMatrix(1, 64 * 64, mxINT32_CLASS, mxREAL);
ReconstructedImage_GPU = (int*)mxGetData(plhs[0]);
int ArrayByteSize_RfData = sizeof(int) * (96 * 96 * 4096);
int BYTES_PER_STREAM = ArrayByteSize_RfData / 96;
//Memory allocation: RfData ; we send the RF data to the device with streaming
int* Device_RfData; // device pointer to the RF data.
(cudaMalloc((int**)&Device_RfData, ArrayByteSize_RfData));
int* Device_ReconstructedImage_GPU; // device pointer to the reconstructed image
int ArrayByteSize_ReconstructedImage_GPU = sizeof(int) * (96*96);
(cudaMalloc((int**)&Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU));
printf("The CUDA reconstruction started... \n");
dim3 block(1024, 1);
dim3 grid(64 * 64, 96);//SystemSetup.NumberOfTransmitter
cudaStream_t* streams = new cudaStream_t[96]; //SystemSetup.NumberOfTransmitter
int NStart_Transmit{};
for (int transmit = 0; transmit < 96; transmit++) {
cudaStreamCreate(&streams[transmit]);
NStart_Transmit = transmit * (96 * 4096);
cudaMemcpyAsync(&Device_RfData[NStart_Transmit], &RfData[NStart_Transmit], BYTES_PER_STREAM, cudaMemcpyHostToDevice, streams[transmit]);
kernel_Reconstruction2<<<grid, block, 0, streams[transmit]>>>(&Device_RfData[NStart_Transmit], Device_ReconstructedImage_GPU, transmit, NStart_Transmit);
(cudaPeekAtLastError());
}
for (int transmit = 0; transmit < 96; transmit++) { cudaStreamDestroy(streams[transmit]); } // destroy the streams
delete[] streams;
cudaDeviceSynchronize();
(cudaMemcpy(ReconstructedImage_GPU, Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU, cudaMemcpyDeviceToHost));
(cudaFree(Device_RfData));
(cudaFree(Device_ReconstructedImage_GPU));
}
I have also checked whether "RfData" contains the actual values on the host; all is good with that as well. So I think there is something wrong with the "Device_RfData" values. Is there anything that I'm missing?
Regards,
Moein.
2 Comments
Richard
on 22 Mar 2021
Hi Moein, could you add a Matlab code example to show how you are calling this function, and what output you expect for that example? (I can't see where ReconstructedImage_GPU is ever altered on the device - are you just expecting a zeros matrix as output and not getting it?)
Accepted Answer
Richard
on 24 Mar 2021
Edited: Richard
on 24 Mar 2021
Hi Moein,
There is no way of allocating MATLAB arrays directly as pinned CUDA memory. Inputs to a mex function will always be allocated by MATLAB's memory manager.
You should be able to allocate your own pinned host memory and copy data from the mxArray into it before then using streams to do the async copy to device. I don't know whether this additional host-side copy will be more expensive than the savings from using streams though.
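A rough sketch of that staging step, assuming the buffer sizes, Device_RfData allocation and stream setup from your posted gateway code (the pinned-buffer name is just illustrative):
// inside mexFunction, after Device_RfData and the streams have been created
int* RfData = (int*)mxGetData(prhs[0]);                      // MATLAB-owned, pageable memory
size_t ArrayByteSize_RfData = sizeof(int) * (96 * 96 * 4096);
size_t BYTES_PER_STREAM = ArrayByteSize_RfData / 96;
int* PinnedRfData = nullptr;
cudaMallocHost((void**)&PinnedRfData, ArrayByteSize_RfData); // page-locked host buffer
memcpy(PinnedRfData, RfData, ArrayByteSize_RfData);          // extra host-side copy (needs <cstring>)
for (int transmit = 0; transmit < 96; transmit++) {
    int NStart_Transmit = transmit * (96 * 4096);
    // async copies from pinned memory can genuinely overlap with kernel execution
    cudaMemcpyAsync(&Device_RfData[NStart_Transmit], &PinnedRfData[NStart_Transmit],
                    BYTES_PER_STREAM, cudaMemcpyHostToDevice, streams[transmit]);
    // ... kernel launch on streams[transmit] as before ...
}
cudaDeviceSynchronize();
cudaFreeHost(PinnedRfData);                                  // release the pinned buffer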
In my tests I was using a basic call to mexcuda:
mexcuda TUI_CUDA.cu MexFunctions.cu
You might also want to call mex -setup to ensure that you have the same compiler selected in MATLAB as you are using in VS.
I do not think it is likely that the differences you are seeing are due to compiler flags. I think it is better to concentrate on validating that the inputs, then the processing steps, and then the outputs match between your VS version and your MATLAB version.
As seen with the output visualization issue, the memory order of arrays is different between the two. There is clearly a 2D (or even 3D) structure to your input data, and you are processing it with kernels that are designed to work on a slice along one of those dimensions. But the input arrays are 1D. Is the input data that you pass in from MATLAB definitely imported correctly? If 2D matrices were involved at some stage then there may be a transpose missing there as well. Have you tried resizing the inputs to an (N-by-ntransmitters) image to check that both of the 1D streams look to have the expected structure?
When you have verified that the input data streams are identical, then you will need to look at exactly where in the kernel processing a deviation occurs. It will likely be very helpful if you can formulate an example input that is much smaller so that you have fewer data items to track and compare between the two systems while you do this. Progressively (temporarily) cutting code out of the kernel until outputs match is another approach to try - you could even go as far as having the kernel simply output thread indices as a basic test to prove each approach is executing the same kernel.
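For example, an index-echo kernel along these lines (purely illustrative, reusing the indexing scheme from your posted kernel) would let you confirm that the launch configuration and the copy back to MATLAB behave identically in both builds:
// illustrative sanity-check kernel: every thread writes its own global index
__global__ void kernel_IndexEcho(int* out, int n) {
    int TID = threadIdx.y * blockDim.x + threadIdx.x;
    int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
    int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
    int GID = RowOFFset + BlockOFFset + TID;
    if (GID < n) {
        out[GID] = GID;
    }
}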
Richard
9 Comments
Richard
on 24 May 2021
Hi Moein,
The direct problem here is that variables in C that you want to reference in future calls to a function need to be declared as static, and their declaration needs to be in an appropriate scope, i.e. not in the if() branch. Only the code that creates the contents of the variables needs to be guarded and run just once. They also should not be freed at the end of the mexfunction, and you should not use cudaResetDevice, because both of those will make the cuda pointers invalid in the next call. You will need to register a function to free the pointers when the mex file is cleared using mexAtExit - see https://www.mathworks.com/help/matlab/apiref/mexatexit.html.
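A minimal sketch of that pattern (hypothetical names; it assumes the same int RF-data buffer as in your code):
#include "mex.h"
#include "cuda_runtime.h"
// device pointer that must survive between calls: static at file scope, not inside if()
static int* Device_RfData = nullptr;
// teardown that runs when the mex file is cleared ("clear mex") or MATLAB exits
static void cleanupDevice(void) {
    if (Device_RfData != nullptr) {
        cudaFree(Device_RfData);
        Device_RfData = nullptr;
    }
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    if (Device_RfData == nullptr) {                       // first call only
        cudaMalloc((void**)&Device_RfData, sizeof(int) * (96 * 96 * 4096));
        // ... one-time host-to-device transfer of the RF data ...
        mexAtExit(cleanupDevice);                         // register the cleanup hook
    }
    // ... reuse Device_RfData on every call; no cudaFree, no cudaDeviceReset here ...
}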
However, I don't think that this approach is the best way to approach the problem. It does not mean that the mex function will work without any user intervention: you will need to clearly document to users that they must set the "TransferToHost" flag whenever certain other input variables have different data, and also that they should not set that flag if they want to see the best performance at other times. This interface is confusing, and it is likely that a user will make a mistake. It also means that the user has no real control over the lifetime of the persistent GPU data - there is no obvious way for them to say "ok, I'm done with your code, now I want to use the whole GPU for something else".
If instead you accept a gpuArray as input to the mex function, you will just need to document that you support gpuArrays as inputs, and that the user should type "otherData = gpuArray(otherData)" before calling your mex function. You can point to MathWorks documentation for how to create a gpuArray, and the lifetime of the persistent data will be controlled by standard Matlab rules for the life of data - it will persist for as long as the user has a variable that contains it, and it will be freed when the variable is cleared.
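A minimal sketch of that interface, using the mxGPUArray API from the Parallel Computing Toolbox and compiled with mexcuda (variable names are placeholders):
#include "mex.h"
#include "gpu/mxGPUArray.h"
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    mxInitGPU();                                           // initialize MATLAB's GPU API
    // wrap the incoming gpuArray; no host-to-device copy happens here
    const mxGPUArray* rf = mxGPUCreateFromMxArray(prhs[0]);
    const int* d_rf = (const int*)mxGPUGetDataReadOnly(rf);
    // ... launch kernels that read d_rf directly ...
    mxGPUDestroyGPUArray(rf);                              // drop the wrapper; MATLAB still owns the device data
}
On the MATLAB side the user calls, for example, otherData = gpuArray(int32(otherData)) once, and the data then stays on the device for as long as that variable exists.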
Looking at your StackOverflow question, it looks like you are making progress on the latter option. The easiest way of compiling these GPU mex functions for MATLAB is to use the mexcuda command inside MATLAB, but I do understand that this is part of a bigger project and you want VS to produce it - I would suggest using mexcuda as a starting point so that you can test and confirm that your code works as expected, and then work on configuring VS (I suspect that you may just need to point VS at the extern/lib/win64/microsoft directory so that it finds the gpu.lib file).
Richard
More Answers (1)
Richard
on 22 Mar 2021
Edited: Richard
on 22 Mar 2021
Moein, there are a couple of issues that I can see in the code.
The first issue is that it is not initializing the GPU device in any way. The easiest way to do this is to add a call to mxInitGPU as documented in https://www.mathworks.com/help/parallel-computing/mxinitgpu.html . This will use MATLAB's normal rules to select the default GPU. I added a simple assignment line in the kernel to produce dummy output values which I also observed were not returned, and then confirmed that adding a call to mxInitGPU at the start of mexFunction() caused the expected values to be returned (after restarting MATLAB to fix the GPU state).
The second issue is that the code sample looks to be over-running array bounds when copying back from Device_ReconstructedImage_GPU to ReconstructedImage_GPU. The output mxArray is created as a 1-by-(64x64) matrix but the array size used on the device is 1-by-(96x96). This is likely to cause MATLAB to crash - the output mxArray needs to match the GPU device array size.
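Putting both fixes together, the start of mexFunction would look roughly like this (a sketch; it assumes gpu/mxGPUArray.h is included and the file is built with mexcuda, and the rest of the gateway code stays as posted):
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    mxInitGPU();   // initialize/select the default GPU before any CUDA calls
    int* RfData = (int*)mxGetData(prhs[0]);
    // output mxArray must match the device buffer size: 1-by-(96*96), not 1-by-(64*64)
    plhs[0] = mxCreateNumericMatrix(1, 96 * 96, mxINT32_CLASS, mxREAL);
    int* ReconstructedImage_GPU = (int*)mxGetData(plhs[0]);
    // ... allocations, streamed copies, kernel launches and the copy back as before ...
}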
7 Comments