How can I do parallel computation on a GPU?

I am new to parallel and GPU computing.
I understand that parfor does not work on the GPU.
I would like to do an operation similar to parfor on the GPU; I checked that I can go up to 512 workers.
I am wondering how I can assign a worker to each GPU processor (I only have 1 gpuDevice), so that I can maybe use a for loop to do the operation. Or does arrayfun + gpuArray amount to parallel GPU computing?
For example, when doing a matrix operation, each GPU processor would handle one row or column of the matrix.
Some info:
CUDADevice with properties:
Name: 'GeForce GTX 1060 3GB'
Index: 1
ComputeCapability: '6.1'
SupportsDouble: 1
DriverVersion: 10
ToolkitVersion: 9.1000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 3.2212e+09
AvailableMemory: 2.5223e+09
MultiprocessorCount: 9
ClockRateKHz: 1708500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Matt J on 31 Jan 2020
Comment moved here:
I don't understand why one so systematically sees so much confusion between data-parallel computing, instruction (job)-parallel computing, and the mix of both.
The fact that some particular environment does not allow one of them does not imply that the need for that kind of computing is wrong or impossible to satisfy in any way.
Translating CPU array computing to GPU array computing is one aspect of the question ("To be or not to be?!" (Shakespeare)).
But it is easy to imagine something very recurrent in any long numerical project:
for k = 1:N
    R(k) = someFunction(A(k,:,:, ...), B, C);
end
The number of ':' inside the parentheses determines the number of dimensions to take as slices of the array A.
If someFunction is a very complicated mix of A and other arrays B, C, etc., forget about translating the CPU functions involving BLAS (via gpuArray) to the corresponding GPU cuBLAS libraries.
By the way, arrayfun (built into MATLAB) is not of much help; it has too many restrictions. The only solution might be a translation of PARFOR (which is fine for CPU data) to the GPU. It does not exist inside MATLAB. Maybe in the next release of the PCT toolbox :-)
Please try to answer this question instead of telling MATLAB users that transforming every variable into a gpuArray and keeping the FOR as a simple sequential loop will solve it. One may win for a static computation, but in the case of a very long sequential FOR (N = 1000 or 10000), moving everything to the GPU may become dramatically slow (the clock frequency of the GPU's shaders is lower than that of the CPU cores, and the cache memory is drastically smaller).
The equivalent CUDA FOR loop (which any MATLAB user tries to avoid writing) is very fast because the loop is compiled to be executed IN PARALLEL on many shaders, not sequentially!
Regards


Accepted Answer

Matt J on 24 Oct 2019
Edited: Matt J on 24 Oct 2019
That is not the way to take advantage of the GPU. Instead of trying to parallelize loop iterations, you get the benefit of the GPU by turning your large matrices and arrays into gpuArrays and doing the same kinds of manipulations with them as you would with normal MATLAB matrices on the CPU, except that on the GPU these operations will be faster.
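For example, a minimal sketch of that approach (the sizes and operations here are only illustrative, and assume the Parallel Computing Toolbox is available):

% Move the data to the GPU once, operate on it with ordinary MATLAB syntax,
% and copy the result back only when it is needed on the CPU.
A = gpuArray.rand(4096, 'single');   % 4096-by-4096 random matrix stored on the GPU
B = gpuArray.rand(4096, 'single');

C = A * B + 2*A;                     % matrix multiply and add, executed on the GPU
s = sum(C, 2);                       % row sums, also computed on the GPU

result = gather(s);                  % bring the result back to host memory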
Walter Roberson on 25 Oct 2019
Yes, if a GPU has been given enough work it will use all of the cores. If the arrays are smaller it might not need all of the cores.


More Answers (1)

Walter Roberson on 31 Jan 2020
Edited: Walter Roberson on 1 Feb 2020
arrayfun for gpuArray works by generating bytecode for kernels and sending it to the GPU along with the data. It holds off collecting the results until you gather(), so that it does not need to send or retrieve data for arrays that are already on the GPU when more processing is requested.
There are a limited number of functions that arrayfun knows how to generate efficient kernels for.
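As a rough sketch of that element-wise style (the function handle and array sizes here are made up for illustration):

% arrayfun compiles the body of the function handle into a single GPU kernel
% and applies it element-wise to the gpuArray inputs.
x = gpuArray.linspace(0, 2*pi, 1e6);
y = gpuArray.rand(1, 1e6);

f = @(a, b) sin(a).^2 + exp(-b);     % elementary operations that arrayfun supports on the GPU
z = arrayfun(f, x, y);               % runs as one fused kernel on the device

out = gather(z);                     % retrieve the result when it is needed on the CPU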
If I understand correctly (and I could be wrong), this operates like a client/server architecture with caching: the MATLAB process sends individual requests and the GPU acts on them. As far as I understand (and I could definitely be wrong), there is little combining of kernels, but some typical patterns are detected and combined into a single kernel, similar to the way that, for large numeric arrays, MATLAB can detect common patterns of computation and turn them into BLAS or LAPACK calls (for example, A*x+B can be done more efficiently than calculating all of A*x and then adding the matrix B; modern CPUs often have a "Fused Multiply-Add" hardware instruction). But if I understand correctly, the current implementation does not "hold on to" lists of processing requests for the GPU and then analyze the complete list to figure out the most efficient GPU kernel for the combined operation. ("tall" arrays on the CPU do have that property of optimizing arbitrary combinations of calls.)
The step after that, to generate efficient kernels for combinations of instructions, is to use the GPU Coder product https://www.mathworks.com/products/gpu-coder.html . That product supports a subset of MATLAB language features and has a library of about 139 supported calls, mostly for computer vision and deep learning at this time. The product transforms the MATLAB code into a C-like language that is then compiled by a specialized optimizing compiler, which tracks data lifetimes and order dependencies to generate an efficient low-level program for the entire task. This product does not give any explicit mechanism for splitting up the computation -- there is no "gpuparfor" for specifying complete threads.
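A minimal sketch of that workflow (myGpuFun here is a hypothetical function file containing only constructs GPU Coder supports):

% Hypothetical example: compile myGpuFun.m into a CUDA-accelerated MEX function.
cfg = coder.gpuConfig('mex');                        % target a GPU MEX build
codegen -config cfg myGpuFun -args {ones(1024, 1024, 'single')}

% The generated myGpuFun_mex is then called like the original MATLAB function:
y = myGpuFun_mex(ones(1024, 1024, 'single'));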
When this is not enough, you can use mexcuda https://www.mathworks.com/help/parallel-computing/mexcuda.html to compile a MATLAB-compatible C interface to the CUDA libraries, including linking against CUDA kernels that you have programmed with Nvidia's tools in .cu files. You can, if you want, get down to the level of telling each control unit on the GPU what to do independently, but you would not be doing so in MATLAB code.
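A sketch of how that fits together (myKernelWrapper.cu is a hypothetical CUDA MEX source file, i.e. your own kernel plus a mexFunction entry point written with Nvidia's tools):

% Compile and link the CUDA MEX source against the CUDA libraries.
mexcuda myKernelWrapper.cu

% The result is an ordinary MEX function callable from MATLAB, here fed a gpuArray:
d = gpuArray.rand(1, 1e6);
out = myKernelWrapper(d);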
