GPU and CPU Parallelization and Bicg Optimization

I use a MATLAB script to solve a large sparse linear system with the bicg function. In simplified form, my code looks like this:
for i=1:n
...
[Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
% AS is a 10^6x10^6 sparse complex double
% BS is a 10^6x1 sparse complex double
% L and U are 10^6x10^6 sparse complex double
...
end
Every loop iteration is independent. I recently parallelized this script with parfor. The computer I use has 128 CPU cores, but I noticed that with parpool set to anything more than 32 workers, the extra local workers stop helping (i.e., the run time does not decrease significantly). However, I usually use n=32 (i.e., I run the loop for 32 different scenarios), so this is not a big issue for me. The code currently looks like this:
parpool(32)
parfor i=1:n
...
[Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
% AS is a 10^6x10^6 sparse complex double
% BS is a 10^6x1 sparse complex double
% L and U are 10^6x10^6 sparse complex double
...
end
I want to further speed up the code using gpuArray (which bicg supports). The main reason is that I also use another script in which I call the bicg function sequentially many times. In that case n is 1, but the many sequential calls make it computationally expensive. If possible, I also want to use gpuArray for cases where n is 32 or more (i.e., the code described above).
I checked the documentation and other user questions, but I am a little lost on how to use CPU and GPU power concurrently. The computer I use has 3 GPUs that I can utilize.
- Should I try to use only the GPUs for both the parfor loop and the bicg solve?
- Or should I run the parfor loop on the CPU and use all the GPUs for the bicg solve? If so, how can I do this? As far as I understand, GPU resources will be distributed among the workers in this case.
- Or what would you suggest for doing this properly? Thank you very much in advance for any guidance!
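To make the second option concrete, here is a minimal sketch of one possible arrangement, not a definitive recipe: open a pool with one worker per GPU and move each scenario's system to the GPU before calling bicg. Everything about the test system (N, nScenarios, the random matrices) is a hypothetical stand-in for the real 10^6x10^6 problem. The sketch assumes that sparse complex gpuArray input is supported in the installed release, and it deliberately omits the L and U preconditioners, because the bicg documentation should be checked for restrictions on preconditioners when the input is a sparse gpuArray. If no supported GPU is found, it falls back to the CPU.

```matlab
% Hypothetical small test setup standing in for the real problem.
N = 2000;            % stands in for 10^6
nScenarios = 3;      % stands in for n
tol = 1e-8;
maxit = 200;

useGPU = canUseGPU();    % false when no supported GPU (or toolbox) is present

% Assumption: with one worker per GPU, each pool worker selects its own device.
% An existing pool is left untouched; its workers would then share the GPUs.
if useGPU && isempty(gcp("nocreate"))
    parpool(gpuDeviceCount);
end

flags  = zeros(nScenarios, 1);
relres = zeros(nScenarios, 1);
sol    = cell(nScenarios, 1);

parfor i = 1:nScenarios
    % Build a different complex sparse system for each scenario.
    rng(i);
    AS = sprandn(N, N, 5/N) + 1i*sprandn(N, N, 5/N) + (10 + i)*speye(N);
    BS = AS * ones(N, 1);

    if useGPU
        % Copy this scenario's data to the GPU used by the current worker.
        AS = gpuArray(AS);
        BS = gpuArray(BS);
    end

    % No preconditioner here; see the note above about L and U on the GPU.
    [x, f, r] = bicg(AS, BS, tol, maxit);

    % gather returns ordinary arrays whether or not the GPU was used.
    sol{i}    = gather(x);
    flags(i)  = gather(f);
    relres(i) = gather(r);
end
```

With three GPUs this runs at most three scenarios at a time, so whether it beats 32 CPU workers for n=32 depends on how much faster a single GPU solve is than a single CPU solve. Timing both variants on one representative scenario would be the first thing to measure.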
The computer that I use is the following (I may also be able to use 2 of these computers/nodes in the future. Do you think that would help with any of the scenarios described above?):
GPU: 3x NVIDIA A100 PCIE 40GB
(1 per socket)
gpu0: socket 0
gpu1: socket 1
gpu2: socket 1
GPU Memory: 40 GB HBM2
CPU: 2x AMD EPYC 7763 64-Core Processor ("Milan")
Total cores per node: 128 cores on two sockets (64 cores / socket )
Hardware threads per core: 1 per core
Hardware threads per node: 128 x 1 = 128
Clock rate: 2.45 GHz
RAM: 256 GB
Cache: 32KB L1 data cache per core
512KB L2 per core
32 MB L3 per core complex
(1 core complex contains 8 cores)
256 MB L3 total (8 core complexes )
Each socket can cache up to 288 MB
(sum of L2 and L3 capacity)
Local storage: 144GB /tmp partition on a 288GB SSD.

Accepted Answer

Alvaro on 26 Jan 2023

