# help with parallelization of matrix operations

Daniel Ackerberg on 23 Sep 2023
Edited: James Tursa on 23 Sep 2023
Hi. I'm trying to fully utilize a multiprocessor machine, and I'm running into a problem: some matrix multiplications seem to be parallelized, while others do not. The code below is a simple illustration. Calculating x and calculating y each require 1 billion multiplications. The element-by-element multiplication of two 1-billion-by-1 vectors (x = a.*b) is highly parallelized (I can see all CPUs being used), but the outer product of a 100-million-by-1 vector with a 1-by-10 vector (y = c*d) does not appear to be parallelized in MATLAB, and it takes about 4 times as long.
Since both y = c*d and x = a.*b perform 1 billion multiplications, it seems there should be a way to get the y = c*d operation done in parallel and at least as quickly as x = a.*b (my actual problem has the form y = c*d). kron(c,d') (also 1 billion multiplications) does a little better and does seem to parallelize, but it is still not as fast as a.*b. Thanks in advance for any help.
a = rand(1000000000,1);
b = rand(1000000000,1);
c = rand(100000000,1);
d = rand(1,10);
% element-wise multiply: appears highly parallelized
for t = 1:10
    tic
    x = a.*b;
    toc
end
% outer product: does not appear to parallelize
for t = 1:10
    tic
    y = c*d;
    toc
end
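One workaround that may be worth trying (an assumption on my part, not something tested on your machine): for a column vector times a row vector, the matrix product c*d is mathematically identical to the element-wise multiply c.*d under implicit expansion (available since R2016b), so the implicit-expansion form may go through the same multi-threaded element-wise path as a.*b rather than the BLAS path. A minimal sketch:

```matlab
% Sketch: compute the same outer product two ways and compare timings.
% Note: each 100000000-by-10 double result needs roughly 8 GB of memory.
c = rand(100000000,1);
d = rand(1,10);

tic
y1 = c*d;     % outer product via matrix multiply
toc

tic
y2 = c.*d;    % same values via element-wise multiply + implicit expansion
toc

% both formulas compute c(i)*d(j) for every (i,j)
assert(max(abs(y1(:) - y2(:))) < 1e-12);
```

On versions older than R2016b, bsxfun(@times, c, d) expresses the same element-wise expansion.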

James Tursa on 23 Sep 2023
Edited: James Tursa on 23 Sep 2023
Both of these operations can be multi-threaded in the background, but this kind of timing test is not straightforward, because the large amounts of memory involved mean CPU caching heavily comes into play. Also, I don't know what rules the MATLAB BLAS library functions use to decide when and how to multi-thread based on input sizes (or even whether a BLAS function is called for the outer product at all). The element-wise multiply is about the most cache-efficient operation you can get: linear memory access, with each element touched only once. And if I make the outer product "square" (10000-by-1 times 1-by-10000, still 10^8 multiplications), the timings become somewhat comparable (tests slightly altered to make sure no left-over cache contents get reused in the next iteration):
ti = zeros(10,1);
for t = 1:10
    a = rand(100000000,1);
    b = rand(100000000,1);
    tic
    x = a.*b;
    ti(t) = toc;
end
mean(ti)
ans = 0.0899
to = zeros(10,1);
for t = 1:10
    c = rand(10000,1);
    d = rand(1,10000);
    tic
    y = c*d;
    to(t) = toc;
end
mean(to)
ans = 0.1695
So, the outer product takes about twice as long as the element-wise multiply, longer than I would have expected, but the fact that it takes longer at all is to be expected (at least by me) because of the cache issue. I would guess some memory gets pulled into the cache more than once. I haven't looked at core usage, but I would be surprised if all cores were not used in both cases. The large sizes would certainly justify using all cores in the background.
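One rough way to check whether the outer product is multi-threaded at all (a quick experiment, not a definitive measurement): restrict MATLAB to a single computational thread with maxNumCompThreads and compare timings. If the single-thread time is much larger, the operation was using multiple cores.

```matlab
% Rough experiment: time the outer product with 1 thread vs. all threads.
c = rand(10000,1);
d = rand(1,10000);

nOld = maxNumCompThreads(1);    % restrict to one thread; returns old count
tic; y = c*d; t1 = toc;

maxNumCompThreads(nOld);        % restore the previous thread count
tic; y = c*d; tN = toc;

fprintf('1 thread: %.4f s, %d threads: %.4f s\n', t1, nOld, tN);
```

The same comparison can be run on x = a.*b to see whether both operations scale similarly with thread count.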