Is it possible to improve the performance of vectorized code?

Hello,
I rewrote a calculation script with nested loops in vectorized code, which greatly improved computation time. Now I have done several test runs of different size mxn and am a bit surprised by the results:
for m = 16 and n = 68 the improvement is about 40 ... 45%
for m = 160 and n = 680 the improvement is about 45 ... 55%
and for m = 1600 and n = 6800 the improvement is "only" about 30%.
I would have expected the improvement from vectorization to be rather small for small problems and to increase with increasing problem size, or at least to level off. This behavior is unclear to me, and I would like to understand it so that I can perhaps improve it.
The calculation itself uses 8 matrices/vectors that are traversed and combined in different ways, using the basic arithmetic operations and the sin/cos functions. In addition, the quantities m and n play a role, as do the matrices generated by ndgrid. The result is a vector of size m×1. To obtain it, the MATLAB functions diag and sum are also used.
I have used the following techniques to speed up the calculation:
  • I used ndgrid instead of the nested loops
  • All nested-loop calculations were vectorized
  • Terms previously computed twice are now calculated only once and reused
  • Instead of working in a script, I wrote a function that executes the code and returns the result.
My question on this topic is divided into two parts - a more theoretical and a practical one:
1.) How is it that the improvement in calculation speed - after it first increases - decreases again with increasing problem size? I would like to understand, technically, what causes this.
2.) Are there any other or additional techniques that can be used, assuming the applied techniques are implemented correctly? Which could those be? Are the MATLAB functions diag and sum well suited for this, or are there better alternatives?
Thank you for your efforts and kind regards
Stephan

1 Comment

It's hard to tell whether there are better ways than diag and sum if you do not share your code.
The decreasing performance gain probably has to do with memory access, but I'm no expert there.


 Accepted Answer

I looked quickly at the code you were playing with in the original post.
The biggest problem was arguably that no arrays were preallocated. ALWAYS preallocate arrays that will be grown in size dynamically. This is perhaps the biggest reason for code (written by novices) running slowly.
Failing to preallocate an array that is then grown iteratively causes a quadratic time penalty. As the array size grows, the time required to dynamically re-allocate memory and then copy the entire array contents grows quadratically.
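To make the quadratic penalty concrete, here is a minimal sketch (the size N and the sin computation are illustrative only; absolute timings will vary by machine and release) comparing a dynamically grown array with a preallocated one:

```matlab
N = 1e5;

% Grown dynamically: MATLAB must re-allocate and copy on many iterations
tic
v = [];
for k = 1:N
    v(k) = sin(k);   % array grows by one element each time
end
toc

% Preallocated: memory is claimed once, up front
tic
w = zeros(1,N);
for k = 1:N
    w(k) = sin(k);   % same work, but no re-allocation
end
toc
```

On a typical machine the grown version is dramatically slower, and the gap widens as N increases.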
Of course, if you never grow any arrays or vectors, then there is no need to preallocate. That is often a consequence of "vectorization". But there are many ways to vectorize code. No single way exists. Often, vectorization just means re-thinking the code flow, re-thinking the basic algorithm. Very often vectorization means you need to trade off memory for external explicit loops. But the use of great gobs of memory can be expensive. And creating those large arrays, then working with them requires internal, implicitly generated loops. You will always have loops. But implicitly generated loops in compiled code tend to be much faster than explicit MATLAB loops.
Where possible in vectorized code, use either bsxfun or implicit scalar expansion. They can reduce the memory load.
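As a sketch of that memory saving (the sizes m and n are made up for illustration), compare building full grids with ndgrid against letting bsxfun or implicit expansion (R2016b and later) combine the vectors directly:

```matlab
m = 1600; n = 6800;
a = rand(m,1);
b = rand(1,n);

% ndgrid materializes TWO full m-by-n matrices before any arithmetic
[A,B] = ndgrid(a,b);
C1 = sin(A).*cos(B);

% implicit scalar expansion forms only the m-by-n result
C2 = sin(a).*cos(b);

% bsxfun does the same job on older releases
C3 = bsxfun(@times, sin(a), cos(b));
```

All three produce the same result, but the last two avoid storing the intermediate grids A and B.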
You point out that the vectorization gain is not always uniform, or even consistent. That sometimes can be explained by too heavy a use of memory. If your temporary arrays grow so large that MATLAB has trouble finding the room, your computer will start swapping, using slow disk-backed memory in place of fast RAM.
Next, don't forget that different parts of your algorithm will have different time scalings as the problem size grows. So if one piece of the code is O(n^3) in time complexity, another is O(n), and another O(exp(n)), then the O(exp(n)) part will begin to dominate the computation as the problem grows in size, even if the constant out front was very small. So small computations may be dominated by the O(n) piece, but eventually, that O(exp(n)) term will grow until the problem becomes computationally impossible to solve.
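A toy way to see different scalings in one piece of code (the operations chosen here are just placeholders): time an elementwise pass, which is O(n^2) work on an n-by-n matrix, against a dense matrix multiply, which is O(n^3), as n grows:

```matlab
for n = [100 400 1600]
    A = rand(n);
    t1 = timeit(@() sin(A));   % elementwise: one pass over n^2 elements
    t2 = timeit(@() A*A);      % dense multiply: roughly 2*n^3 flops
    fprintf('n = %4d   elementwise %.4g s   multiply %.4g s\n', n, t1, t2);
end
```

As n grows, the cubic term eventually dominates, no matter how small its constant factor is.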
Even vectorized code can be terribly slow, if the vectorization is terribly done. For example, suppose you want to scale every column of a matrix by a vector of constants.
N = 10000;
A = rand(N);
K = 1:N;
So I want to multiply the j'th column by K(j).
You might use a double loop, effectively writing code as if it were some low-level language. At least loops are pretty well optimized in MATLAB these days, but this will still be slow as hell.
tic
B = A; % preallocate B!
for I = 1:N
    for J = 1:N
        B(I,J) = A(I,J)*K(J);
    end
end
toc
Elapsed time is 4.396439 seconds.
So not very fast. Had I not preallocated B, I would still be here days later waiting though.
We could do this using a matrix multiply, with a diagonal matrix. No need to preallocate, but it is way slower!
tic
B = A*diag(K);
toc
Elapsed time is 48.600078 seconds.
That is because a matrix*matrix multiply is an O(N^3) operation, but we only needed to do N^2 multiplies! Most of those multiplies were multiplies by zero, followed by adding up a lot of mostly zeros.
Making K into a sparse diagonal matrix makes that part much faster.
tic
B = A*spdiags(K(:),0,N,N);
toc
Elapsed time is 0.641262 seconds.
But even sparse matrix multiplies are not perfect. A simple repmat is better here.
tic
B = A.*repmat(K,N,1);
toc
Elapsed time is 0.501364 seconds.
Of course, this is why bsxfun was invented, to replace those repmats.
tic
B = bsxfun(@times,A,K);
toc
Elapsed time is 0.289946 seconds.
If you have a new release of MATLAB, then implicit array expansion helps, because bsxfun brings some overhead.
tic
B = A.*K;
toc
Elapsed time is 0.207587 seconds.
And, of course, had N been larger or smaller, then all of these basic variations might change their times, some getting relatively faster or slower.
In the end, there is no single best way to vectorize ANY block of code. The best way will depend on the problem size. It will sometimes depend on the data itself. It will depend on your skill at writing code in MATLAB, and how well you understand how MATLAB uses memory. It will depend on your knowledge of what functions are available to you. (For example, have you blinked, and missed that some new capability was recently introduced into MATLAB?)

5 Comments

Hello John,
Thank you for your very good and detailed answer to my question. This helped a lot to understand the connections a little better.
To replace the nested loop, I also found the bsxfun function while searching the forum here, but did not manage to integrate it into the code, and then succeeded with ndgrid. I will try this approach again.
In an effort to optimize the code, I also found that with several transpositions, I was able to ensure that the intermediate results were not matrices, but a vector, as expected in the end.
However, I was afraid that these multiple operations could become more expensive than the diag function at the end. Your explanation of matrix multiplication, with its O(N^3) cost, may cast this in a different light.
Can you say something about the cost of the transpose B = A'? Could this approach be promising in addition to using bsxfun?
Regarding the preallocation you mentioned: I did provide it, but the Code Analyzer marked it as "not recommended" for the vectorized part of the code.
For the original code with the nested loops, the Code Analyzer flagged exactly the missing preallocation you addressed.
Thank you for the detailed answer, and I wish you a nice evening.
Stephan
Transpose requires MATLAB to move elements in memory. So be careful how you use it.
N = 10000;
A = rand(N);
timeit(@() A')
ans =
0.90569
timeit(@() A.')
ans =
0.88259
So it looks like the non-conjugate transpose may be slightly faster. When applied to real arrays, that makes some sense.
Let's see what happens with a simple sum.
timeit(@() sum(A,1))
ans =
0.096468
timeit(@() sum(A,2))
ans =
0.07991
So it is slightly faster to sum across rows than down columns. But if we wanted to sum down columns, it would not pay to transpose the array just to sum across rows.
timeit(@() sum(A',2))
ans =
0.95092
But is there a faster way to sum down columns? Yes, surprisingly, there does seem to be, from a non-obvious source: dot products, thus a matrix-vector multiply.
timeit(@() ones(1,N)*A)
ans =
0.064741
The logic here is that * uses the BLAS, which are fast as blazes. But sum is just compiled code, so apparently not quite as heavily optimized as a multiply.
So when you want to optimize code, knowing the tricks and what works best is all essential. But really, testing a few things out is also hugely valuable.
Thank you.
Best regards
Stephan
Hi John,
I just tried what happens to the performance when using bsxfun instead of ndgrid on the big problem with m = 1600 and n = 6800.
time ndgrid:
Elapsed time is 0.238491 seconds.
time bsxfun:
Elapsed time is 0.069867 seconds.
This step improved the overall performance by about 8...10% for the big problem. I am thrilled and thank you for your support!
Best regards
Stephan
Hello @John D'Errico, since you explain the reasoning behind vectorization so clearly, I would also like to know how we can use vectorization efficiently for this kind of problem:
clc
clear
close all
n=1000;
C1=zeros(n,n);
C2=zeros(n,n);
A=rand(n,n);
B=rand(n,n);
tic
for i = 2:n-1
    for j = 2:n-1
        C1(i,j) = (A(i,j)*B(i,j-1) + A(i-1,j)*B(i+1,j-1))/(A(i,j+1)*B(i+1,j));
    end
end
toc
Elapsed time is 0.068575 seconds.
%VECTORIZATION
tic
C2(2:n-1,2:n-1)=((A(2:n-1,2:n-1).*B(2:n-1,1:n-2) + A(1:n-2,2:n-1).*B(3:n,1:n-2)))./(A(2:n-1,3:n).*B(3:n,2:n-1));
toc;
Elapsed time is 0.023579 seconds.
norm(C1-C2)
ans = 0
This is a very basic example; it shows an improvement after vectorization, but not a large one. If I add more divisions and multiplications to the same expression, the vectorized version can even become slower than the for loop. It would be very helpful if you could provide some suggestions on how I should improve my code.

