Does vectorization preserve the runtime scaling?

Mr M. on 10 May 2017
Edited: Jan on 19 May 2017
Suppose I have a calculation that requires several nested for loops, for example:
for n = 1:N; for j = 1:J; for m = 1:M*J;
In this case the runtime scales linearly with N and M, but grows quadratically with J. If we can somehow vectorize the calculation, vectorization can speed it up. But does the scaling remain the same?
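For concreteness, here is a minimal sketch of such a loop nest; the sizes and the loop body (a plain accumulation) are hypothetical placeholders, not from the original question:
% Total work is N * J * (M*J) = N*M*J^2 inner iterations:
% linear in N and M, quadratic in J.
N = 10; J = 20; M = 30;          % example sizes
acc = 0;
for n = 1:N
  for j = 1:J
    for m = 1:M*J
      acc = acc + 1;             % placeholder for the real calculation
    end
  end
end
% A vectorized equivalent still touches N*M*J^2 elements:
A = ones(N, J, M*J);
acc2 = sum(A(:));                % same operation count, fewer interpreter passes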

Accepted Answer

Jan on 11 May 2017
Edited: Jan on 11 May 2017
Even with vectorization, the runtime depends on the number of calculations. Sometimes vectorized code requires large temporary arrays, which do not fit into the processor cache or the RAM. Then the slower RAM or the even slower hard disk is used, which can degrade the speed massively. But as long as the same kind of memory can be used, the order of the execution time is the same for loops and vectorized code. Roughly.
But sometimes vectorized code is multi-threaded, e.g. in the sum command. Then, above a certain input size, the number of free cores matters as well. While summing 2e5 elements needs about twice as long as summing 1e5 elements, this does not hold true for 10e5 versus 5e5 elements.
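A quick way to check such scaling yourself (my sketch, not code from the thread; timeit is a standard MATLAB function since R2013b):
x1 = rand(1, 5e5);
x2 = rand(1, 10e5);              % twice as many elements
t1 = timeit(@() sum(x1));
t2 = timeit(@() sum(x2));
t2 / t1                          % close to 2 if the runtime scales linearly here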
  3 Comments
Walter Roberson on 19 May 2017
For sufficiently large arrays, prod() would be handed over to one of the fast mathematical libraries such as MKL or BLAS. Those libraries are highly optimized: they can take advantage of multiple cores and of advanced hardware instructions such as the various SIMD (Single Instruction, Multiple Data) extensions, which may allow higher hardware performance rates than might be obvious. Your exact primary and secondary cache sizes can turn out to matter, your exact CPU can matter, the exact array size can matter...
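As an aside (my addition, not from the thread): MATLAB can report which BLAS and LAPACK builds it links against; as far as I know the undocumented but long-standing calls are:
version('-blas')                 % BLAS build used by this MATLAB
version('-lapack')               % LAPACK build used by this MATLAB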
Jan on 19 May 2017
Edited: Jan on 19 May 2017
The code I posted before did not show anything relevant.
Neither BLAS nor MKL has a function for calculating the product of the elements of an array.
SUM and PROD use multiple cores when the input has more than 88999 elements:
x = rand(1, 88999);                                 % just below the threshold
tic; for i = 1:50000; v = sum(x);  clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
x = rand(1, 89000);                                 % at the threshold
tic; for i = 1:50000; v = sum(x);  clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
% Matlab 2016b, Core2Duo:
Elapsed time is 6.926055 seconds.    (sum,  88999 elements, 1 core)
Elapsed time is 10.872182 seconds.   (prod, 88999 elements, 1 core)
Elapsed time is 4.241430 seconds.    (sum,  89000 elements, multi-threaded)
Elapsed time is 6.319897 seconds.    (prod, 89000 elements, multi-threaded)
See also the Performance tab of the task manager: for the smaller problem only 1 core is used, but 2 (or all?) for the larger one. The code is not 50% faster (on my 2-core machine), but only 40%, because starting a thread is rather expensive. That SUM and PROD share the same limit does not look "highly optimized", because the break-even point would be reached earlier for the more expensive product. In contrast, the optimized BLAS libraries like ATLAS or MKL consider the cache sizes and the costs of starting threads.
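One way to confirm that the jump at 89000 elements is caused by threading (my sketch, not from the thread; maxNumCompThreads is a documented MATLAB function): pin MATLAB to a single computational thread and repeat the timing.
nOld = maxNumCompThreads(1);     % force single-threaded execution, keep old setting
x = rand(1, 89000);
tic; for i = 1:50000; v = sum(x); end, toc   % should now match the 88999-element timing
maxNumCompThreads(nOld);         % restore the previous thread count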
In older Matlab versions (R2009a) the multi-threading was implemented cheaply: the outputs of the threads were collected in the order in which the threads finished. Rounding errors could then cause different results for sum, depending on random factors.
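A small illustration of why the accumulation order matters (my example, not Jan's): summing the same data in a different order generally gives a slightly different floating-point result.
x = rand(1, 1e6);
s1 = sum(x);
s2 = sum(x(end:-1:1));           % same data, reversed order
s1 - s2                          % typically a small, nonzero difference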
By the way: Look at the timings under R2009a on the same machine:
Elapsed time is 6.628261 seconds.    (sum,  88999 elements)
Elapsed time is 10.488722 seconds.   (prod, 88999 elements)
Elapsed time is 9.719854 seconds.    (sum,  89000 elements)
Elapsed time is 11.743312 seconds.   (prod, 89000 elements)
As in R2016b, the 88999-element input causes 50% processor load and the 89000-element input 100%, but here the code was not faster; it was substantially slower. Scary.
