Optimize/speed up a big and slow matrix operation with addition and bsxfun?

Question

PetterS on 19 Apr 2015

0
Link

Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/210352-optimize-speed-up-a-big-and-slow-matrix-operation-with-addition-and-bsxfun

Commented: schrodinbug on 10 Jul 2019

Hi. I have a fairly long line of code in my script that takes about 7 seconds to run, it was originally three separate calculations but are now done in one line that looks like this:

X = bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
    +bsxfun(@times,reshape(E,1440,1,2251),D)...
    +bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));

Since I need to run it several tens of thousands times it is causing a fair amount of anxiety in my life at the moment. I don’t know how I would go about speeding it up further or if it is even possible. So I would really appreciate it if any of you optimization gurus here could give me some advice on how and if I can go about making this faster.

The input variables are of size:

A = 2251x1
B = 1440x2251
C = 1x181
D = 1440x181
E = 1440x2251
F = 1440x2251

Thanks!

2 Comments
Show NoneHide None

James Tursa on 20 Apr 2015

Do you have a C compiler available for compiling mex functions?

James Tursa on 20 Apr 2015

I assume A is really 1 x 2251 (otherwise bsxfun(@times,A,B) wouldn't work as written)

Sign in to comment.

Sign in to answer this question.

Answer 1

James Tursa on 20 Apr 2015

0
Link

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/210352-optimize-speed-up-a-big-and-slow-matrix-operation-with-addition-and-bsxfun#answer_175968

Edited: James Tursa on 20 Apr 2015

Open in MATLAB Online

Assuming you have a C compiler available, here is a simple C mex routine to do the calculation. Could possibly be sped up even more with multi-threading (e.g., OpenMP) as well but I did not include that code below. Let me know if you have an OpenMP compilant compiler since that addition would be fairly simple.

As is (with no C-mex multi-threading) here is an example run on my machine:

Elapsed time is 6.655025 seconds. % M-code with bsxfun
Elapsed time is 1.700077 seconds. % C-mex code (not multi-threaded)

CAUTION: Code below is bare bones with no argument checking.

// X =   bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
//     + bsxfun(@times,reshape(E,1440,1,2251),D)...
//     + bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));
// 
// The input variables are of size:
// 
// A =    1 x       2251
// B = 1440 x       2251
// C =    1 x 181
// D = 1440 x 181
// E = 1440 x       2251
// F = 1440 x       2251
//
// Output:
//
// X = 1440 x 181 x 2251
//
// Calling sequence:
//
// X = this_file_name(A,B,C,D,E,F);
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, j, k, m, n, p, km, ikm, jm;
    mwSize dims[3];
    double *A, *B, *C, *D, *E, *F, *X, *Cm, *Cp;
    double a, cp, cm;
      m = mxGetM(prhs[3]);
      n = mxGetN(prhs[3]);
      p = mxGetN(prhs[4]);
      A = mxGetPr(prhs[0]);
      B = mxGetPr(prhs[1]);
      C = mxGetPr(prhs[2]);
      D = mxGetPr(prhs[3]);
      E = mxGetPr(prhs[4]);
      F = mxGetPr(prhs[5]);
      dims[0] = m;
      dims[1] = n;
      dims[2] = p;
      plhs[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);
      X = mxGetPr(plhs[0]);
      Cm = (double *) mxMalloc(n*sizeof(*Cm));
      Cp = (double *) mxMalloc(n*sizeof(*Cp));
      for( j=0; j<n; j++ ) {
          Cm[j] = (1.0 - C[j]) / 2.0;
          Cp[j] = (1.0 + C[j]) / 2.0;
      }
      for( k=0; k<p; k++ ) {
          a = A[k];
          km = k * m;
          for( j=0; j<n; j++ ) {
              jm = j * m;
              cm = Cm[j] * a;
              cp = Cp[j];
              for( i=0; i<m; i++ ) {
                  ikm = i+km;
                  *X++ = B[ikm] * cm + E[ikm] * D[i+jm] + F[ikm] * cp;
              }
          }
      }
      mxFree(Cp);
      mxFree(Cm);
}

6 Comments
Show 4 older commentsHide 4 older comments

PetterS on 20 Apr 2015

Amazing, it worked and it runs in 1.8 seconds! I will definitely be using it and so far It hasn’t complained about inputs in any way. I have two small questions though:

• If I compare the output of this with the output produced by the original, slower way, with isequal(X,X_old) I get 0 as if they were not identical. But when I plot parts of the data to compare, they look the same as far as I can tell and the matrices have the same mean values. So is the inequality simply created by tiny differences in the ways the two code languages calculate it and for all intents and purposes they are the same or is this something I should be worried about?

• Does the same concept of single-point numbers equals roughly twice the speed apply here so that I theoretically could go down below one second? I tried just feeding the inputs to the mex file as singles instead of doubles but that crashed matlab again. I see the term “double” in a few places in your mex code, would it simply be a matter of replacing that with “single” if I wanted to try that?

James Tursa on 21 Apr 2015

(1) If you plan on using this long term, I will add code to check the arguments so it won't crash if something is not right.

(2) Part of the speed-up is consolidating the arithmetic so the same operations aren't done multiple times. This involves pulling some operations out of the inner loop into the outer loops, which results in some of the operations being done in a different order. This is nothing to worry about, and you will find that the results are on the order of eps(result) in difference. If you are concerned about this, I can post a version that does the computations in exactly the same order, giving exactly the same results. It will not run as fast because it will not take advantage of what I mentioned earlier, but it should still run faster than your m-code because it does not need to make large intermediate arrays.

(3) If you want to try single vs double, then make these changes:

Change all instances of "double" to "float"

Change the mxDOUBLE_CLASS to mxSINGLE_CLASS

Change all of the mxGetPr(etc) to (float *)mxGetData(etc)

Change 1.0 to 1.0f and 2.0 to 2.0f

PetterS on 21 Apr 2015

Switching to singles brought it down to 1.3 seconds, so not really half but still, since I started with 7 seconds I’m really pleased with 1.3. I will run it continuously for around two weeks now and after that I don’t know if or when I will be using it again so you don’t have to put any effort in argument checking it.

Thank you so much for your help on this, it really saved me a lot of time!

James Tursa on 23 Apr 2015

Another potential speed-up I should mention is to replace mxCreateNumericArray with mxCreateUninitNumericArray.

Sign in to comment.

Answer 2

John D'Errico on 19 Apr 2015

1
Link

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/210352-optimize-speed-up-a-big-and-slow-matrix-operation-with-addition-and-bsxfun#answer_175827

Open in MATLAB Online

You are trying to do billions of multiplies here, resulting in an array that is of size

1440*181*2251
ans =
   586700640

So roughly 600 million elements. since they are double precision numbers, the array will require 8 bytes per element. So the final result will require nearly 5 gigabytes of RAM to store.

Oh, by the way, there are three parts to that computation, so your code must generate three such temporary arrays in that computation, that are then summed together. So the internal storage requirements might be as large as 15 gigabytes of RAM, depending on how that code is processed internally.

That it ran in 7 seconds is a testament to the speed of your system, not the slowness of the code.

If you want it to run faster, get more memory. A good idea would be to use a flash based hard drive to allow your CPU to access virtual memory much faster.

5 Comments
Show 3 older commentsHide 3 older comments

PetterS on 19 Apr 2015

Edited: PetterS on 19 Apr 2015

I have 16 GB of RAM which peaks at 95-98% when I run it, so I think I’m pretty fortunate with the size matching my memory limit and the computer seems to be able to hold it all in the RAM. But are you saying then that it’s not possible to speed up the ways in which the operations are performed? Is it for example not possible to perform the calculations in such a way that the three additions never happen and it’s all done with one temporary array? Or simply with a different approach than the bsxfun method I have now?

Update: And also since you mentioned flash based hard drives – I do have flash based hard drives, two of them connected in raid 0 even. Do I need to change Matlab’s settings somehow to specify which drives it creates virtual memory on or is this done thought the paging file set up in windows? And virtual memory is only used by Matlab once RAM is full right? So unless all RAM is used virtual memory shouldn’t influence runtime, or can it?

John D'Errico on 19 Apr 2015

Open in MATLAB Online

I see no commonality in those terms to compress things down into something more efficient.

So first I'd watch your disk to see if it is swapping in that computation. My guess is it is being used if you are that close to the edge. That will cause a serious time loss. That you are seeing all your memory being used tells me that MATLAB is doing as I expected.

If there is disk thrashing involved, the simple answer is to get more RAM, OR a faster disk. RAM is cheap. A terabyte sized flash drive is relatively cheap too. Both are VERY cheap compared to YOUR time. People forget these things too often.

Of course, you could also write MEX code to do this, avoiding the creation of those large internal temporary arrays.

Perhaps another option, IF you can tolerate some precision loss in the result, is to convert the arrays to single precision. That will cut you RAM needs in half here, making it run MUCH faster. In fact, this will be a good test of my recommendation, since that computation will not need any swapping done.

Therefore, IF the computation runs much more quickly using singles versus doubles, then it says that the problem is insufficient RAM, NOT the number of computations that needed to be done. A single precision multiply or add seems to take about half the time as the same operation using doubles. But more importantly, if swapping can be avoided, then it will take MUCH less time.

A = single(rand(100,1,100));
B = single(rand(1,200,100));
timeit(@() bsxfun(@times,A,B))
ans =
    0.0024701
Ad = double(A);
Bd = double(B);
timeit(@() bsxfun(@times,Ad,Bd))
ans =
    0.0048494

So try converting your arrays to single precision. If the overall run time is cut by 50%, then you know that swapping is not a factor. But even then, you have a way to cut your time by roughly 50%.

If the run time is much less than 50% for the single computation, and you cannot afford the loss of precision, then you know that you really do need more memory and/or a faster drive.

PetterS on 19 Apr 2015

Ok, I’ve tried that now;

Time with doubles = 6.576590 sec

Time with singles = 3.378048 sec

So almost exactly half! If I understand your reasoning correctly this should then mean that the RAM was sufficient when the calculation was done with doubles? When I run the line with singles my RAM usage is only around 50% so clearly that can’t be a problem anymore (if it ever was). Maybe this memory usage is now low enough for me to make a parfor loop and run two of these simultaneously.

But I’ve never calculated things with single-point numbers before, I read a bit about them on Wikipedia but it seemed too technical for me to understand exactly what it meant. Is it possible to estimate in layman’s terms how much I’m reducing the accuracy by converting to single numbers? For example if I have an integer followed by a long line of decimal numbers, at approximately which decimal number am I rounding my number to by making it into single? (If that’s even how it works).

And thank you for enabling me to cut the runtime in half, that could be a game changer!

John D'Errico on 20 Apr 2015

Open in MATLAB Online

The funny thing is, much computation in the old days was done using single precision. Double precision is more robust though. You can get away with things with doubles that might hurt you with single.

When you start swapping to disk, things suddenly get much slower. So my argument is that IF the time drops by 50%, then you were not doing any serious amount of swapping with the doubles.

Eps tells you about the potential error in those numbers.

eps('double')
ans =
      2.22044604925031e-16
eps('single')
ans =
    1.192093e-07

Eps is the smallest number you can add to 1, and get a different result. So think of eps as the granularity for that precision, the least significant bit.

Much will depend on what you are doing with those numbers. Is there noise in them anyway, so that trash down around 1 part in 1e7 won't bother you? What are you doing with this result afterwards? On some computations that noise won't matter a bit.

schrodinbug on 10 Jul 2019

if you're creating a mex function would using a library like Eigen provide value?

Sign in to comment.

Optimize/speed up a big and slow matrix operation with addition and bsxfun?

2 Comments
Show NoneHide None

Accepted Answer

6 Comments
Show 4 older commentsHide 4 older comments

More Answers (1)

5 Comments
Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Community Treasure Hunt

Optimize/speed up a big and slow matrix operation with addition and bsxfun?

2 Comments Show NoneHide None

Accepted Answer

6 Comments Show 4 older commentsHide 4 older comments

More Answers (1)

5 Comments Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Community Treasure Hunt

2 Comments
Show NoneHide None

6 Comments
Show 4 older commentsHide 4 older comments

5 Comments
Show 3 older commentsHide 3 older comments