# Fastest way to dot product all columns of two matrices of the same size

Dan Ryan on 20 Feb 2012
Edited: Jan on 11 Mar 2014
I have come up against this problem over and over, and I have a solution that works, but it seems suboptimal speed-wise. Does anybody have a better way?
Problem: I will typically have two matrices of the same size (A and B) with a small number of rows (m ~ 5-10) and a large number of columns (n ~ 1000) (think of each row as a "state" and each column as the state values at a point in time). Is there an efficient way to take the corresponding columns of each matrix and dot product them, to end up with a row vector of length n?
My current solution is:
C = sum(A.*B, 1);
The reason I don't like this is that for single column vectors
sum(v1.*v2, 1)
is about twice as slow as the regular dot product, so I am thinking there is a better way to do this for the matrix problem (I care because this is typically executed hundreds of thousands of times in my code).
Btw, looping over the columns and taking dot products is reallllllly slow, so don't bother trying anything with a for loop...

Jan on 20 Feb 2012
Edited: Jan on 11 Mar 2014
A direct C-MEX function can avoid the creation of the large temporary array A.*B and collect the sum immediately:
// Compile:
// DOT as C-loop: mex -O SumAB1.c libmwblas.lib -DUSE_C_LOOP
// DOT as BLAS call: mex -O SumAB1.c libmwblas.lib
//
// Tested: Matlab 7.8, 7.13, Win7/64, Compiler: MSVC2008
// Author: Jan Simon, Heidelberg, (C) 2012 matlab.2010ATnMINUSsimonDOTde
#include "mex.h"
#ifndef USE_C_LOOP
double ddot(mwSignedIndex *k, double *a, mwSignedIndex *ai,
            double *b, mwSignedIndex *bi);
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N);
#else
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N);
#endif
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  mwSize M, N;
  const mxArray *A, *B;

  // Proper number of arguments:
  if (nrhs != 2) {
    mexErrMsgTxt("*** SumAB1[mex]: 2 inputs required.");
  }
  if (nlhs > 1) {
    mexErrMsgTxt("*** SumAB1[mex]: 1 output allowed.");
  }

  // Pointers to inputs:
  A = prhs[0];
  B = prhs[1];
  M = mxGetM(A);
  N = mxGetN(A);

  // Inputs must be real double matrices of the same size:
  if (!mxIsDouble(A) || mxIsComplex(A) || mxGetNumberOfDimensions(A) != 2 ||
      !mxIsDouble(B) || mxIsComplex(B) || mxGetNumberOfDimensions(B) != 2 ||
      mxGetM(B) != M || mxGetN(B) != N) {
    mexErrMsgTxt("*** SumAB1[mex]: "
                 "Inputs must be real double matrices of the same size.");
  }

  // Create output:
  plhs[0] = mxCreateDoubleMatrix(1, N, mxREAL);

#ifndef USE_C_LOOP
  CoreBlas(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]),
           (mwSignedIndex) M, (mwSignedIndex) N);
#else
  CoreLoop(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#endif
  return;
}
// *****************************************************************************
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N)
{
  // DOT-product as C-loop:
  mwSize i, j;
  double s;

  for (i = 0; i < N; i++) {
    s = 0.0;
    for (j = 0; j < M; j++) {
      s += *a++ * *b++;
    }
    c[i] = s;
  }
  return;
}
// *****************************************************************************
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N)
{
  // DOT-product as BLAS call, faster if M > 100:
  mwSignedIndex i, one = 1;

  for (i = 0; i < N; i++) {
    c[i] = ddot(&M, a + i * M, &one, b + i * M, &one);
  }
  return;
}
With MSVC 2008 I get about twice the speed of sum(A.*B, 1) in Matlab 2009a/64/Win7.
[EDITED] Inserted a compiler directive to perform the DOT using the faster BLAS DDOT call.
[EDITED 2] Sorry, DDOT is faster only for about M>100, not for M=10!
A hard-coded loop is 16% faster than the C-loop method above for [10 x 1000] matrices:
void CoreLoop_10(double *a, double *b, double *c, mwSize N)
{
  // DOT-product as hard-coded C-loop for !!! M == 10 !!!:
  mwSize i;
  double s;

  for (i = 0; i < N; i++) {
    s  = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3] + a[4] * b[4];
    s += a[5] * b[5] + a[6] * b[6] + a[7] * b[7] + a[8] * b[8] + a[9] * b[9];
    c[i] = s;
    a += 10;
    b += 10;
  }
  return;
}
But then you have to branch in the main function on the different values of M. An SSE3 approach would be faster still, but then you have to take care of the 128-bit alignment.
Jan on 21 Feb 2012
Doh. I read the corresponding sentence in the question three times, because I've confused this too often in the forum already. Nevertheless, I had measured the timings with 1000x10 matrices instead of 10x1000.
Then the BLAS:DDOT method takes 30% more time than SUM(A.*B)!
New timings:
A = rand(10, 1000); B = rand(10, 1000);
tic; for i=1:10000; C2 = MyMexFunc(A, B); end; toc
Hard coded for M=10: 0.162 sec
The C-loop "s += *a++ * *b++": 0.196 sec
BLAS:DDOT: 0.444 sec
sum(A.*B): 0.344 sec

Jiro Doke on 20 Feb 2012
When you say you're doing the dot product, I assume you're doing this:
v1'*v2
and not
dot(v1, v2)
The equivalent for your scenario, where you have matrices, would be something like this:
diag(A'*B)'
But that adds a lot of overhead with extra calls to diag and transpose.
Have you tried calling sum without the dimension parameter?
C = sum(A.*B);
This is slightly faster than
C = sum(A.*B, 1);
Dan Ryan on 20 Feb 2012
Using diag(A'*B) would compute 1,000,000 values (instead of just the 1000 that I want), so it would be far from efficient.
Dropping the "1" from the sum command gives a 10% or so speed-up, which is fairly nice. Thanks for that one.