3 views (last 30 days)

Show older comments

My Matlab code has a section of code which repeatedly performs matrix multiplication and division (through the backslash operator =A\b). When testing large datasets, a certain function named solvebvp_colloc.m is called millions of times. As this function contains multiple matrix multiplications and divisions, it tends to take up a lot of runtime. I'm curious if I could rewrite solvebvp_collect.m (or the code before solvebvp_collect.m) in order to improve performance. I'm not sure if matrix decomposition will help or if I could remove one of the for loops referencing j=1:Nx completely.

My original implementation is shown below. The function solvebvp_colloc.m is harmful to the overall performance of my Matlab code.

% Test Data

N = 3;

M = 3;

% In real test scenarios, I would have larger N,M such as N=M=30

Nx = 32; % Ny=Nx=Nz

Nz = 32;

Ny = 32;

% My production data doesn't have rand(). I am only creating random

% matrices to test peformance.

Fnmhat = rand(Nx,Nz+1);

Jnmhat = rand(Nx,1);

xi_n_m_hat = rand(Nx,N+1,M+1);

Uhat = zeros(Nx,Nz+1);

Uhat_new = zeros(Nx,Nz+1);

identy = eye(Ny+1,Ny+1);

p = rand(Nx,1);

gammap = rand(Nx,1);

D = rand(Nx+1,Ny+1);

D2 = rand(Nx+1,Ny+1);

gamma = 1.5;

alpha = 0; % this could be non-zero

ntests = 150; % I'm creating a ntests variable to test running the code below multiple times. The

% ntests variable is not part of my production code (and is only used for testing performance).

tic

% Original Implementation

for ii=1:ntests

for n=0:N

for m=0:M

for j=1:Nx

Fnmhat_p = Fnmhat(j,:).';

alphaalpha = 1.0;

betabeta = 0.0; % this could be non-zero

gammagamma = gamma*gamma - p(j)^2 - 2*alpha*p(j);

d_min = 1.0;

n_min = 0.0; % this could be non-zero

r_min = xi_n_m_hat(j,n+1,m+1);

d_max = -1i*gammap(j);

n_max = 1.0;

r_max = Jnmhat(j);

% This subroutine is expensive for large N,M. The problem is that

% solvebvp_collec.m needs to run against three for loops from

% 1:N,1:M,1:Nx and can be called millions of times.

uhat_p = solvebvp_colloc(Fnmhat_p,alphaalpha,betabeta,gammagamma,...

d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2);

Uhat(j,:) = uhat_p.';

end

end

end

end

toc

where the function solvebvp_colloc is of the form

function [utilde] = solvebvp_colloc(ftilde,alpha,beta,gamma,d_min,n_min,...

r_min,d_max,n_max,r_max,identy,D,D2)

A = alpha*D2 + beta*D + gamma*identy;

b = ftilde;

A(end,:) = n_min*D(end,:);

A(end) = A(end) + d_min;

b(end) = r_min;

A(1,:) = n_max*D(1,:);

A(1,1) = A(1,1) + d_max;

b(1) = r_max;

utilde = A\b;

return;

This code is slow for large N,M (large being anything over N=M=10. I would consider N=M=30 as a resonable test case for testing large datasets.) I think that my code design has three major performance flaws including:

- The function solvebvp_colloc.m is processed a large number of times (as it needs to run for all values of the for loops 1:N, 1:M, and 1:Nx). When N and M are large (say N=M=20), the profiler shows that this function is called 20,829,312 times and takes a total runtime of 1042 seconds or 17.3 minutes. The operation utilde = A\b is expensive and calling this millions of times takes up a lot of runtime. Calling the matrix multiplication operations in solvebvp_colloc.m is also expensive.
- Running an extra for loop with j=1:Nx creates larger runtime.
- I don't use matrix decomposition. Since the matrix A changes in every iteration of j (because of the value of gammagamma is different for different values of j), I'm not sure if I can use matrix decomposition to speed up this calculation.

My initial thought was to avoid writing the additional for loop from j=1:Nx. I implemented this in a second impementation below. I wasn't able to figure out how to move the matrix A out of the for loop because the size of gammagamma is (Nx,1) and the size of the identity matrix (Nx+1,Nx+1) are different.

% Test Data

N = 3;

M = 3;

% In real test scenarios, I would have larger N,M such as N=M=30

Nx = 32; % Ny=Nx=Nz

Nz = 32;

Ny = 32;

% My production data doesn't have rand(). I am only creating random

% matrices to test peformance.

Fnmhat = rand(Nx,Nz+1);

Jnmhat = rand(Nx,1);

xi_n_m_hat = rand(Nx,N+1,M+1);

Uhat = zeros(Nx,Nz+1);

Uhat_new = zeros(Nx,Nz+1);

identy = eye(Ny+1,Ny+1);

p = rand(Nx,1);

gammap = rand(Nx,1);

D = rand(Nx+1,Ny+1);

D2 = rand(Nx+1,Ny+1);

gamma = 1.5;

alpha = 0; % this could be non-zero

ntests = 150;

% New Implementation

tic

for ii=1:ntests

for n=0:N

for m=0:M

% Moved outside of the for loop

Fnmhat_p = Fnmhat.';

alphaalpha = 1.0;

betabeta = 0.0; % this could be non-zero

gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)

d_min = 1.0;

n_min = 0.0; % this could be non-zero

r_min = xi_n_m_hat(:,n+1,m+1);

d_max = -1i.*gammap;

n_max = 1.0;

r_max = Jnmhat;

% Moved b outside of the for loop

b = Fnmhat_p;

% I don't know how to move A outside of the for loop as gammagamma

% is of size(Nx,1) and identy is of size(Ny+1,Ny+1) so writing

% A = alphaalpha*D2 + betabeta*D + gammagamma.*identy

% wouldn't work since the two matrices are of different sizes.

for j=1:Nx

uhat_p_new = solvebvp_colloc_new(b,alphaalpha,betabeta,gammagamma,...

d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2,j);

Uhat_new(j,:) = uhat_p_new.';

end

end

end

end

toc

diff = norm(Uhat - Uhat_new,inf); % is zero

where the new function solvebvp_colloc_new.m is written as

function [utilde] = solvebvp_colloc_new(b,alpha,beta,gamma,d_min,n_min,...

r_min,d_max,n_max,r_max,identy,D,D2,j)

A = alpha*D2 + beta*D + gamma(j)*identy;

A(end,:) = n_min*D(end,:);

A(end) = A(end) + d_min;

b(end,j) = r_min(j);

A(1,:) = n_max*D(1,:);

A(1,1) = A(1,1) + d_max(j);

b(1,j) = r_max(j);

utilde = A\b(:,j);

return;

A local test shows that the second implementation is "slightly faster."

Elapsed time of original implementation is 4.087457 seconds.

Elapsed time of new implementation 3.934619 seconds.

For large M,N (say N=M=20 or N=M=30) this difference would be larger. However, I wasn't able to significantly speed up the calculation because I still do a large amount of matrix multiplications and divisions inside the new function solvebvp_colloc_new.m. I'm interested in analyzing how I can speed up this calculation (I will need to test N=M=20 and N=M=30) and the profiler shows that this calculation takes up about 1/3 of the total runtime. I'm interested in analyzing the following performance improvements:

- Can Matlab matrix decomposition help here even though A changes for every value of j (since gammagamma is different for every value of j)?
- Is it possible to avoid writing the for loop from j=1:Nx in order to perform utilde = A\b less and matrix multiplication less?
- Are there any other performance improvements that would speed up this calculation (the function solvebvp_colloc.m is called millions of times for large N,M)? I think that bsxfun may help where the backslash operator becomes utilde = bsxfun(@rdivide, A, b); although this doesn't match the dimensions of the inner for loop through j.

Chunru
on 25 Jul 2021

Edited: Chunru
on 25 Jul 2021

Analyse your program to see if the matrix A is independent of which loop variables (it seems A is independent to some loop variables at the first look of the program). Compute A outside the loops of the independent variables. In inner-loop contactate the right-hand b into a matrix B. Then solve multiple linear systems with the same A in one-go: X = A \ B. This way you will only do one matrix "division" for all the independent loops (if any) and hopefully the performance can be improved.

If A is changing over all loop variable, then you may not be able to improve the performance drastically.

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!