GPU optimization of looped vector operations

Question

Lloyd Bumm on 28 Aug 2019


https://ch.mathworks.com/matlabcentral/answers/477900-gpu-optimization-of-looped-vector-operations

Commented: Joss Knight on 6 Sep 2019

I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and the CPU. This code takes ~365 s on the GPU (Nvidia Quadro P5200) and 96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors 90,000 long in this example, repeated 90,000 times in the loop. A test of an array multiplication of two 10,000x10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. In the example below, the GPU is slower for larger arrays. It seems as though the loop introduces a lot of overhead on the GPU operations.

What is the best strategy to optimize this problem for the GPU?

nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
  try
      canUseGPU = parallel.gpu.GPUDevice.isAvailable;
  catch ME
      canUseGPU = false;
  end
  
  %canUseGPU = false;
  
  if canUseGPU  
    %integral using GPU
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    for i = 1:len
        gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
    end
    dN_KK =gather(gdN_KK);
  else
    %integral using CPU (parfor)
    parfor i = 1:len 
        dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
        dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
    end
  end
        
        
% Scales data     
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
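
Read as a formula, the loop above appears to evaluate the following discrete sum at each grid point. The symbols are notation introduced here for illustration, not the poster's: $\bar\nu_i$ is `nu_bar(i)`, $K_i$ is `K(i)`, $K'_i$ is `dK(i)`, $\rho$ is `p_density`, and $n_\infty$ is `n_inf`.

```latex
$$
\Delta N_i \;=\; \frac{1}{\pi\rho}\left[\, 2\sum_{j \neq i} \frac{\bar\nu_j\,K_j}{\bar\nu_j^{2}-\bar\nu_i^{2}} \;+\; \frac{K_i}{2\bar\nu_i} \;+\; K'_i \right],
\qquad
N_i \;=\; \Delta N_i + n_\infty .
$$
```

The sum over $j \neq i$ is what the index expression `[1:i-1, i+1:end]` implements: it skips the singular term where the denominator vanishes.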

4 Comments

Joss Knight on 29 Aug 2019

Lloyd, have you tried running your computation in single precision? Your Quadro P5200 has respectable single precision performance of about 8 or 9 teraflops, but like most graphics cards except special ones, its double precision performance is a small fraction of that at about 280 gigaflops (figures from the Wikipedia page where NVIDIA post their specs). This is why you're not getting much better matrix multiply performance out of your GPU than your CPU - this would be dramatically different in single. It is perfectly normal for an algorithm to run faster on the CPU than on one of these graphics-focussed cards, especially if it is an algorithm with a lot of unvectorized loops and a multiprocess parallelization.
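
A quick way to check the single-versus-double gap on a particular card is to time the same multiply in both precisions. The following is only a sketch, assuming Parallel Computing Toolbox is installed; the size `n` is deliberately smaller than the 10,000 used in the question so that it fits on most devices, and the variable names are made up for illustration.

```matlab
n = 2000;                                   % illustrative size, not the 10,000 from the question

% Same GPU availability test as in the question
try
    haveGPU = parallel.gpu.GPUDevice.isAvailable;
catch
    haveGPU = false;
end

if haveGPU
    Ad = rand(n, 'gpuArray');               % double precision, created on the device
    Bd = rand(n, 'gpuArray');
    As = single(Ad);                        % single-precision copies of the same data
    Bs = single(Bd);

    % gputimeit synchronizes with the device, so the times include kernel execution
    tDouble = gputimeit(@() Ad * Bd);
    tSingle = gputimeit(@() As * Bs);
    fprintf('double: %.4f s, single: %.4f s, ratio: %.1f\n', ...
            tDouble, tSingle, tDouble / tSingle);
else
    disp('No supported GPU found; skipping the timing.');
end
```

If the ratio is large for the multiply but small for the looped integral, that would point toward the loop being limited by something other than arithmetic throughput.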

Lloyd Bumm on 6 Sep 2019

I noted the single precision times in the benchmarks below. The effect is only a factor of 2. However, this is not the sort of problem that should be done in single precision.


Answer 1

Matt J on 28 Aug 2019


https://ch.mathworks.com/matlabcentral/answers/477900-gpu-optimization-of-looped-vector-operations#answer_389457

Edited: Matt J on 6 Sep 2019


This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks:

https://www.mathworks.com/matlabcentral/fileexchange/35085-mat2tiles-divide-array-into-equal-sized-sub-arrays

It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions x([1:i-1, i+1:end]). Those are costly.

    tic;
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    
    
    chunksize=1000;
    
    vv=gnu_bar.^2;
    vvchunks=mat2tiles(  vv , [1,chunksize]);
    numer=gnu_bar.*gK;
    c=1;
    
    for k=1:numel(vvchunks)
        Q=numer(:)./(vv.'-vvchunks{k});
        Q(c:len+1:end)=0;
        c=c+size(Q,2);
        vvchunks{k}=sum(Q,1);
    end
    
    gdN_KK=[vvchunks{:}];
    gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;    
    
    wait(gpuDevice) % block until queued GPU work finishes so that toc is meaningful
    toc %Elapsed time is 2.027665 seconds.

15 Comments

Lloyd Bumm on 6 Sep 2019

Edited: Lloyd Bumm on 6 Sep 2019


I looked at your mat2tiles implementation. It is not giving the correct result; the graph looks nothing like the correct result. I implemented your solution as in the long code block below, which I tacked onto my original code.

I'm not sure I understand what this statement is doing,

vvchunks{k}-vv.'

but I don't think it is correct for what I need mathematically. The updated optimizations I posted should make it clearer.
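
For readers puzzled by the same statement: subtracting a column vector from a row vector uses implicit expansion (MATLAB R2016b and later) and yields a full matrix of all pairwise differences. The tiny example below uses made-up numbers in place of `vvchunks{k}` and `vv.'`, just to show the shape and sign of the result.

```matlab
row = [10 20];      % stands in for vvchunks{k}, a 1-by-chunk row
col = [1; 2; 3];    % stands in for vv.', a len-by-1 column

D1 = row - col;     % 3-by-2 matrix, D1(j,m) = row(m) - col(j)
D2 = col - row;     % 3-by-2 matrix, D2(j,m) = col(j) - row(m), i.e. -D1

disp(D1)            % rows index the summation points, columns the evaluation points
```

The order of the operands only changes the sign of every entry, so `vvchunks{k}-vv.'` and the form `vv.'-vvchunks{k}` used in the later revision in this thread differ by a factor of -1 in the denominator.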

I agree that mat2tiles will help the memory problem in the 2D operations. I just trapped those for testing in the updated optimizations.

Note, I expanded the range of the vectorized integral and made the number of integrals evaluated a subset of those points (wavenumbers here). This is better mathematically.

nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
    gGPU = gpuDevice(1);
    reset(gGPU);
tic;
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    
    
    chunksize=1000;
    
    vv=gnu_bar.^2;
    vvchunks=mat2tiles(  vv , [1,chunksize]);
    numer=mat2tiles(gnu_bar.*gK,[1,chunksize]);
    c=1;
    
    for k=1:numel(vvchunks)
        Q=numer{k}./(vvchunks{k}-vv.');
        Q(c:len+1:end)=0;
        c=c+size(Q,2);
        vvchunks{k}=sum(Q,1);
    end
    
    gdN_KK=[vvchunks{:}];
    gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
    MJ_dN_KK = (1/(pi*p_density))*gather(gdN_KK);
    wait(gGPU)
    toc 
    figure
    plot(nu_bar,MJ_dN_KK);

Matt J on 6 Sep 2019

Edited: Matt J on 6 Sep 2019


Okay, I did make a few fixes. To be sure we're on the same page, I share the test code below, in which I see strong agreement between the two versions.

nubar_low = 2450;
nubar_high = 2451;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
    tic;
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    
    
    %%%% ORIGINAL %%%%%
    
    
    for i = 1:len
        gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
    end
    
    version1 = gdN_KK ;
    
    %%%% OPTIMIZED %%%%%%
    
    chunksize=5;
    
    vv=gnu_bar.^2;
    vvchunks=mat2tiles(  vv , [1,chunksize]);
    numer=gnu_bar.*gK;
    c=1;
    
    for k=1:numel(vvchunks)
        Q=numer(:)./(vv.'-vvchunks{k});
        Q(c:len+1:end)=0;
        c=c+size(Q,2);
        vvchunks{k}=sum(Q,1);
    end
    
    gdN_KK=[vvchunks{:}];
    gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
    
    %wait(gd)
    toc %Elapsed time is 2.027665 seconds.
    
    version2 = gdN_KK ;
    
    plot(1:len,version1,'-',1:len,+version2,'x'); legend('Lloyd','Matt')
    

Lloyd Bumm on 6 Sep 2019


At first I didn't recognize what was going on because you had the interval set to one wavenumber (2450-2451), far away from the absorption (2800). I changed the interval back to 2450-3350, increased the chunksize to 1000, scaled it properly, and compared it to my 1D vectorized CPU code (a fairer comparison). It is spot on now.

LB CPU 1D vector optimized: 4.036714 seconds

MJ GPU mat2tiles optimized: 2.554221 seconds

I'll need to figure out if I can implement your solution when the integrals are evaluated at a subset of the points in the integration.

nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
    
    
    %%%% 1D vector optimized %%%%%
    tic;
    part_a = nu_bar .* K;
    part_b = nu_bar .^2;
    for i = 1:len
        part_c = part_a ./ (part_b - part_b(i));
        part_c(i) = 0;
        dN_KK(i) = sum(part_c);
    end
    dN_KK = (1/(pi*p_density))*(2*dN_KK + K./(2*nu_bar) + dK);
    version1 = dN_KK ;
    toc %Elapsed time is 4.036714 seconds.
    
    %%%% OPTIMIZED %%%%%%
    tic
    
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    
    chunksize=1000;
    
    vv=gnu_bar.^2;
    vvchunks=mat2tiles(  vv , [1,chunksize]);
    numer=gnu_bar.*gK;
    c=1;
    
    for k=1:numel(vvchunks)
        Q=numer(:)./(vv.'-vvchunks{k});
        Q(c:len+1:end)=0;
        c=c+size(Q,2);
        vvchunks{k}=sum(Q,1);
    end
    
    gdN_KK=[vvchunks{:}];
    gdN_KK = (1/(pi*p_density))*(2*gdN_KK + gK./(2*gnu_bar) + gdK);
    
    %wait(gd)
    toc %Elapsed time is 2.554221 seconds.
    
    version2 = gdN_KK ;
    
    figure
    plot(nu_bar,version1,nu_bar,+version2,'x'); legend('Lloyd','Matt')
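
On the open question above, evaluating the integrals at only a subset of the integration points, one possible adaptation of the chunked approach is sketched below. It is not taken from the thread: it replaces `mat2tiles` with plain index ranges, and the grid, `K` profile, `evalIdx`, and `chunksize` are small made-up values chosen for illustration. It runs on the CPU as written; wrapping `nu` and `K` in `gpuArray` should move the work to the device, though that has not been benchmarked here.

```matlab
% Small synthetic problem (illustrative values only)
nu      = 100:0.5:400;                          % integration grid, 1-by-601
K       = 0.01 ./ (1 + ((nu - 250)/20).^2);     % made-up absorption profile
evalIdx = 51:10:251;                            % indices into nu where the sum is evaluated
nEval   = numel(evalIdx);
chunksize = 8;                                  % evaluation points handled per block

numer  = (nu .* K).';                           % column: numerator at every integration point
vvCol  = (nu.^2).';                             % column: squared integration points
vvEval = nu(evalIdx).^2;                        % row: squared evaluation points
S      = zeros(1, nEval, 'like', nu);

for first = 1:chunksize:nEval
    blk = first:min(first + chunksize - 1, nEval);
    Q = numer ./ (vvCol - vvEval(blk));         % numel(nu)-by-numel(blk), via implicit expansion
    % Zero the singular terms, where the integration point equals the evaluation point
    Q(sub2ind(size(Q), evalIdx(blk), 1:numel(blk))) = 0;
    S(blk) = sum(Q, 1);
end

% Reference: the 1D vectorized loop, one evaluation point at a time
numerRow = nu .* K;
vvRow    = nu.^2;
Sref     = zeros(1, nEval);
for i = 1:nEval
    jj = evalIdx(i);
    t = numerRow ./ (vvRow - vvRow(jj));
    t(jj) = 0;
    Sref(i) = sum(t);
end
relErr = max(abs(S - Sref)) / max(abs(Sref));
fprintf('max error relative to largest value: %.2e\n', relErr);
```

Each block allocates a `numel(nu)`-by-`chunksize` matrix, so `chunksize` trades memory for fewer loop iterations. The remaining terms (`K./(2*nu)`, `dK`, and the `1/(pi*p_density)` scale) would be applied to `S` afterwards exactly as in the 1D version.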


Answer 2

Lloyd Bumm on 6 Sep 2019


https://ch.mathworks.com/matlabcentral/answers/477900-gpu-optimization-of-looped-vector-operations#answer_390605

Edited: Lloyd Bumm on 6 Sep 2019


I had some time to get back to this project today. The code below is set up differently than the OP in that the number of points and the range of the Kramers-Kronig integration differ from the range and density of points over which it is evaluated.

Below I compare the un-optimized method in the OP with an efficient 1D vectorization and the 2D vectorization, using for loops, parfor loops, and the GPU. The primary improvement is the 1D optimization, implemented based on the discussions above.

The parameters and timings are in the code, but I will list them here for convenience.

990001 points per integral; 301 integrals

CPU for unoptimized: 19.893800 seconds.

CPU for 1D vector optimized: 0.319634 seconds.

CPU 2D vectorized: 1.318735 seconds.

CPU for 1D vector optimized single precision: 0.079821 seconds.

CPU 2D vectorized single precision: 0.861958 seconds.

CPU parfor unoptimized: 7.018130 seconds.

CPU parfor 1D vector optimized: 0.736592 seconds.

GPU for un-optimized: 11.348325 seconds.

GPU for 1D vector optimized: 0.540666 seconds.

GPU for 2D vector optimized: 33.338617 seconds.

GPU for un-optimized single precision: 11.187868 seconds.

GPU for 1D vector optimized single precision: 0.471209 seconds.

GPU for 2D vector optimized single precision: 16.836338 seconds.

% nubar_low and nubar_high bound the integration range in wavenumbers (nu_bar);
% p_density is the number of integration points per wavenumber.
% The K-K integral is evaluated at a subset of wavenumbers (nu_bar_eval)
% on the interval nubar_start to nubar_end, with a point density lower than
% that of the integral by the factor eval_sub_density.
% The indices in eval_ind must correspond exactly to points in nu_bar.
close all
clear all
nubar_low = 100;
nubar_high = 10000;
nubar_start = 2450;
nubar_end = 3350;
p_density = 100; %points per wavenumber for integral
eval_sub_density = 300; % evaluate interval lower than the p_density by this factor
nu_bar = nubar_low:1/p_density:nubar_high; %wavenumber vector for integral
len_nubar = length(nu_bar);
nu_bar_eval = nubar_start:1/(p_density/eval_sub_density):nubar_end; %vector wavenumbers where integral is evaluated
len_eval=length(nu_bar_eval);
bb = (nu_bar >= nubar_start) & (nu_bar <= nubar_end);
ind_start = find(bb,1,'first');
ind_end = find(bb,1,'last');
eval_ind = ind_start:eval_sub_density:ind_end; %indices in nu_bar that correspond to nu_bar_eval
fprintf('%i points per integral; %i integrals\r',len_nubar, len_eval);
n_inf = 0;
%compute K spectrum
K = zeros(size(nu_bar));
k_max = 0.01; %max k
nu_bar_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nu_bar-nu_bar_0).^2 + (gamma/2)^2).^-1 - ((nu_bar+nu_bar_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
dN_KK = zeros(1,len_eval);
%times for len_nubar = 990001; len_nubar_eval = 301
fprintf('\rCPU for unoptimized\r');  % 19.9 s
    
    tic
    for i = 1:len_eval 
        jj = eval_ind(i);
        dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2));
        dN_KK(i) = 2*dN_KK(i) + K(jj)./(2*nu_bar(jj)) + dK(jj);
    end
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
    
    figure
    plot(nu_bar_eval,dN_KK);
    hold on
    
fprintf('\rCPU for 1D vector optimized\r');  % 0.32 s
    tic
    part_a = nu_bar .* K;
    part_b = nu_bar .^2;
    for i = 1:len_eval 
        jj = eval_ind(i);
        part_c = part_a ./ (part_b - part_b(jj));
        part_c(jj) = 0;  %set singular points to zero before sum
        dN_KK(i) = sum(part_c);
    end
    dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
       
    plot(nu_bar_eval,dN_KK);
    hold on
    fprintf('\rCPU 2D vectorized\r');  % 1.32 s
%test for sufficient available memory
    mem = memory;
    mem_avail = mem.MemAvailableAllArrays;
    mem_need = len_nubar*len_eval*8*4;
if mem_need > mem_avail
   fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
    proceed = false;
else
    fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
    proceed = true;
end
% proceed if sufficient available memory
if proceed
    tic
    R_temp = repmat(nu_bar.^2,len_eval,1); % to conserve memory, reuse this variable as a temp
    R_nubar_K = repmat(nu_bar.*K,len_eval,1);
    R_nu_bar_eval_sq = repmat(nu_bar_eval'.^2,1,len_nubar);
    R_temp = (R_nubar_K) ./ (R_temp - R_nu_bar_eval_sq);
    for i = 1:len_eval
        R_temp(i,eval_ind(i)) = 0;  %set singular points to zero before sum
    end
    dN_KK = sum(R_temp,2)';
    dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
    dN_KK = (1/(pi*p_density))*dN_KK;
     N_KK = dN_KK + n_inf;
    toc
    clear R_temp R_nubar_K R_nu_bar_eval_sq; %dump big arrays
    plot(nu_bar_eval,dN_KK);
    hold on
end
%%%%%%%%%%%%%%%%%%%%%%%%%% single precision %%%%%%%%%%%%%%%%
fprintf('\rCPU for 1D vector optimized single precision\r');  % 0.08 s
    tic
    snu_bar = single(nu_bar);
    sK = single(K);
    sdK = single(dK);
    snu_bar_eval = single(nu_bar_eval);
    seval_ind = single(eval_ind);
    sdN_KK = single(dN_KK);
    spart_a = snu_bar .* sK;
    spart_b = snu_bar .^2;
    for i = 1:len_eval 
        jj = seval_ind(i);
        spart_c = spart_a ./ (spart_b - spart_b(jj));
        spart_c(jj) = 0;  %set singular points to zero before sum
        sdN_KK(i) = sum(spart_c);
    end
    dN_KK = double(2*sdN_KK + sK(seval_ind)./(2*snu_bar_eval) + sdK(seval_ind));
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
       
    plot(nu_bar_eval,dN_KK);
    hold on
    fprintf('\rCPU 2D vectorized single precision\r');  % 0.86 s
%test for sufficient available memory
    mem = memory;
    mem_avail = mem.MemAvailableAllArrays;
    mem_need = len_nubar*len_eval*4*4;
if mem_need > mem_avail
   fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
    proceed = false;
else
    fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
    proceed = true;
end
% proceed if sufficient available memory
if proceed
    tic
    sR_temp = repmat(single(nu_bar).^2,len_eval,1); % to conserve memory, reuse this variable as a temp
    sR_nubar_K = repmat(single(nu_bar).*K,len_eval,1);
    sR_nu_bar_eval_sq = repmat(single(nu_bar_eval)'.^2,1,len_nubar);
    seval_ind = single(eval_ind);
    
    sR_temp = (sR_nubar_K) ./ (sR_temp - sR_nu_bar_eval_sq);
    for i = 1:len_eval
        sR_temp(i,seval_ind(i)) = 0;  %set singular points to zero before sum
    end
    dN_KK = double(sum(sR_temp,2)');
    dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
    clear sR_temp sR_nubar_K sR_nu_bar_eval_sq; %dump big arrays
    plot(nu_bar_eval,dN_KK);
    hold on
end
%^^^^^^^^^^^^^^^^^^^^ end single precision ^^^^^^^^^^^^^^^^^^
fprintf('\rCPU parfor unoptimized\r');  % 7.02 s
poolobj = gcp;
    
    tic
    parfor i = 1:len_eval 
        jj = eval_ind(i);
        dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2));
        dN_KK(i) = 2*dN_KK(i) + K(jj)./(2*nu_bar(jj)) + dK(jj);
    end
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
    
    plot(nu_bar_eval,dN_KK);
    hold on
    
    
fprintf('\rCPU parfor 1D vector optimized\r');  % 0.74 s
poolobj = gcp;
    tic
    part_a = nu_bar .* K;
    part_b = nu_bar .^2;
    parfor i = 1:len_eval 
        jj = eval_ind(i);
        part_c = part_a ./ (part_b - part_b(jj));
        part_c(jj) = 0; %set singular points to zero before sum
        dN_KK(i) = sum(part_c);
    end
    dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
    dN_KK = (1/(pi*p_density))*dN_KK;
    N_KK = dN_KK + n_inf;
    toc
    
    plot(nu_bar_eval,dN_KK);
    hold on
% Test for presence of GPU
  try
      canUseGPU = parallel.gpu.GPUDevice.isAvailable;
  catch ME
      canUseGPU = false;
  end
  
  
  if canUseGPU  
    
    fprintf('\rGPU for un-optimized\r');  % 11.35 s
    gGPU = gpuDevice(1);
    reset(gGPU);
tic
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    for i = 1:len_eval
        jj = eval_ind(i);
        gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
    end
    gdN_KK = (1/(pi*p_density))*gdN_KK;
    dN_KK =gather(gdN_KK);
    wait(gGPU);
    
toc
    plot(nu_bar_eval,dN_KK);
    hold on
    
    fprintf('\rGPU for 1D vector optimized\r');  % 0.54 s
    gGPU = gpuDevice(1);
    reset(gGPU);
tic
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gnu_bar_eval = gpuArray(nu_bar_eval);
    gdN_KK = gpuArray(dN_KK);
    geval_ind = gpuArray(eval_ind);
    gpart_a = gnu_bar .* gK;
    gpart_b = gnu_bar .^2;
    for i = 1:len_eval
        jj = geval_ind(i);
        gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
        gpart_c(jj) = 0; %set singular points to zero before sum
        gdN_KK(i)= sum(gpart_c);
    end
    gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
    gdN_KK = (1/(pi*p_density))*gdN_KK;
    dN_KK =gather(gdN_KK);
    
    wait(gGPU);
    
    N_KK = dN_KK + n_inf;
    
toc
    plot(nu_bar_eval,dN_KK);
    hold on
    
    fprintf('\rGPU for 2D vector optimized\r'); % 33.34 s
    gGPU = gpuDevice(1);
    reset(gGPU);
    %test for sufficient memory
    mem_GPU_avail = gGPU.AvailableMemory;
    mem_need = len_nubar*len_eval*8*5;
    if mem_need > mem_GPU_avail
        fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
        proceed = false;
    else
        fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
        proceed = true;
    end
    %proceed if sufficient memory is available
    if proceed
    tic
        gnu_bar = gpuArray(nu_bar);
        gnu_bar_eval = gpuArray(nu_bar_eval);
        gK = gpuArray(K);
        gdK = gpuArray(dK);
        gdN_KK = gpuArray(dN_KK);
        geval_ind = gpuArray(eval_ind);
        R_gtemp = repmat(gnu_bar.^2,len_eval,1);
        R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
        R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
        R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
        for i = 1:len_eval
            R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
        end
        gdN_KK = sum(R_gtemp,2)';
        gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
        gdN_KK = (1/(pi*p_density))*gdN_KK;
        dN_KK =gather(gdN_KK);
        
        wait(gGPU);
        N_KK = dN_KK + n_inf;
    toc
        plot(nu_bar_eval,dN_KK);
        hold on
    end
%%%%%%%%%%%%% single precision tests
    fprintf('\rGPU for un-optimized single precision\r');  % 11.19 s
    gGPU = gpuDevice(1);
    reset(gGPU);
tic
    gnu_bar = gpuArray(single(nu_bar));
    gK = gpuArray(single(K));
    gdK = gpuArray(single(dK));
    gdN_KK = gpuArray(single(dN_KK));
    for i = 1:len_eval
        jj = eval_ind(i);
        gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
    end
    gdN_KK = (1/(pi*p_density))*gdN_KK;
    dN_KK =double(gather(gdN_KK));
    wait(gGPU);
    
toc
    plot(nu_bar_eval,dN_KK);
    hold on
    
    fprintf('\rGPU for 1D vector optimized single precision\r');  % 0.47 s
    gGPU = gpuDevice(1);
    reset(gGPU);
tic
    gnu_bar = gpuArray(single(nu_bar));
    gK = gpuArray(single(K));
    gdK = gpuArray(single(dK));
    gdN_KK = gpuArray(single(dN_KK));
    gnu_bar_eval = gpuArray(single(nu_bar_eval));
    geval_ind = gpuArray(single(eval_ind));
    gpart_a = gnu_bar .* gK;
    gpart_b = gnu_bar .^2;
    for i = 1:len_eval
        jj = geval_ind(i);
        gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
        gpart_c(jj) = 0; %set singular points to zero before sum
        gdN_KK(i)= sum(gpart_c);
    end
    gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
    gdN_KK = (1/(pi*p_density))*gdN_KK;
    dN_KK =double(gather(gdN_KK));
    
    wait(gGPU);
    
    N_KK = dN_KK + n_inf;
    
toc
    plot(nu_bar_eval,dN_KK);
    hold on
    
    fprintf('\rGPU for 2D vector optimized single precision\r'); % 16.84 s
    gGPU = gpuDevice(1);
    reset(gGPU);
    %test for sufficient memory
    mem_GPU_avail = gGPU.AvailableMemory;
    mem_need = len_nubar*len_eval*4*5;
    if mem_need > mem_GPU_avail
        fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
        proceed = false;
    else
        fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
        proceed = true;
    end
    %proceed if sufficient memory is available
    if proceed
    tic
        gnu_bar = gpuArray(single(nu_bar));
        gK = gpuArray(single(K));
        gdK = gpuArray(single(dK));
        gdN_KK = gpuArray(single(dN_KK));
        gnu_bar_eval = gpuArray(single(nu_bar_eval));
        geval_ind = gpuArray(single(eval_ind));
        R_gtemp = repmat(gnu_bar.^2,len_eval,1);
        R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
        R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
        R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
        for i = 1:len_eval
            R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
        end
        gdN_KK = sum(R_gtemp,2)';
        gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
        gdN_KK = (1/(pi*p_density))*gdN_KK;
        dN_KK =double(gather(gdN_KK));
        
        wait(gGPU);
        N_KK = dN_KK + n_inf;
    toc
        plot(nu_bar_eval,dN_KK);
        hold on
    end
    
    
    
  end

3 Comments

Lloyd Bumm on 6 Sep 2019

I edited the above results to include single precision on the CPU and the GPU.

Joss Knight on 6 Sep 2019

Thanks. The only explanation for that is that your cost is all overhead on the GPU, and not computation.
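
The posted timings can be used to put rough numbers on this. The snippet below only does arithmetic on figures copied from the benchmark list in this answer; how to read the ratios is a tentative interpretation, not a measurement of where the time actually goes.

```matlab
% GPU timings in seconds, copied from the benchmark list in this answer
% order: [un-optimized loop, 1D vector optimized loop, 2D vectorized]
tDouble = [11.348325, 0.540666, 33.338617];
tSingle = [11.187868, 0.471209, 16.836338];
nIntegrals = 301;                               % number of evaluation points in the benchmark

ratio = tDouble ./ tSingle;                     % double time divided by single time
msPerIntegral1D = 1e3 * tDouble(2) / nIntegrals;

fprintf('double/single ratio: %.2f (un-optimized), %.2f (1D), %.2f (2D)\n', ratio);
fprintf('1D optimized, double precision: %.2f ms per integral\n', msPerIntegral1D);
```

The two looped versions change by only about 1% and 15% when switching precision, which would be consistent with per-iteration overhead dominating. The 2D version changes by close to a factor of 2, roughly matching the halving of the data volume, which may mean it is limited by memory rather than by arithmetic.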
