# Sum of squares profiling on GPU

Dan Ryan on 5 Oct 2013
Commented: Dan Ryan on 7 Oct 2013
I was profiling some code that runs on my GPU and came across something puzzling that I haven't been able to sort out. Maybe it has something to do with the way the profiler interacts with the GPU, so I also ran the same code on the CPU and got very different results. Here is the code:
```matlab
clear all
g = gpuArray.rand(600, 600, 400, 'single');
for i = 1:100
    x = sum(g, 3)/400;
    gSq = g.^2;
    y = sum(gSq, 3)/400;
    g = g + .01;
end
```
This code is just an example that reproduces the problem, not the actual code I am running, so don't wonder why anybody would do this...
On the GPU, the profiler shows basically ALL of the time being spent on the line

```matlab
y = sum(gSq, 3)/400;
```
On the CPU, the profiler shows most of the time being spent on

```matlab
g = g + .01;
```

with the remainder distributed fairly evenly among the other lines.
Why is summing the gSq array so expensive on the GPU relative to summing the g array? They are the same size. I don't think it is a memory issue: my GPU has 4 GB of memory, and almost 3 GB is still available with g, x, gSq, and y in memory.
Any ideas?
Dan Ryan on 7 Oct 2013
Upon further investigation, I can conclude that the profiler does not correctly assign time to each line when dealing with the GPU. For instance, if I run
```matlab
g = gpuArray.rand(600, 600, 400, 'single');
for i = 1:1000
    gSq = g.^2;
    g = g + .01;
end
```
the whole script terminates in about 16 seconds, and almost all of the time is assigned to the line `gSq = g.^2;`.
However, after adding a line that computes the sum:
```matlab
g = gpuArray.rand(600, 600, 400, 'single');
for i = 1:1000
    gSq = g.^2;
    x = sum(gSq, 3);
    g = g + .01;
end
```
the script now takes 40 seconds to run, and only about 0.5 seconds in total is assigned to the line `gSq = g.^2;`. This indicates that time is not being attributed to the correct lines.
Secondly, the squaring operation, `.^2`, takes two to three times as long as explicitly multiplying the quantity by itself. Changing the line

```matlab
gSq = g.^2;
```

to

```matlab
gSq = g.*g;
```

results in a script that runs in about 5 seconds without the sum and 20 seconds with it, indicating that about 10 seconds are saved computing gSq and another 10 seconds are saved computing `sum(gSq, 3)`... very strange.
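One possible explanation for the misattribution (an assumption on my part, not something the profiler output states) is that GPU operations are launched asynchronously, so the clock keeps running until some later line forces MATLAB to wait for the result; the reduction would then be charged for the squaring kernel as well. A way to check this is to synchronize the device explicitly with `wait(gpuDevice)` before stopping each timer; a minimal sketch:

```matlab
% Sketch: force the GPU to finish each operation before reading the
% clock, so each measurement covers only its own kernel.
g = gpuArray.rand(600, 600, 400, 'single');
dev = gpuDevice;   % handle to the current GPU device

tic
gSq = g.^2;
wait(dev);          % block until the squaring kernel completes
tSquare = toc;

tic
x = sum(gSq, 3);
wait(dev);          % block until the reduction completes
tSum = toc;

fprintf('square: %.3f s, sum: %.3f s\n', tSquare, tSum);
```

If the squaring time reappears once the synchronization is in place, the profiler numbers were measuring kernel launch plus forced waits, not the work itself.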

Sean de Wolski on 7 Oct 2013
You might be interested in gputimeit, new in R2013b.
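gputimeit takes a zero-argument function handle, runs it repeatedly, and synchronizes the GPU around the timed call, so the measurements are not skewed by asynchronous execution the way per-line profiler numbers can be. A minimal sketch, borrowing the array size from the question above:

```matlab
% Sketch: compare the operations from the question with gputimeit,
% which handles GPU synchronization and repetition internally.
g = gpuArray.rand(600, 600, 400, 'single');

tSquare = gputimeit(@() g.^2);         % time the squaring alone
tMul    = gputimeit(@() g.*g);         % explicit multiply, for comparison
tSum    = gputimeit(@() sum(g.^2, 3)); % squaring plus the reduction

fprintf('.^2: %.4f s, .*: %.4f s, sum: %.4f s\n', tSquare, tMul, tSum);
```

This should settle both questions at once: whether `.^2` really is slower than `.*`, and how much of the time charged to the sum line actually belongs to the reduction.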
Dan Ryan on 7 Oct 2013
Awesome, I will check this out.