Noticeable difference between single-precision computation on GPU and on CPU

I'm using an iterative algorithm based on repeated matrix multiplications; I notice that my algorithm, with the same inputs:
  • converges on CPU using double precision variables;
  • converges on CPU using single precision variables;
  • converges on GPU using double precision variables;
  • diverges on GPU using single precision variables.
Single precision on the GPU is required to obtain the greatest speed-up. I think the problem is connected with the following behavior:
format longg %just for visualization
M = randn(10,10,'single') % single precision matrix on cpu
Mg = gpuArray(M) % same matrix on gpu
for i=1:20
E(i) = norm(M^i-Mg^i);
end
E = 0
1.184116e-06
5.051582e-06
2.594609e-05
0.0001531584
0.0006053803
0.002351832
0.01166957
0.04795818
0.204383
0.8535618
2.759973
13.55706
48.12731
167.6242
666.1393
3173.625
10306.33
43441.54
154367.1
So the difference between the results becomes very large after just a few matrix multiplications. In double precision, by contrast, the difference between the two matrices stays very small:
format longg %just for visualization
M = randn(10,10,'double') % double precision matrix on cpu
Mg = gpuArray(M) % same matrix on gpu
for i=1:20
E(i) = norm(M^i-Mg^i);
end
1.794529e-15
7.594314e-15
2.508041e-14
1.038021e-13
2.723854e-13
1.127249e-12
2.581471e-12
7.382443e-12
2.095835e-11
4.506465e-11
9.805409e-11
2.562852e-10
6.968754e-10
2.101229e-09
3.449182e-09
1.083673e-08
2.235593e-08
5.35213e-08
1.378775e-07
I think this is what causes the very different results when using the GPU in single precision, and it is a problem because double precision on the GPU is much slower.
How can I resolve this?

Answers (2)

Joss Knight on 24 Nov 2018
Jan's answer is correct, of course; but perhaps the succinct point is to ask: which answer is right? You've been assuming that when you compare the GPU result to the CPU result, you are measuring the error in the GPU result.
format long
M = randn(10,10,'single'); % single precision matrix on cpu
Md = double(M); % double precision version
Mg = gpuArray(M);
for i=1:20
E(i,1:2) = [norm(Md^i-M^i) norm(Md^i-Mg^i)];
end
E
E =
20×2 single gpuArray matrix
1.0e+04 *
0 0
0.0000000 0.0000000
0.0000000 0.0000000
0.0000000 0.0000000
0.0000000 0.0000000
0.0000001 0.0000001
0.0000002 0.0000001
0.0000006 0.0000007
0.0000023 0.0000020
0.0000055 0.0000051
0.0000266 0.0000237
0.0000645 0.0000487
0.0002251 0.0002210
0.0007011 0.0004216
0.0031013 0.0021814
0.0074494 0.0071806
0.0314264 0.0256066
0.1649611 0.0789942
0.5249441 0.2566499
1.3869889 0.9445597
It turns out that, in this case, it was the GPU version that was more accurate. Probably this will always be true, because the GPU computes the sum part of the dot product in parallel in a binary tree fashion, which tends to restrict error growth in floating point by doing most of the sums with smaller values.
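As a rough illustration of that effect (a sketch only, not the actual GPU reduction code; pairwiseSum is a helper written just for this demonstration and needs to live in a script or function file):
x = single(rand(1, 2^16)); % many positive single-precision values
ref = sum(double(x)); % double-precision reference sum
seqSum = single(0);
for k = 1:numel(x)
seqSum = seqSum + x(k); % naive left-to-right accumulation in single
end
treeSum = pairwiseSum(x); % recursive pairwise (binary-tree) accumulation
fprintf('sequential error: %g\n', abs(double(seqSum) - ref));
fprintf('pairwise error:   %g\n', abs(double(treeSum) - ref));
function s = pairwiseSum(v)
% Split the vector in half recursively, mimicking a binary reduction tree.
if numel(v) <= 2
s = sum(v);
else
m = floor(numel(v)/2);
s = pairwiseSum(v(1:m)) + pairwiseSum(v(m+1:end));
end
end
The pairwise error is typically much smaller, because most of the additions combine partial sums of similar, smaller magnitude.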

Jan on 24 Nov 2018
Edited: Jan on 24 Nov 2018
This is the expected behavior. The matrix multiplication is affected by the rounding errors of the dot products of the rows and columns. The dot product is the sum of the element-wise multiplications, and the sum is a numerically unstable operation: a small variation of the input can cause a large difference in the output. Performing a matrix multiplication in single precision must therefore produce larger errors than in double precision, so the observed faster growth of the deviation is exactly what theory predicts. See:
E_single = [0
1.184116e-06
5.051582e-06
2.594609e-05
0.0001531584
0.0006053803
0.002351832
0.01166957
0.04795818
0.204383
0.8535618
2.759973
13.55706
48.12731
167.6242
666.1393
3173.625
10306.33
43441.54
154367.1];
E_double = [ 1.794529e-15
7.594314e-15
2.508041e-14
1.038021e-13
2.723854e-13
1.127249e-12
2.581471e-12
7.382443e-12
2.095835e-11
4.506465e-11
9.805409e-11
2.562852e-10
6.968754e-10
2.101229e-09
3.449182e-09
1.083673e-08
2.235593e-08
5.35213e-08
1.378775e-07];
plot(log(E_single), 'b')
hold on
plot(log(E_double), 'r')
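The instability of the sum itself is easy to see directly (a minimal sketch, nothing here is specific to the GPU): adding the same single-precision numbers in a different order changes the result, while the corresponding double-precision sums agree to far more digits.
x = randn(1, 1e6, 'single');
sFwd = single(0);
sRev = single(0);
for k = 1:numel(x)
sFwd = sFwd + x(k); % forward accumulation in single
end
for k = numel(x):-1:1
sRev = sRev + x(k); % reverse accumulation in single
end
xd = double(x);
dFwd = 0;
dRev = 0;
for k = 1:numel(xd)
dFwd = dFwd + xd(k); % forward accumulation in double
end
for k = numel(xd):-1:1
dRev = dRev + xd(k); % reverse accumulation in double
end
fprintf('single: |forward - reverse| = %g\n', abs(double(sFwd) - double(sRev)));
fprintf('double: |forward - reverse| = %g\n', abs(dFwd - dRev));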
  2 Comments
Andrea Apiella on 24 Nov 2018
Edited: Andrea Apiella on 24 Nov 2018
No, the point here is the difference for the same operation on the SAME data type on the CPU compared with the GPU, i.e. GPU single is very different from CPU single; it is not single vs. double, which is your point. I expected the results obtained on the GPU to be similar to the results obtained on the CPU when using the same data type.
Jan on 24 Nov 2018
Edited: Jan on 24 Nov 2018
@Andrea: You did not get my point. It is not a comparison between single and double; rather, both types accumulate rounding errors, and for singles the error is expected to be larger and to grow faster. See this:
sM = randn(10, 10, 'single');
sM1 = sM;
sM2 = sM + eps(sM); % Tiny variation of the input (single)
dM = randn(10, 10, 'double');
dM1 = dM;
dM2 = dM + eps(dM); % Tiny variation of the input (double)
for i=1:20
sE(i) = norm(sM1^i - sM2^i); % Both on CPU
dE(i) = norm(dM1^i - dM2^i); % Both on CPU
end
figure;
axes('NextPlot', 'add');
plot(log(sE), 'b');
plot(log(dE), 'r');
legend({'single', 'double'});
You see the same behavior if you add a tiny variation to the inputs - here both are computed on the CPU. This is not a proof, but a strong hint that the difference between the computations on the CPU and GPU is caused by different rounding behavior in the dot products. You can expect the same effects when you calculate X^i on different CPUs or with different BLAS libraries. The growing difference is an effect of the instability of the matrix multiplication, not a problem of CPU versus GPU computation.
The problem is that you expect the calculations on the CPU and GPU to give the same results, but the accumulated rounding errors dominate the result, so it is in the nature of the problem that the results differ on different hardware.
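A rough way to see the same thing without any GPU at all (just a sketch): permute the rows and columns of M before taking powers. Mathematically Mp^i is an exactly permuted copy of M^i, but the dot-product terms are summed in a different order, so the rounding differs and the difference grows with i just like the CPU-versus-GPU one.
M = randn(10, 10, 'single');
p = randperm(10);
Mp = M(p, p); % same matrix with rows and columns permuted
for i = 1:20
A = M^i;
B = Mp^i; % equals A(p,p) in exact arithmetic
E(i) = norm(A(p, p) - B); % nonzero only because of rounding (exactly 0 for i = 1)
end
semilogy(E) % grows with i, like the CPU-vs-GPU difference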


Release: R2018b
