(Temporary) Memory requirements of conv2/convn and fft2/fftn computations in GPU and CPU computing

Hi,
I have an application in which I need to compute 3D and 4D convolutions, which I have implemented using various methods and combinations of FFTs and linear convolutions. The problem I'm facing, particularly when using my GPU for these computations, is estimating the temporary memory requirements.
Basically, my lower-bound estimate for the memory requirements is based on the size I need to preallocate before the computation plus the size of my output. This is clearly not enough, seeing as I get a CUDA out-of-memory exception a lot sooner than my estimate suggests.
My question therefore is: how much memory does a general convn or fftn operation require? Here is example code for such cases:
padding = [Ydim, Xdim, Zdim];
fftn_out = ifftn(fftn(M,padding).*fftn(P(:,:,1),padding).*fftn(K,padding));
or using convn
result = convn(convn(M,P(:,:,1),'same'),K,'full');
In the FFT case I clearly know the size of my output and of the padded M, P and K inputs*. But how much temporary memory is necessary to actually compute the output fftn_out? My first guess would be that I also have to account for storing the outputs of the three fftn calls, padded to the padding vector, at twice double precision for the real and imaginary parts. But even then I don't know the temporary requirements of the fftn calculation itself, and I also don't know when this memory is cleared internally.
The same basic question arises when using convn.
Any help would be greatly appreciated.
*These are stored in addition to the non-padded M, P, K arrays, I suppose. Would it therefore make sense to pre-pad M, P, K using the padding vector and clear the original M, P, K before doing the fftn multiplication on the pre-padded arrays?
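For illustration, here is a minimal sketch of the pre-padding idea from the footnote, assuming M, P and K are gpuArrays and that the padding vector gives the common padded size (the explicit zero-padding via indexed assignment is just one possible way to do it):
padding = [Ydim, Xdim, Zdim];
% Zero-pad each input to the common FFT size, then clear the originals so
% only the padded copies occupy GPU memory.
Mp = zeros(padding, 'like', M);
Mp(1:size(M,1), 1:size(M,2), 1:size(M,3)) = M;
clear M
Pp = zeros(padding, 'like', P);
Pp(1:size(P,1), 1:size(P,2), 1) = P(:,:,1);
clear P
Kp = zeros(padding, 'like', K);
Kp(1:size(K,1), 1:size(K,2), 1:size(K,3)) = K;
clear K
% fftn on an already-padded array needs no size argument.
fftn_out = ifftn(fftn(Mp) .* fftn(Pp) .* fftn(Kp));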

Accepted Answer

Joss Knight on 11 Nov 2018
FFT requires a workspace whose size depends on the radix of the signal, and it can be pretty huge. A rule of thumb is that you'll always need 4x your input size (as double complex) for input, output and workspace, but it can go up to 8x with a prime signal length. There's no formula, so you need to experiment. You can watch the GPU memory from a command window during the calculation using nvidia-smi.
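A MATLAB-side alternative to watching nvidia-smi is to sample the device's free memory around the call with gpuDevice; note this only shows the net change after the call returns, not the transient workspace peak (the example size below is arbitrary):
g = gpuDevice;
before = g.AvailableMemory;  % bytes free before the call
A = gpuArray.rand(512, 512, 64);  % example input
fft_out = fftn(A);
wait(g);  % make sure the computation has finished
after = g.AvailableMemory;
fprintf('Net GPU memory consumed: %.1f MB\n', (before - after)/2^20);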
For convn it should be more straightforward: just about enough memory for your output and a copy of your inputs.
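Following that rule of thumb, a back-of-the-envelope estimate for the convn case might look like this (real double data and example sizes assumed; the actual overhead inside convn isn't documented, so treat this as a lower bound):
szA = [512, 512, 64];  % example input sizes
szB = [15, 15, 15];
szOut = szA + szB - 1;  % 'full' output size
bytesPerElement = 8;  % real double precision
estimate = (prod(szA) + prod(szB) + prod(szOut)) * bytesPerElement;
fprintf('convn memory estimate: %.1f MB\n', estimate/2^20);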


Release: R2018b
