flaky GPU memory issues

Rodrigo on 9 Feb 2012
Edited: Cedric on 8 Oct 2013
We have a GTX 580 with 3 GB of RAM running in a Linux machine (Ubuntu Lucid with a backported Natty kernel) with R2011b, and I find myself fighting seemingly random crashes due to memory allocation on the GPU. The first thing I noticed was that overwriting a variable defined on the GPU does not always give me back all the RAM the old variable held, minus the size of the new data, so I have to clear the variable instead of overwriting it. Is there some collection of best practices to avoid wasting memory in ways like this?
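A minimal sketch of the pattern I mean (the array size is just for illustration):
A = gpuArray(rand(4096, 'single'));            % roughly 67 MB on the card
g = gpuDevice; before = g.FreeMemory;          % snapshot the free memory
A = gpuArray(rand(4096, 'single'));            % overwrite: the old buffer can linger
g = gpuDevice; afterOverwrite = g.FreeMemory;
clear A                                        % explicit clear hands the memory back
g = gpuDevice; afterClear = g.FreeMemory;
fprintf('free: %g -> %g -> %g bytes\n', before, afterOverwrite, afterClear)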
I also find that a calculation that has been running for hours, and that has successfully completed before, will sometimes crash with an "unexpected error" that seems to correlate with running close to maximum memory capacity. Since the program had completed before, I am left assuming that some other program interfered with memory allocation on the GPU and killed my task. Is there a way to prevent this from happening? Maybe running the server headless, or putting in a second, smaller video card to drive the display?
Thanks

Accepted Answer

Edric Ellis on 9 Feb 2012
In your first observation about overwriting variables on the GPU, I presume you're using the output of "gpuDevice" to check the amount of free memory on the GPU. You're quite right that overwriting an array may not necessarily cause the old memory to be freed immediately; however, it will be freed automatically if necessary to prevent running out of memory.
It's not clear what the 'unexpected error' might be; this is not something I've seen here at The MathWorks on our test machines. Do these errors show up in similar places each time? I.e., does a particular gpuArray operation seem to trigger them?
One final thing to note: like CPU memory, GPU memory can become fragmented over time, and it's possible that this might cause you to run out of GPU memory earlier than you might otherwise anticipate. However, I would not normally expect this to result in 'unexpected errors' - rather, I'd expect to see failed allocations.
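One mitigation sketch (the sizes here are made up): keep your large allocations the same size from iteration to iteration, so that freed blocks can be recycled rather than fragmenting further.
n = 4096; nIters = 100;            % made-up problem size
A = gpuArray(zeros(n, 'single'));  % allocate the big arrays once, up front
B = gpuArray(rand(n, 'single'));
for k = 1:nIters
    A = A + 0.1 * B;               % every pass produces a same-sized result,
end                                % with no growing or shrinking temporaries
result = gather(A);                % bring the final answer back to the host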
  8 Comments
Rodrigo on 12 Feb 2012
I still have no idea what this command does, but it seems to have solved the problem. I still have about a month of back-to-back calculations to run -- so it may be too early to tell -- but whatever black voodoo this "feature" command does seems to work.
Rodrigo on 13 Apr 2012
So this fix seems to break in R2012a. Any ideas for how to unbreak it?


More Answers (3)

Walter Roberson on 9 Feb 2012
It is not safe to assume that some other program interfered with the memory allocation. Instead, you have to take into account that your program might have corrupted memory in a way that does not always cause a crash but sometimes does -- for example, if the corrupted memory block does not happen to be needed again until a lot of memory is in use.
  2 Comments
Rodrigo on 9 Feb 2012
I see. So is there a way to periodically flush the GPU memory to avoid this corruption? Right now the full computation takes about 24 hours, and having it crash in the 23rd hour stings. I suppose I can dump the partial results to disk and try to recover after a crash, but since I don't know what the "unexpected error" actually is, I have a hard time adjusting my programs to avoid it.
In case a MathWorks engineer is reading, posting a set of best practices and common "unexpected errors" would be really helpful.
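Something along these lines is what I have in mind (the file name and the stand-in GPU step are made up):
ckpt = 'partial_results.mat';                  % checkpoint file
nSteps = 1000;
if exist(ckpt, 'file')
    S = load(ckpt);                            % resume where the last run died
    k0 = S.k + 1; results = S.results;
else
    k0 = 1; results = zeros(nSteps, 1);
end
for k = k0:nSteps
    g = sum(gpuArray(rand(1e6, 1, 'single'))); % stand-in for the real GPU step
    results(k) = gather(g);                    % copy off the device before saving
    if mod(k, 100) == 0
        save(ckpt, 'results', 'k');            % cheap insurance every 100 steps
    end
end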
Walter Roberson on 9 Feb 2012
If you _do_ have a memory corruption problem from your code (or from something in MathWorks' implementation), then releasing all memory _or_ using all memory could trigger the problem. However, releasing the GPU from operations could, depending on the implementation, simply throw away all of the memory without bothering to put the fragments back together.
It would not be impossible for a memory allocator to offer an "ignore everything known about the current state of memory and just re-initialize back to the starting state" operation. I do not recall ever encountering a memory allocation library that offered that as a user call, however.
I have not examined the memory allocation system used for the GPU routines; I am reflecting back to my past experiences [redacted] years ago, using [redacted] on [redacted] (redactions to protect my delusions that I am not _that_ old...)
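(For what it's worth, R2012a added a call along exactly these lines to Parallel Computing Toolbox: reset(gpuDevice) reinitializes the selected device, at the cost of invalidating every existing gpuArray variable.)
% R2012a and later: throw the device's memory state away entirely.
% All gpuArray and CUDAKernel variables become invalid afterwards.
reset(gpuDevice);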



Ruby Fu on 10 Feb 2012
Hi Rodrigo, Edric and Walter, it is great that I found this post just when I need it! I have the exact same problem as Rodrigo. My experience has been this:
1. My program runs perfectly fine on a smaller-resolution problem, meaning smaller matrices and less memory allocation.
2. When I try to run the program at a higher resolution, it yells at me for not having enough memory.
3. So naturally I clear several intermediate matrices at each iteration once they are done being useful; they get updated at the next iteration anyway.
4. Now I test-run the new program (with memory cleared at each iteration) on the _small_ resolution problem, just to make sure I did not accidentally clear some useful variables.
5. It fails with:
Error using parallel.gpu.GPUArray/fft
MATLAB encountered an unexpected error in evaluation on the GPU.
Coincidentally, this error occurred at an fft operation. However, it is also the first function call in the program.
Do you think a bigger GPU will solve the problem? I have a GTX 580 as well, and it only comes with 1.5 GB. Would a 6 GB Tesla solve this, or is there something else we are missing here?
Edric, I have the latest CUDA driver, so that should not be an issue.
Thank you! Ruby
  1 Comment
Edric Ellis on 13 Feb 2012
The error message you are getting is due to CUFFT - NVIDIA's FFT library - running out of memory. Unfortunately, it sometimes reports this out-of-memory condition back to us as an "unexpected error", which we then report to you. This sort of unpredictable behaviour can sometimes be helped by the "feature" command I suggested to Rodrigo - but if you're that close to running out of memory, you may still have problems. A card with more memory would almost certainly help you.
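One defensive pattern in the meantime (a sketch, not official guidance): catch the GPU failure and retry the transform on the CPU.
x = gpuArray(rand(4096, 'single'));   % stand-in input
try
    X = fft(x);                       % fast path on the device
catch err
    warning('GPU fft failed (%s); retrying on the CPU.', err.message);
    X = gpuArray(fft(gather(x)));     % round-trip through host memory
end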



Max W.K. Law on 9 May 2013
I got the same error while trying to ifftn (complex-to-complex) a 256*256*516 complex single-precision 3D array. That is a 258 MB chunk of data, and it fails on my 4 GB GTX 680 card. Yes, if this is about running short of memory, that means 4 GB of memory couldn't accommodate a 258 MB data chunk before giving the error "MATLAB encountered an unexpected error in evaluation on the GPU."
There is other data on the GPU that may cause fragmentation. The code that produces this error is just "temp=ifftn(temp);". Please, is there any way to enforce an in-place transform?
Here is the result of the gpuDevice() command:
Name: 'GeForce GTX 680'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 5
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.147483647000000e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 4.294639616000000e+09
FreeMemory: 1.676050432000000e+09
MultiprocessorCount: 8
ClockRateKHz: 1163000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
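One workaround I am considering (a sketch only): the n-D inverse transform is separable, so running it one dimension at a time should give the same result as ifftn while letting cuFFT build smaller plans, which may need less workspace.
% Same shape as my data, for illustration
temp = gpuArray(rand(256, 256, 516, 'single') + 1i*rand(256, 256, 516, 'single'));
% Dimension-by-dimension inverse FFT; mathematically identical to
% temp = ifftn(temp) for a 3-D array, but each step uses a smaller plan.
temp = ifft(temp, [], 1);
temp = ifft(temp, [], 2);
temp = ifft(temp, [], 3);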
