Limit total GPU memory usage

I'm using the Matlab Parallel Toolbox to do work on GPUs. I need to find a way to place a limit on the total GPU memory usage of a Matlab process.
I know about feature('GpuAllocPoolSizeKb', X), and that's what I'm using for the time being, but that only limits the pool size. To cap my total (pool + non-pool) usage at a specific number, I would have to know the maximum non-pool usage of my Matlab code ahead of time in order to figure out what pool size to choose, and that maximum turns out to be very difficult to estimate beforehand.
The reason I need to limit my GPU memory usage is that I am running many simultaneous Matlab processes on a large multicore server with many GPUs. Without limiting the GPU memory usage, the Matlab processes on one GPU all compete with one another, and since Matlab uses "lazy" garbage collection, it doesn't take long before a few processes squeeze out another process and cause it to run out of memory and crash. This is completely unnecessary since most of the memory they are taking up is actually no longer in use and just represents freed allocations that have not yet been garbage collected.
My group is trying to determine whether the Parallel Toolbox will be a good purchase for other groups in our lab too, and this issue is overwhelmingly our biggest headache so far. It has caused a great deal of frustration. Hopefully some better solution is possible!
Thank you

Answers (1)

Joss Knight on 23 Jun 2017
Since several releases ago, MATLAB has released variables as soon as they go out of scope, are overwritten, are cleared (using clear or clearvars), or are set to empty. Even if you have MATLAB R2015a or earlier, you should still find that overwriting gpuArray variables or setting them to empty will aggressively release GPU memory (back to the pool, or back to the system if the pool is full). I'd be interested to see any examples you have of gpuArrays not being released when they are no longer referenced.
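As a minimal sketch, you can observe this yourself with the device's free-memory counter (exact figures will vary, and with pooling enabled the memory may go back to the pool rather than the system):

```matlab
% Sketch: demonstrate that emptying a gpuArray frees its memory promptly.
d = gpuDevice;                  % handle to the current GPU
before = d.AvailableMemory;     % free memory before allocating
A = rand(4096, 'gpuArray');     % allocate a ~128 MB array on the GPU
A = [];                         % setting to empty releases the allocation
after = d.AvailableMemory;      % should be back near 'before'
```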
There is no trivial way to restrict the GPU memory available to each MATLAB process. My best suggestion would be to write a gpuArray allocator class that keeps track of all the allocations. However, for task parallel work you may find it's just as easy to make your applications fault tolerant to GPU memory shortages. So for instance, you might catch parallel:gpu:array:OOM errors and handle them (perhaps by waiting for memory to come available).
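For instance, a minimal retry loop along those lines might look like this (A and B stand in for whatever gpuArray data the operation actually needs, and the retry count and wait time are arbitrary):

```matlab
% Sketch (assumed pattern): retry a GPU operation when memory is short.
maxRetries = 10;
for attempt = 1:maxRetries
    try
        C = A * B;            % some gpuArray operation that may run out of memory
        break;                % success, stop retrying
    catch err
        if strcmp(err.identifier, 'parallel:gpu:array:OOM')
            pause(5);         % wait for other processes to free memory
        else
            rethrow(err);     % anything else is a real error
        end
    end
end
```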
It's not usual to configure a machine to allow multiple processes to share the same GPU, which is why I don't have a better answer for you. The 'normal' approach would be to use the NVIDIA tools to restrict GPU access to a single process per node, and divide your cluster or your tasks appropriately; or you can put the GPU in exclusive mode and force other MATLABs to wait for the GPU to come available if they need it.
The point is that not only does CUDA provide no means of partitioning memory on a per-process basis, it also has a single-process execution model: unless you're using NVIDIA's Multi-Process Service (MPS), all GPU kernels from different processes are launched sequentially. Even if you are using MPS, the GPU is not really efficient unless it is being fully utilised, so in a typical scenario work from each process will always be serialised. This isn't really anything to do with MATLAB; there just isn't anything more than very crude system tooling to help with this. Although when a single client MATLAB is in control of a whole node there are some places we could do better at dividing up work - suggestions welcome!
The word as I understand it from NVIDIA is that they don't have any kind of virtualization technology for compute similar to what they have for graphics. I'm seeing this kind of thing more often though, so maybe they will start to think about how to do better in that regard.

4 Comments

Jesse Ziser on 3 Jul 2017
Edited: Jesse Ziser on 3 Jul 2017
No, I have no examples of gpuArrays not being released when they are no longer being referenced. That is not the problem. The problem is that each Matlab process has its own pool, and the pools grow with each allocation and do not shrink until GPU memory is exhausted. This means that the pools for some processes will tend to crowd out other processes over time and eventually cause them to die, even though most of the memory in those pools is not actively in use.
I don't think a gpuArray allocator class would be helpful, as I need to know the maximum simultaneous non-pool allocation amount beforehand so that I can determine how much I can safely allow for the pool. If there is a way to change the GpuAllocPoolSizeKb dynamically whenever the non-pool allocation amount changes, then that might solve my problem, but I haven't been able to find a way to do that. Strange things seem to happen if I call feature('GpuAllocPoolSizeKb', X) any time other than at the beginning of the code.
I hadn't thought of catching parallel:gpu:array:OOM. I will look into that. Maybe we could catch that and somehow signal the other processes to release some memory. Is there a way to explicitly trigger Matlab's garbage-collection facility so that the pool will shrink?
Using only one process per GPU is still a last-ditch option, but it would be less efficient as it would mean that we would be making use of a smaller subset of the available CPU cores for the work that needs to be done on the CPU side. Multithreaded computation in Matlab has been of limited utility to us so far because we are not always in a situation where there is some clear way to write things that Matlab will be able to parallelize internally, and because I/O is a significant cost. I will look into reading from multiple files simultaneously in a parfor loop. That might make this approach a little better, if it is efficient.
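A sketch of that multi-file parfor idea (the file names here are placeholders, and whether it helps will depend on the storage system's tolerance for concurrent reads):

```matlab
% Sketch: read several data files concurrently using a parfor loop.
files = {'a.dat', 'b.dat', 'c.dat'};   % placeholder file names
data = cell(size(files));
parfor k = 1:numel(files)
    fid = fopen(files{k}, 'r');
    data{k} = fread(fid, Inf, 'single');   % each worker reads one file
    fclose(fid);
end
```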
Regarding sequential execution of kernels on the GPU, that's a surprise to me as we have been seeing extremely good performance with this approach -- even competitive with custom C/CUDA code! Out-of-memory errors have been the only problem.
If you want suggestions, here are two. First off, if you want users to put one Matlab process in charge of a whole node (which IMO is not always ideal in a research environment with many users), then there would need to be some way to use more than one GPU at the same time. Currently, each Matlab process only seems to be able to use one GPU. Second, there should be some more efficient way to do work in multiple threads, preferably using a persistent thread pool and shared memory. Currently, things like parfor seem to have an unreasonably large overhead.
I tried some of these things, and unfortunately nothing worked.
  • parfor seems to have too much overhead to be useful (especially in the transfer of data back to the master thread).
  • I can't figure out any way to tell Matlab to return its pool memory back to the system so that other Matlabs can use it, so there's no way I can see to make use of catching the OOM exception.
  • Performing GPU operations from one Matlab process seems to be about 2x less efficient than launching them simultaneously from multiple Matlabs, even when using just one GPU.
  • Trying to do multiple GPU operations in parallel using parfor runs into not only the overhead problem mentioned above, but also a more severe version of the out-of-memory problem: Matlab seems to hang once memory is exhausted! It appears that the different threads in the parfor are not able to tell each other to release pool memory, so when they need memory, they all just block and wait for memory to become available, which it never does because none of the other Matlabs are willing to release it. My diagnosis could be wrong, but that's the best I can figure.
If there are no other options, we will have to continue spending some time estimating the non-pool memory usage of every job before running it so that we will know how much pool space to statically allocate. This is very disappointing. I hope Mathworks will consider implementing some kind of solution to this problem in the near future.
Again, if there is some "right way" to parallelize GPU computation in Matlab that we're missing, please let me know! I would have thought that surely there is some way to do it, but it seems that no approach currently works. You can't do it in a parfor without hanging, and you can't do it in separate processes without crashing. What to do?
Joss Knight
Okay, well, there's too much here to answer well in a MATLAB Answers thread. I suggest you contact support and we can look at doing an investigation into your problem.
I think I understand what you mean by 'garbage collection' - you mean MATLAB releasing its pooled GPU memory. Well, this happens whenever you call reset(gpuDevice), or deselect and reselect your GPU e.g. by gpuDevice([]); gpuDevice(1);. Perhaps that will help you. You should be able to safely change the pool size when you've done this. Personally, however, I'd just keep the pool size low and let the processes do more raw allocations, rather than trying to tune it dynamically.
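A minimal sketch of that release-and-reselect sequence (note that deselecting the device invalidates any gpuArray data still in the workspace, so do this only between jobs):

```matlab
% Sketch: return pooled GPU memory to the system so other processes can use it.
% Warning: this destroys any existing gpuArray variables.
gpuDevice([]);     % deselect the GPU, releasing its pooled memory
gpuDevice(1);      % reselect device 1 with a fresh pool
% reset(gpuDevice) achieves much the same effect in a single call
```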
But are you certain the pool is the fundamental problem here? If enough processes try to run a GPU operation at the same time and the data size is sufficient to make using the GPU worthwhile, then you can get out-of-memory regardless. Even writing your own CUDA, even using Unified Memory you couldn't avoid that eventuality. You can't prevent one process requesting memory before another has released it.
To say you get a speed-up running GPU operations simultaneously on multiple MATLAB processes is difficult to respond to without knowing what you're doing. A lot of the GPU functions do a significant amount of work on the CPU, so maybe the benefit is really coming from CPU parallelism.
Joss Knight on 6 Jul 2017
Edited: Joss Knight on 6 Jul 2017
I don't quite understand your last point about having one MATLAB process be in charge of a whole node. If you had multiple GPUs then you could assign each worker to a different one and they wouldn't interfere. What I was getting at, admittedly coming from a place of complete ignorance of how your environment is managed, was having two completely separate clusters sharing resources, one without GPUs and one with them, the latter having only one worker per node. A user wanting to run GPU code would have to request workers from the GPU-enabled cluster.
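As a sketch of that one-GPU-per-worker pattern (assuming the node has at least as many GPUs as pool workers), something like the following keeps the workers from competing for device memory:

```matlab
% Sketch: give each worker in the pool its own GPU so they don't interfere.
parpool('local', gpuDeviceCount);  % one worker per available GPU
spmd
    gpuDevice(labindex);           % worker k selects GPU k
    x = rand(1e7, 1, 'gpuArray');  % per-worker GPU work, no contention
    s = gather(sum(x));
end
```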
Your parfor issue doesn't sound surprising when your workers are all sharing a GPU. The fundamental answer to your question of the "right way" to parallelize GPU computation in MATLAB is, at the moment, to author highly vectorized data parallel MATLAB code that ensures the GPU is continuously occupied and fully utilized from the host MATLAB. Doing this from multiple processes isn't something that any environment, MATLAB or otherwise, supports well. I'm hoping that NVIDIA will eventually provide virtualization environments for compute similar to what they do for graphics, restricting each process's access to GPU resources. This is something we are actively interested in ourselves so I hope we'll have better answers in the future.
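To illustrate the kind of code that approach favours, a small sketch: one fused elementwise expression over a large gpuArray, rather than a loop over elements, which would serialize and starve the GPU.

```matlab
% Sketch: vectorized, data-parallel code that keeps a single GPU fully occupied.
x = rand(1e7, 1, 'gpuArray');           % 10 million elements on the device
y = sin(x).^2 + cos(x).^2;              % one fused elementwise expression
checksum = gather(max(abs(y - 1)));     % identity check; ~0 up to rounding
```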


Asked: 23 Jun 2017
Edited: 6 Jul 2017
