Background Data Dispatch with Custom Training Loop

I have a question about training a deep neural network in MATLAB.
I have built a custom training loop to train a regression network on a machine with 2 GPUs.
The loop works fine, but it is considerably slower than the automatic trainNetwork function.
I use a custom training loop because trainNetwork does not provide the kind of training progress monitor I like, and on my machine it also errors unpredictably, sometimes leaving networks in an unfinished state.
I use a parallel pool with 2 workers and a randomPatchExtractionDatastore (which is partitionable). The parallel operations are written in an spmd block.
What would be the best way to dispatch data in the background within a custom training loop?
I have tried scaling up the number of workers in the parallel pool, but then some workers cannot read data, since the datastores are partitioned according to the number of GPUs rather than the number of workers.
Which operations do I have to assign to the workers that are supposed to preload data?
Has anybody implemented "self-written" background data dispatch in a custom training loop?
Thanks in advance!

 Accepted Answer

Use a minibatchqueue with the DispatchInBackground option.
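For a single-process training loop, that can look like the following sketch (the datastore name, minibatch size, and preprocessing function are illustrative, not from the original post):

```matlab
% Sketch: ds and preprocessMiniBatch are placeholders for your own
% datastore and preprocessing function.
mbq = minibatchqueue(ds, ...
    'MiniBatchSize', 64, ...
    'MiniBatchFcn', @preprocessMiniBatch, ...
    'DispatchInBackground', true, ...
    'OutputEnvironment', 'gpu');

while hasdata(mbq)
    [X, Y] = next(mbq);
    % Forward/backward pass here, e.g. via dlfeval(@modelGradients, net, X, Y)
end
```

With DispatchInBackground enabled, pool workers fetch and preprocess the next minibatches while the client process trains on the current one.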

4 Comments

Hi Joss, thanks for the answer.
A minibatchqueue only partly solves my issue.
I can build an mbq with the "DispatchInBackground" option, but I still have to assign specific workers of the spmd block to the dispatching operation. With 2 GPUs I can use a parpool with 4 workers: two workers should handle the training computation and two (or more) workers should handle the dispatch. With the current settings I (obviously) have all 4 workers doing training computation, even with a minibatchqueue and "DispatchInBackground" enabled. My question is how to build an spmd block like this:
spmd
    if labindex == 1
        [X,Y] = next(mbq);
        % training computation
    elseif labindex == 2
        [X,Y] = next(mbq);
        % training computation
    elseif labindex == 3
        % dispatch operation for worker 1
    elseif labindex == 4
        % dispatch operation for worker 2
    end
end
How do I sequence the workers so that workers 3 and 4 run first and workers 1 and 2 run second? How do I handle data consistency so that data is already available on workers 1 and 2 in the first iteration? And how do I find out which workers are tied to which GPUs?
Thanks in advance!
Hi, the simple answer is that you can't combine DispatchInBackground with spmd in a custom training loop. Only trainNetwork supports both parallel training and background dispatch, and it does so using some clever machinery involving MPI communicators.
In a future release this will become possible using a thread pool nested inside a process pool, but not yet.
You can do it right now using some fairly complex point-to-point communication. It's not pretty:
spmd
    % Workers 1 and 2 are background (dispatch) workers; workers
    % 3 and 4 are compute workers.
    % Partition the datastore into 2 parts. Read the first batch
    % and send it to the compute workers.
    if labindex < 3
        subds = partition(ds, 2, labindex);
        data = read(subds); % Add some batching logic here
        labSend(data, labindex + 2);
    else
        data = labReceive(labindex - 2);
    end
    loop = true;
    while loop
        % Background workers read the next batch while the compute
        % workers process the current one.
        if labindex < 3
            loop = hasdata(subds);
            if loop
                data = read(subds); % Again, might need batching logic
            end
            % Send unconditionally so the matching labReceive on the
            % compute worker never blocks. (On the last iteration this
            % resends the previous batch, but the loop exits before it
            % is used.)
            labSend(data, labindex + 2);
        else
            % Compute gradients on the current batch, then exchange
            % gradients between the two compute workers.
            otherComputeWorker = mod(labindex - 2, 2) + 3;
            theirGradients = labSendReceive(otherComputeWorker, ...
                otherComputeWorker, myGradients);
            % Combine the two sets of gradients (probably by adding
            % them together) and update the model.
            % Then receive the next batch.
            data = labReceive(labindex - 2);
        end
        % Stop when either partition is exhausted.
        loop = gop(@and, loop);
    end
end
You can do a version of this using gop or gplus to sum the gradients, as in the examples, but you need to make sure the background workers also participate: every worker has to call gop. One advantage of doing that is that you get fast peer-to-peer transfer of gpuArray data; at the moment labSendReceive doesn't use fast transfer. Alternatively, you could implement labSendReceive as a call to labSend followed by labReceive on worker 3, and the opposite order on the other compute worker. That will use fast GPU-to-GPU communication, but loses the asynchronicity.
I haven't actually checked that this works, so there may be issues, but I'm sure you can debug them.
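For illustration, the gplus variant mentioned above might look like the following sketch. It is untested like the code above, assumes the gradients are a single numeric array, and relies on scalar expansion so the background workers can contribute a plain 0:

```matlab
% Sketch: collective gradient summation in which every worker in the
% spmd block participates (required, or the collective call hangs).
if labindex < 3
    contribution = 0;            % background workers contribute nothing;
                                 % scalar 0 expands against the array
else
    contribution = myGradients;  % gradients on the compute workers
end
summedGradients = gplus(contribution);  % every worker must make this call
```

If the gradients are stored as a table or struct of dlarrays, as is common in custom training loops, you would need to loop over the entries and call gplus on each one.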
This example definitely helps and should solve my issue.
I haven't been able to make it run yet, but I can see the idea behind the scheme: I have to tell each specific worker when and what to send to the other workers in order to set up the communication between them. At the moment I run into deadlocks where (I assume) one worker tries to receive data that has not been sent yet. This probably comes from using labSend and labReceive instead of labSendReceive to make use of the NVLink communication between the GPUs.
Thanks again for the help!
Great! labSend is blocking, so you can't have both workers 3 and 4 call labSend at the same time. You need to choose which one goes first.
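One conventional way to break that kind of deadlock is to pair the point-to-point calls in opposite orders on the two workers, for example (a sketch):

```matlab
% Sketch: worker 3 sends first and then receives; worker 4 does the
% reverse, so neither worker blocks waiting for the other.
if labindex == 3
    labSend(myGradients, 4);
    theirGradients = labReceive(4);
elseif labindex == 4
    theirGradients = labReceive(3);
    labSend(myGradients, 3);
end
```

labSendReceive does this pairing for you internally, which is why it cannot deadlock in the same way.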
