Cancel parfeval causes workers to throw errors

I'm writing a GUI in App Designer to run Simulink simulations in parallel, using parfeval because I need the ability to cancel in-progress tasks. The app has an Abort button that sets a flag (app.abortRequested) when pressed, which the fetchNext loop checks. The function myFunc runs the Simulink models and also does a lot of extra setup and processing. Here's some abbreviated code:
% get or create pool
pool = gcp;

% schedule tasks
futures(1:numIterations) = parallel.FevalFuture;
for p = 1:numIterations
    futures(p) = parfeval(@myFunc, 1);  % request one output per task
end

% collect results
workerTasks = cell(1, numIterations);
numCompleted = 0;
timeout = 180; % 3 minute timeout
while numCompleted < numIterations
    [completedIdx, results] = fetchNext(futures, timeout);
    numCompleted = numCompleted + 1;  % count the attempt even on timeout
    % collect results from tasks that didn't time out
    if ~isempty(completedIdx)
        workerTasks{completedIdx} = results;
    end
    % check for abort
    if app.abortRequested
        break;
    end
end

% check for abort
if app.abortRequested
    cancel(pool.FevalQueue.QueuedFutures);
    cancel(pool.FevalQueue.RunningFutures);
end
This runs fine, and the abort feature works the first time you use it.
Then if you immediately run the code again, several of the tasks come back Failed with the error message "Dot indexing is not supported for variables of this type." If you close down the pool and re-run the program with fresh workers, the problem disappears.
It seems like cancel doesn't fully clear the task state on the workers, and something left over interferes with new tasks. Has anyone run into this? Am I missing a step when canceling the futures?
Thanks in advance.

Answers (1)

My guess is that whatever myFunc is doing is not safe against being interrupted with CTRL-C. When you cancel a parallel.Future, the execution of myFunc on the worker is (effectively) interrupted as if you had typed CTRL-C.
From the code you've posted, it's not completely clear what's going wrong, but my main suggestion would be: use onCleanup inside myFunc to make sure your code is "interrupt safe".
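As a sketch of what that can look like (the names here are illustrative, not the asker's actual code), register the cleanup at the top of myFunc so it runs whether the function returns normally, errors, or is interrupted by a cancel:

    function out = myFunc()
        % undoAnySetup is a hypothetical routine that reverses whatever
        % setup myFunc performs (closing models, deleting temp files, ...).
        % The cleanup fires when cleanupObj is destroyed -- on normal
        % return, on error, or when the worker is interrupted by cancel.
        cleanupObj = onCleanup(@() undoAnySetup());
        % ... setup, run simulations, post-process ...
        out = [];  % placeholder for the real results
    end

The key point is that onCleanup ties the cleanup to object destruction rather than to reaching a particular line of code, which is what makes it robust against a CTRL-C-style interruption.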
A secondary point: your cancel could be refined slightly to this:

    % check for abort
    if app.abortRequested
        cancel(futures);
    end

This ensures you cancel only the executions of myFunc, and nothing else that might be running on the pool.

4 Comments

Thanks for the reply. I apologize I can't be more detailed, but I'll describe as much as I can.
myFunc takes a list of Simulink models, and a table of simulation tasks to run. The table contains rows for different input parameter values, so one worker can receive multiple simulation sets.
The worker loads all models in fast restart mode. Then it does some initialization to create appropriate input values, which get stored in the appropriate model workspace. The upstream models are executed first, and their outputs become the inputs of the last model in the list. This process is repeated for each entry in the task table.
After all tasks are complete, fast restart is disabled and all models are closed, and simulation results are returned.
My first thought was that being interrupted leaves models open in fast restart. So I added onCleanup to close any open models:
    function workerCleanup(modelList)
        % close any models from modelList that are still loaded
        loadedModels = find_system('SearchDepth', 0);
        names = modelList(matches(modelList, loadedModels));
        for j = 1:numel(names)
            set_param(names(j), 'FastRestart', 'off');
            close_system(names(j), 0, 'SaveModelWorkspace', 'false');
        end
    end
However this didn't seem to help. I'm not sure what qualifies as "interrupt safe" code.
Thanks for your help.
By "interrupt safe" - I mean simply that if you imagine the code being CTRL-Cd at any point, it does not leave behind any state that would interfere with it running a second time. Closing your Simulink models using onCleanup is one piece of this, but perhaps there are other things getting left behind. The usual culprits are things like persistent or even global state somewhere.
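A minimal illustration of the persistent-state problem (myHelper and expensiveSetup are hypothetical names, not from your code):

    function out = myHelper()
        persistent initialized
        if isempty(initialized)
            expensiveSetup();    % runs once per worker...
            initialized = true;  % ...and survives an interrupt,
        end                      % possibly in a half-finished state
        out = doWork();
    end

If you suspect leftover state like this, you can try resetting the workers without restarting the pool, e.g. parfevalOnAll(@clear, 0, 'all'), which clears the worker workspaces including persistent variables.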
The other approach you could take is trying to diagnose in more detail what precisely is causing the error you're seeing. Look at the stack, maybe try adding fprintf statements (these show up in the future.Diary field). Unfortunately it's not currently possible to run the MATLAB debugger on workers, but you can usually get a reasonable idea of what's going wrong using fprintf or similar.
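For example, something along these lines after an aborted run (a sketch):

    for k = 1:numel(futures)
        f = futures(k);
        if ~isempty(f.Error)
            disp(getReport(f.Error))  % full error report from the worker
        end
        disp(f.Diary)                 % fprintf output captured on the worker
    end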
Thanks for the replies. I added some print statements and the workers that fail, fail at the sim command. I verified that all of the inputs to sim are correct, the model's workspace variables are set properly, and the Simulink models are not still open from the previously-aborted run. Not sure what else to check at this point.
To speed up execution, I call start_simulink on each worker before it receives its task list. I can't seem to find a stop_simulink command, though. Is there a way to close Simulink on the workers, so I can get a fresh session?
If all else fails, I guess I'll just have to shut down and re-open the parallel pool whenever a run is Aborted.
I don't think there is an equivalent of stop_simulink - although there is bdclose('all') which closes all models - that's my usual "big hammer" for this sort of situation.
I'm not sure what else to suggest. It might be worth contacting MathWorks Support as they might be able to help you go through more debugging steps.
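To run that on every worker in the pool, parfevalOnAll is convenient, e.g.:

    % ask every worker to close all of its Simulink models
    f = parfevalOnAll(@() bdclose('all'), 0);  % 0 = no outputs requested
    wait(f);                                   % block until all workers finish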



Release: R2023b
Asked: 12 Jun 2024
Commented: 18 Jun 2024
