Can't use as many cores as available
Apologies first because I probably don't know enough about this to adequately describe the problem, but please ask questions (and maybe help me figure out how to answer them).
I'm using my company's computing resources to run MATLAB scripts with parallel computing. There's some setup, but, as best I understand, once I'm in MATLAB, I'm essentially remoted into a computer with a lot of cores, though those are a shared resource. To run a script, I start with parpool('local',<number of cores>). This only works if I request 64 or fewer cores.
Prior to this, I set up the cluster by validating it with 5 cores and then resetting the number of workers to 512, which is the maximum we're allowed. Before setting up the parpool, I have checked the number of cores available with feature('numCores') to ensure I'm not requesting more than available and/or checked the number of idle cores by running cee-lan-status -c in the terminal (I assume this is a standard command, but I don't know bash).
When I request more than 64 cores, I always get this error:
Error using parpool (line 149)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 678)
Failed to initialize the interactive session.
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job failed with no message.
What else can I try?
Raymond Norris on 22 Mar 2021
I believe what you're saying is that from your desktop machine you connect to some server. From there, you run MATLAB on a machine that has 512 cores.
You validate your local profile by changing the worker count to 5, then set it back to 512. Note: I'm not sure which version of MATLAB you're running, but there's a field in the validation dialog for the number of workers to use, so you don't have to toggle this.
You then run the following:
nc = feature('numcores');
p = parpool('local',nc);
The caveat is that you can't be sure that you actually have access to that number of cores, nc. That's just what MATLAB sees. Are there other applications/users running on the same machine?
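As a rough sketch of that caveat (not a fix for the 64-worker failure itself), you could at least compare what MATLAB detects against what the profile allows before requesting a pool. The variable name nRequested is only illustrative, and nothing here tells you how many cores other users are already occupying on a shared machine; that still has to come from your site's own tooling.

```matlab
% Sketch only: cap the request at the smaller of the detected core count
% and the profile's worker limit. This does NOT account for other users.
nc = feature('numcores');            % cores MATLAB detects, not cores that are free
c  = parcluster('local');            % the same profile used in this thread
nRequested = min(nc, c.NumWorkers);  % never ask for more than the profile allows
fprintf('Detected %d cores, profile allows %d workers, requesting %d.\n', ...
    nc, c.NumWorkers, nRequested);
```

nRequested can then be passed to parpool in place of nc. If the two numbers differ a lot, that is worth knowing before blaming the pool startup.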
I've heard of issues crossing 64 local workers, but I think that was more on Windows and not Linux (which I'm guessing is what you're running on?). To capture the error, try the following:
c = parcluster('local');
p = c.parpool(nc);
% Parpool errors out. Look at log file.
% After you look at the error, delete the job