
Strange core usage when running Slurm jobs

I'm trying to run jobs on an HPC cluster using Slurm, but I run into problems both when I'm running interactive jobs and when I'm submitting batch jobs.
  1. When I run an interactive job and book a single node, I can use all of the node's 20 cores. But when I book more than one node for an interactive job, the cores on the extra nodes are left unused.
  2. When I run a batch job, it uses only one core per node.
Do you have any idea what I might be doing wrong?
1. I book my interactive job from the command prompt using the following commands:
interactive -A myAccountName -p devel -n 40 -t 0:30:00
module load matlab/R2023a
matlab
to submit a 30-minute, 40-core job to the "devel" partition using my account (not actually called "myAccountName"), load the MATLAB module, and launch MATLAB as an X application. Once in MATLAB, I first choose the "Processes" parallel profile and then run the "Setup" and "Interactive" sections of the silly little script at the bottom of this question. In two separate terminal sessions, I then use
ssh MYNODEID
htop
where MYNODEID is either of the two nodes assigned to the interactive job. Then I see that the job uses all of the cores on one of the nodes and none of the cores on the second node.
2. To submit my batch job, I load and launch MATLAB from the command prompt using the following commands
module load matlab/R2023a
matlab
and then run the "Setup" and "Batch" sections in the silly little script at the bottom of this question. Using the same procedure as above, htop lets me see that the job uses two cores (one on each node) and leaves the remaining 38 cores (19 on each node) unused.
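A quick way to double-check which nodes actually host pool workers (a rough sketch, assuming a parallel pool is already open) is to ask each worker for its hostname from within MATLAB:
% Rough sketch, assuming a parallel pool is already running:
pool = gcp('nocreate');                  % handle to the current pool
hosts = cell(1, pool.NumWorkers);
parfor k = 1:pool.NumWorkers
    [~, hosts{k}] = system('hostname');  % node that runs worker k
end
disp(unique(strtrim(hosts)))             % a "Processes" pool lists only the local node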
Silly little script
%% Setup
clear;
close all;
clc;
N = 1000; % Length of fmincon vector
%% Interactive
x = solveMe(randn(1, N));
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
Cluster.batch( ...
    @solveMe, ...
    0, ...
    {}, ...
    'pool', 39 ...
    ); % Submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName")
%% Helper functions
function A = slowDown()
    A = randn(5e3);
    A = A + randn(5e3);
end
function x = solveMe(x0)
    opts = optimoptions( ...
        "fmincon", ...
        "MaxFunctionEvaluations", 1e6, ...
        "UseParallel", true ...
        );
    x = fmincon( ...
        @(x) 0, ...
        x0, ...
        [], [], ...
        [], [], ...
        [], [], ...
        @(x) nonlinearConstraints(x), ...
        opts ...
        );
    function [c, ceq] = nonlinearConstraints(x)
        c = [];
        A = slowDown();
        ceq = 1 ./ (1:numel(x)) - cumsum(x);
    end
end

Accepted Answer

Damian Pietrus on 19 Mar 2024
Based on your code, it looks like you have correctly configured a cluster profile to submit a job to MATLAB Parallel Server. In this case, your MATLAB client will always submit a secondary job to the scheduler. It is in this secondary job that you should request the bulk of your resources. As an example, on the cluster login node you should only ask for a few cores (enough to run your MATLAB serial code), as well as a longer WallTime:
% Two cores, 1 hour WallTime
interactive -A myAccountName -p devel -n 2 -t 1:00:00
module load matlab/R2023a
matlab
Next, you should continue to use the AdditionalProperties fields to shape your "inner" job:
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
When you call the MATLAB batch command, that is where you request the total number of cores that you would like your parallel code to run on:
myJob40 = Cluster.batch(@solveMe, 0, {},'pool', 39);
myJob100 = Cluster.batch(@solveMe, 0, {},'pool', 99);
Notice that since this submits a completely separate job to the scheduler queue, you can choose a pool size larger than you requested in your 'interactive' CLI command. Also notice that the Cluster.AdditionalProperties WallTime value is shorter than the 'interactive' value. This is to account for the time that the inner job may wait in the queue.
Long story short: when you call batch or parpool within a MATLAB session that has a Parallel Server cluster profile set up, it will submit a secondary job to the scheduler that can have its own separate resources. You can verify this by manually viewing the scheduler's job queue.
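For example, a minimal check from the MATLAB client on the login node (just a sketch; myJob40 is the job handle from the example above) could be:
% List your jobs in the Slurm queue from within MATLAB; the "inner" job shows up as its own entry.
system('squeue -u $USER');
% Wait for the inner job to finish, then display its command-window output.
wait(myJob40);
diary(myJob40);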
Please let me know if you have any further questions!
  4 Comments
Fredrik P on 21 Mar 2024
Alright. I didn't know that "Processes" could only handle a single machine. Good to know.
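In other words (as I understand it; the pool sizes below are just examples):
% Workers from the "Processes" profile stay on the node where the MATLAB client runs:
pLocal = parpool('Processes', 20);
delete(pLocal);
% Workers from the cluster profile run inside a separate Slurm job and can span nodes:
pCluster = parpool('rackham R2023a', 39);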
I'll send you a private message as well.
Here are the two files that you requested.
communicatingSubmitFcn.m
function communicatingSubmitFcn(cluster, job, environmentProperties)
%COMMUNICATINGSUBMITFCN Submit a communicating MATLAB job to a Slurm cluster
%
% Set your cluster's IntegrationScriptsLocation to the parent folder of this
% function to run it when you submit a communicating job.
%
% See also parallel.cluster.generic.communicatingDecodeFcn.
% Copyright 2010-2018 The MathWorks, Inc.
% Get the MATLAB version being used
if verLessThan('matlab', '9.6')
    before19A = 'true';
else
    before19A = 'false';
end
% Store the current filename for the errors, warnings and dctSchedulerMessages
currFilename = mfilename;
if ~isa(cluster, 'parallel.Cluster')
    error('parallelexamples:GenericSLURM:NotClusterObject', ...
        'The function %s is for use with clusters created using the parcluster command.', currFilename)
end
decodeFunction = 'parallel.cluster.generic.communicatingDecodeFcn';
if ~cluster.HasSharedFilesystem
    error('parallelexamples:GenericSLURM:NotSharedFileSystem', ...
        'The function %s is for use with shared filesystems.', currFilename)
end
if ~strcmpi(cluster.OperatingSystem, 'unix')
    error('parallelexamples:GenericSLURM:UnsupportedOS', ...
        'The function %s only supports clusters with unix OS.', currFilename)
end
enableDebug = 'false';
if isprop(cluster.AdditionalProperties, 'EnableDebug') ...
        && islogical(cluster.AdditionalProperties.EnableDebug) ...
        && cluster.AdditionalProperties.EnableDebug
    enableDebug = 'true';
end
% The job specific environment variables
% Remove leading and trailing whitespace from the MATLAB arguments
matlabArguments = strtrim(environmentProperties.MatlabArguments);
variables = {'MDCE_DECODE_FUNCTION', decodeFunction; ...
    'MDCE_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ...
    'MDCE_JOB_LOCATION', environmentProperties.JobLocation; ...
    'MDCE_MATLAB_EXE', environmentProperties.MatlabExecutable; ...
    'MDCE_MATLAB_ARGS', matlabArguments; ...
    'PARALLEL_SERVER_DEBUG', enableDebug; ...
    'MDCE_BEFORE19A', before19A; ...
    'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ...
    'MLM_WEB_USER_CRED', environmentProperties.UserToken; ...
    'MLM_WEB_ID', environmentProperties.LicenseWebID; ...
    'MDCE_LICENSE_NUMBER', environmentProperties.LicenseNumber; ...
    'MDCE_STORAGE_LOCATION', environmentProperties.StorageLocation; ...
    'MDCE_CMR', cluster.ClusterMatlabRoot; ...
    'MDCE_TOTAL_TASKS', num2str(environmentProperties.NumberOfTasks); ...
    'MDCE_NUM_THREADS', num2str(cluster.NumThreads)};
% Set each environment variable to newValue if currentValue differs.
% We must do this particularly when newValue is an empty value,
% to be sure that we clear out old values from the environment.
for ii = 1:size(variables, 1)
    variableName = variables{ii,1};
    currentValue = getenv(variableName);
    newValue = variables{ii,2};
    if ~strcmp(currentValue, newValue)
        setenv(variableName, newValue);
    end
end
% Deduce the correct quote to use based on the OS of the current machine
if ispc
    quote = '"';
else
    quote = '''';
end
% Specify the job wrapper script to use.
if isprop(cluster.AdditionalProperties, 'UseSmpd') && cluster.AdditionalProperties.UseSmpd
    scriptName = 'communicatingJobWrapperSmpd.sh';
else
    scriptName = 'communicatingJobWrapper.sh';
end
% The wrapper script is in the same directory as this file
dirpart = fileparts(mfilename('fullpath'));
quotedScriptName = sprintf('%s%s%s', quote, fullfile(dirpart, scriptName), quote);
% Choose a file for the output. Please note that currently, JobStorageLocation refers
% to a directory on disk, but this may change in the future.
logFile = cluster.getLogLocation(job);
quotedLogFile = sprintf('%s%s%s', quote, logFile, quote);
jobName = sprintf('Job%d', job.ID);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% CUSTOMIZATION MAY BE REQUIRED %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% You might want to customize this section to match your cluster,
% for example to limit the number of nodes for a single job.
additionalSubmitArgs = sprintf('--ntasks=%d --cpus-per-task=%d', environmentProperties.NumberOfTasks, cluster.NumThreads);
commonSubmitArgs = getCommonSubmitArgs(cluster, environmentProperties.NumberOfTasks);
if ~isempty(commonSubmitArgs) && ischar(commonSubmitArgs)
    additionalSubmitArgs = strtrim([additionalSubmitArgs, ' ', commonSubmitArgs]) %#ok<NOPRT>
end
dctSchedulerMessage(5, '%s: Generating command for task %i', currFilename, ii);
commandToRun = getSubmitString(jobName, quotedLogFile, quotedScriptName, ...
    additionalSubmitArgs);
% Now ask the cluster to run the submission command
dctSchedulerMessage(4, '%s: Submitting job using command:\n\t%s', currFilename, commandToRun);
try
    % Make the shelled out call to run the command.
    [cmdFailed, cmdOut] = system(commandToRun);
catch err
    cmdFailed = true;
    cmdOut = err.message;
end
if cmdFailed
    error('parallelexamples:GenericSLURM:SubmissionFailed', ...
        'Submit failed with the following message:\n%s', cmdOut);
end
dctSchedulerMessage(1, '%s: Job output will be written to: %s\nSubmission output: %s\n', currFilename, logFile, cmdOut);
jobIDs = extractJobId(cmdOut);
% jobIDs must be a cell array
if isempty(jobIDs)
    warning('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ...
        'Failed to parse the job identifier from the submission output: "%s"', ...
        cmdOut);
end
if ~iscell(jobIDs)
    jobIDs = {jobIDs};
end
% set the job ID on the job cluster data
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));
communicatingJobWrapper.sh
#!/bin/sh
# This wrapper script is intended to be submitted to Slurm to support
# communicating jobs.
#
# This script uses the following environment variables set by the submit MATLAB code:
# MDCE_CMR - the value of ClusterMatlabRoot (may be empty)
# MDCE_MATLAB_EXE - the MATLAB executable to use
# MDCE_MATLAB_ARGS - the MATLAB args to use
# PARALLEL_SERVER_DEBUG - used to debug problems on the cluster
# MDCE_BEFORE19A - the MATLAB version number being used
#
# The following environment variables are forwarded through mpiexec:
# MDCE_DECODE_FUNCTION - the decode function to use
# MDCE_STORAGE_LOCATION - used by decode function
# MDCE_STORAGE_CONSTRUCTOR - used by decode function
# MDCE_JOB_LOCATION - used by decode function
#
# The following environment variables are set by Slurm:
# SLURM_NODELIST - list of hostnames allocated to this Slurm job
# Copyright 2015-2018 The MathWorks, Inc.
# Echo the nodes that the scheduler has allocated to this job:
echo The scheduler has allocated the following nodes to this job: ${SLURM_NODELIST:?"Node list undefined"}
if [ "${MDCE_BEFORE19A}" == "true" ]; then
module load intelmpi/17.2
FULL_MPIEXEC=mpiexec.hydra
# Override default bootstrap
# Options are: ssh, rsh, slurm, lsf, and sge
export I_MPI_HYDRA_BOOTSTRAP=slurm
# Ensure that mpiexec is not using the Slurm PMI library
# I_MPI_PMI_LIBRARY must not be defined
unset I_MPI_PMI_LIBRARY
else
# Create full path to mw_mpiexec if needed.
FULL_MPIEXEC=${MDCE_CMR:+${MDCE_CMR}/bin/}mw_mpiexec
fi
export TZ="Europe/Stockholm"
# Label stdout/stderr with the rank of the process
MPI_VERBOSE=-l
# Increase the verbosity of mpiexec if PARALLEL_SERVER_DEBUG or MDCE_DEBUG (for backwards compatibility) is true
if [ "X${PARALLEL_SERVER_DEBUG}X" = "XtrueX" ] || [ "X${MDCE_DEBUG}X" = "XtrueX" ]; then
MPI_VERBOSE="${MPI_VERBOSE} -v -print-all-exitcodes"
fi
# Construct the command to run.
CMD="\"${FULL_MPIEXEC}\" ${MPI_VERBOSE} -n ${MDCE_TOTAL_TASKS} \"${MDCE_MATLAB_EXE}\" ${MDCE_MATLAB_ARGS}"
# Echo the command so that it is shown in the output log.
echo $CMD
# Execute the command.
eval $CMD
MPIEXEC_EXIT_CODE=${?}
if [ ${MPIEXEC_EXIT_CODE} -eq 42 ] ; then
    # Get here if user code errored out within MATLAB. Overwrite this to zero in
    # this case.
    echo "Overwriting MPIEXEC exit code from 42 to zero (42 indicates a user-code failure)"
    MPIEXEC_EXIT_CODE=0
fi
echo "Exiting with code: ${MPIEXEC_EXIT_CODE}"
exit ${MPIEXEC_EXIT_CODE}
Damian Pietrus on 21 Mar 2024
Thanks for including that. It looks like your integration scripts are from around 2018. Since they are a bit out of date, they don't include some changes that will hopefully fix the core-binding issue you're experiencing. I'll reach out to you directly, but for anyone else who finds this post in the future, you can get an updated set of integration scripts here:


