Bug 5722

Summary: dynamic adjustment of MaxStepCount
Product: Slurm
Reporter: S Senator <sts>
Component: slurmstepd
Assignee: Tim Wickberg <tim>
Status: OPEN
Severity: 5 - Enhancement
Version: 17.11.8
Hardware: Cray XC
OS: Linux
Site: LANL
Machine Name: trinitite
CLE Version: UP05

Description S Senator 2018-09-12 13:21:08 MDT
A small number of our users have a nightly regression verification workload which consists of a large number (>40k) of job steps. At present, we are reserving nodes for this workload and adjusting the on-node slurm.conf to increase MaxStepCount to 80k, which accommodates this particular workload.
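
For concreteness, the per-node override amounts to a single slurm.conf line along these lines (80k is the value we settled on; the rest of the configuration is omitted here, and I believe the default is 40000, which lines up with where the tests start failing):

    # slurm.conf on the reserved nodes (illustrative; default MaxStepCount is 40000)
    MaxStepCount=80000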

Would it be possible to have the MaxStepCount adjusted dynamically, similar to the other dynamic list allocators already within slurm, so that we do not have to force a difference onto these particular nodes, for this particular workload?

We recognize that this is likely to be reclassified as an enhancement. If so, please provide scoping for this task under the LANL/SchedMD contract.
Comment 1 Tim Wickberg 2018-09-12 15:23:49 MDT
(In reply to S Senator from comment #0)
> A small number of our users have a nightly regression verification workload
> which consists of a large number (>40k) of job steps. At present, we are
> reserving nodes for this workload and adjusting the on-node slurm.conf to
> increase MaxStepCount to 80k, which accommodates this particular workload.
> 
> Would it be possible to have the MaxStepCount adjusted dynamically, similar
> to the other dynamic list allocators already within slurm, so that we do not
> have to force a difference onto these particular nodes, for this particular
> workload?
> 
> We recognize that this is likely to be reclassified as an enhancement. If
> so, please provide scoping for this task under the LANL/SchedMD contract.

Can you explain why you can't just set this to a higher value permanently?

The additional resource demands from extra job steps should be fairly trivial in the grand scheme of things. Step handling doesn't compete with the primary job scheduling mechanics within slurmctld, and while some additional records are generated, they're much smaller than the records for jobs as a whole. I don't see a reason why a higher overall limit shouldn't simply be applied to the whole system. I certainly would not encourage you to change this dynamically with the current code base.

Also, the procedure you've outlined for adjusting this doesn't make sense as described. This limit is only checked and enforced by slurmctld, not by the slurmd on a given compute node, so varying it by altering slurm.conf on the compute nodes shouldn't have any effect.
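
To be clear about what I'd suggest instead, the change belongs on the controller side, roughly along these lines (illustrative value; whether a reconfigure is sufficient or a slurmctld restart is needed may depend on the release):

    # slurm.conf on the host running slurmctld (example value)
    MaxStepCount=80000

    # pick up the change (or restart slurmctld)
    scontrol reconfigure

    # verify what the controller is actually enforcing
    scontrol show config | grep -i MaxStepCount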

While giving you some sort of dynamic control over this is obviously an enhancement, I'm not currently inclined to spec this out as I don't think this is of value to Slurm as a whole.
Comment 2 S Senator 2018-09-12 15:34:48 MDT
We certainly could just increase this limit if the impact of doing so is low. Thank you for confirming that.

The regression tests would fail after exceeding 40k job steps on those nodes without this limit increase. They are no longer reporting any error after increasing this limit on those specific nodes and restarting the local slurmd. This difference in behavior is the basis for my question.