Ticket 10489 - Transition plan proposals for breaking change in 20.11 regarding --exclusive/--whole
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.11.1
Hardware: Linux
Severity: 2 - High Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-12-18 11:17 MST by Luke Yeager
Modified: 2021-01-26 10:36 MST
CC List: 9 users

See Also:
Site: NVIDIA (PSLA)


Description Luke Yeager 2020-12-18 11:17:25 MST
Hi SchedMD,

At our site, we are grappling with the breaking changes in 20.11 regarding the --exclusive and --whole flags for srun (existing discussion at bug#10383), and trying to figure out a transition plan for our users. Some of our supported applications don't use OpenMPI, so patching all MPIs on the cluster is not only infeasible (as pointed out by others) but also insufficient.

We need a transition plan for two groups of users:

* GroupA: users who can opt in to the new 20.11 behavior right away and who don't need to support older Slurms

* GroupB: users who need their code to work on 20.11 and also on older Slurms, at least for a while


# OptionA

Set SLURM_WHOLE=1 in /etc/environment so that all users get an implicit --whole flag added to srun, which mostly restores the 20.02 behavior. Leave this in place for a period of time, and then remove it.
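Concretely, the workaround is a one-line addition (a sketch; the exact mechanism for propagating the variable to user sessions is up to us):

# /etc/environment
SLURM_WHOLE=1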

Most user scripts will just keep working on our cluster, but users will not know that their scripts rely on deprecated behavior unless we tell them. We will have to start a communications campaign to try to get enough users transitioned to the new 20.11 behavior before we remove this global envvar and inevitably break the scripts of any users we haven't managed to reach yet.

## OptionA, Phase1

GroupA will need to add 'unset SLURM_WHOLE' at the top of their user scripts. GroupB will need to add 'export SLURM_WHOLE=1' to the top of their scripts, since other clusters might not set it for them, and since their script will break even on our cluster if they don't do this before we remove the workaround (see Phase2).
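To make that concrete, the script headers would look roughly like this (hypothetical sketches, not our real scripts; as far as we can tell, older srun simply ignores SLURM_WHOLE):

# GroupA - opt in to the new 20.11 semantics:
#SBATCH -N2
unset SLURM_WHOLE        # ignore the cluster-wide workaround
srun ...                 # add --whole only where a step really needs the whole node

# GroupB - must keep working on 20.02 and on 20.11:
#SBATCH -N2
export SLURM_WHOLE=1     # unknown to older srun (so harmless there); restores 20.02-like behavior on 20.11
srun ...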

## OptionA, Phase2

At some point, we'll remove the workaround. GroupA will want to clean up their scripts to remove 'unset SLURM_WHOLE'. GroupB will need to leave 'export SLURM_WHOLE=1' in place until they no longer care to support older Slurms, at which point they can finally remove the envvar and modernize their script (probably with a liberal sprinkling of --whole).


# OptionB

Leave SLURM_WHOLE=1 set globally forever. At some point, our cluster will be the only one left behind and it will be a burden for users to need to consider our cluster when writing scripts.


# OptionC

Let it burn. Transition the cluster to 20.11 and don't set SLURM_WHOLE=1 globally. Most/all user scripts will break immediately. Give clear documentation and instructions about how to work around this. Basically, you skip OptionA Phase1 and move straight to OptionA Phase2.


# OptionD

We delay the upgrade to 20.11 and start instructing users to future-proof their user scripts. Once we feel that enough users have transitioned, we do the upgrade and risk breaking any user scripts which have not been updated yet.


As you can see, there are no options which don't involve breaking user scripts at some point.


Questions:

1. Do you have any suggestions as to how to detect automatically when user scripts have been updated? I was considering using a SPANK plugin with spank_getenv() to detect when SLURM_WHOLE is set, but that doesn't really help. For one, we might be setting it for them globally (in which case we don't know if they've updated anything). For another, they might be intentionally unsetting that envvar to opt-in to the new behavior (in which case a deprecation notice would be incorrect). I suppose I could grep for SLURM_OVERLAP in their sbatch script but I don't think I need to explain how that could go wrong. We could try to detect when most cores are under-utilized in an application, guess that they forgot --whole, and send them a warning, but that seems likely to generate a lot of false positives.

2. Are there any internal plans to add additional configuration flags or environment variables related to exclusive/whole/overlap? If so, perhaps you can share them and help us craft a new transition plan option with a little more automation and/or a little less pain?

3. Are there any plans to revert this breaking change? If so, we'll presumably set SLURM_OVERLAP=1 globally until we get the new, reverted Slurm and can remove the workaround.

Thanks,
Luke
Comment 1 Luke Yeager 2020-12-18 13:36:18 MST
(In reply to Luke Yeager from comment #0)
> Some of our supported applications don't use OpenMPI
There was some confusion about this. I meant to type "MPI". We support applications which don't use any flavor of MPI at all. In particular, I'm thinking of Pytorch. One supported workflow is, essentially:

#SBATCH -N2
srun --ntasks-per-node=1 python multi_node_launch.py

The Python script expects to have access to all resources on the local node - it will launch processes on the node and bind them to CPU/GPU resources as needed. Analogous to orted, I believe.
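A slightly fuller (hypothetical) sketch of what that workflow would presumably have to look like on 20.11, for context:

#!/bin/bash
#SBATCH -N2
# On 20.02 the step implicitly saw all node resources; on 20.11 we would have to ask for them:
srun --whole --ntasks-per-node=1 python multi_node_launch.py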

Fortunately, due to the way we handle binding, the job fails early and loudly for us on 20.11 with an error from numactl like 'libnuma: Warning: cpu argument 4,40-44 out of range'.

I share all this just to clarify my point that patching OpenMPI (and/or other MPIs) doesn't fix all applications, which seems to be the implication made by the title of bug#10453.


For further reading, in case anyone is interested:
* sbatch script: https://github.com/mlcommons/training_results_v0.7/blob/master/NVIDIA/benchmarks/ssd/implementations/pytorch/run.sub#L58-L60
* automatic conversion from Slurm envvars to Pytorch envvars: https://github.com/NVIDIA/enroot/blob/v3.2.0/conf/hooks/extra/50-slurm-pytorch.sh
* Pytorch code which consumes these envvars: https://github.com/pytorch/pytorch/blob/v1.7.1/torch/distributed/rendezvous.py#L140-L183
Comment 2 Luke Yeager 2020-12-21 08:12:58 MST
Two points of clarification about the Pytorch example in comment #1:

1. We assume that all partitions are configured with OverSubscribe=EXCLUSIVE, which is why you don't see --exclusive or --cpus-per-task in that sbatch script.

2. I mistakenly put "--ntasks-per-node=1" in the pseudo-code for the sbatch script; that matches an older version of the script. The current version is more like "--ntasks-per-node=$ngpus". But, since we don't set --cpus-per-task explicitly, the new behavior still doesn't give us the full allocation of CPUs:

$ srun -l --ntasks-per-node=8 grep Cpus_allowed_list /proc/self/status
0: Cpus_allowed_list:   0-7,40-47
1: Cpus_allowed_list:   0-7,40-47
...

vs. with SLURM_WHOLE=1 set:

$ srun -l --ntasks-per-node=8 grep Cpus_allowed_list /proc/self/status
0: Cpus_allowed_list:   0-79
1: Cpus_allowed_list:   0-79
...


More relevant configuration settings from our cluster[s]:

slurm.conf:SelectType=select/cons_tres
slurm.conf:SelectTypeParameters=CR_CPU
slurm.conf:PartitionName=DEFAULT OverSubscribe=EXCLUSIVE ...
slurm.conf:TaskPlugin=affinity,cgroup
slurm.conf:TaskPluginParam=None
cgroup.conf:ConstrainCores=yes
cgroup.conf:ConstrainRAMSpace=no


And I used the Pytorch example to show that not everyone uses MPI all the time, but the same problem goes for our applications which DO use MPI, since we launch them with srun instead of mpirun/mpiexec (so, the finer-grained OMPI_MCA_plm_slurm_args workaround doesn't help us): https://github.com/mlcommons/training_results_v0.7/blob/master/NVIDIA/benchmarks/resnet/implementations/mxnet/run.sub#L70-L72


Finally, I verified that we have yet another bad option available to us:

OptionE: disable task/cgroup and task/affinity

When both TaskPlugins are disabled, Slurm is unable to restrict resources from our jobs anymore. This fixes the MLPerf scripts linked above, but it causes problems for other workloads which DO rely on srun command-line options like -c and --exclusive. Plus, it removes the possibility of ever setting OverSubscribe=NO for any of our partitions in the future.
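(For reference, a rough sketch of what that change looks like relative to the slurm.conf excerpt above - not something we intend to keep:)

# slurm.conf, OptionE: replaces TaskPlugin=affinity,cgroup
TaskPlugin=task/none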
Comment 3 Kilian Cavalotti 2020-12-21 08:13:06 MST
Hi,

I am currently out of office, returning on January 4. 

If you need to
reach Stanford Research Computing, please email srcc-support@stanford.edu

Cheers,
Comment 4 Adam DeConinck 2020-12-21 08:13:13 MST
I'm out of the office until Monday, December 28. I'll review email and reply when back in the office.

Thanks,
Adam
Comment 5 Marcin Stolarek 2020-12-22 02:08:26 MST
Luke,                                                                            
                                                                                 
Let me share my thoughts, but honestly the main part of a transition plan for a specific site rests with the site administrators, and since the nature of the question is more of a configuration advisory than an outage, I'd say severity 4 is more appropriate for this ticket.
                                                                                 
From my experience (I've been a site admin for over 10 years), a mixture of all the options is the best approach, so the path I'd take is:
1) Identify applications that should always use --whole starting from 20.11.
For those applications I'd add an export of the SLURM_WHOLE environment variable in their module files; if the application's users are detail-aware, I suggest also printing information about that to standard error.

2) Identify users that may be affected by the Slurm upgrade, and let them know about the change and the fact that you're preparing to upgrade to 20.11. If you have many users that should use --whole, maybe set an appropriate MOTD?
                                                                                 
---                                                                              
On your questions:
1) I've started to think that the most common situation leading to issues is the use of srun --ntasks-per-node=1 inside an existing allocation. If this is the case, we can detect it in a CliFilter plugin and either set --whole automatically or just print a message to the end user about the consequences under slurm-20.11. Just having the message printed is what I'd do. I played with the snippet below:
>function slurm_cli_pre_submit(options, pack_offset)
>        -- Warn when srun --ntasks-per-node=1 is used inside an existing allocation
>        local runningJobId = os.getenv("SLURM_JOB_ID")
>        if runningJobId ~= nil and tonumber(options["ntasks-per-node"]) == 1
>        then
>                slurm.log_info("You are using --ntasks-per-node=1 inside an existing allocation; starting from Slurm 20.11 the step will by default be limited to the requested resources. If you still want the step to have access to the whole allocation, check the new --whole option of srun.")
>        end
>
>        return slurm.SUCCESS
>end
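(In case it helps for testing - assuming the stock cli_filter/lua plugin - the function above would live in a cli_filter.lua next to slurm.conf on the submit hosts, with roughly this enabled in slurm.conf:)

CliFilterPlugins=lua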
                                                                                    
                                                                                    
2) At the moment we don't plan any further changes or additions in this area.

3) As above - we don't plan on reverting this behavior. We believe it's a good way to go long term.

Let me know what you think, especially about the CliFilter approach.

cheers,
Marcin
Comment 6 Marcin Stolarek 2020-12-22 02:56:35 MST
PS.
>OptionE: disable task/cgroup and task/affinity

This is something I'd call strongly not recommended. Your environment may behave very differently from the other environments your users know. By disabling the cgroup TaskPlugin you'll lose its control over application-allocated memory. The alternative mechanism, JobAcctGatherParams=OverMemoryKill, relies on a periodic check, which means that when a user application allocates memory quickly, the OS OOM killer may start killing selected (not very different from random) processes.
Comment 7 Marcin Stolarek 2021-01-04 06:19:12 MST
Luke,

Did you have some time to take a look at comment 5 and 6? What are your thoughts?

cheers,
Marcin
Comment 8 Adam DeConinck 2021-01-04 06:19:30 MST
I'm out of the office until Monday, Jan 4. I'll review email and reply when back in the office.

Thanks,
Adam
Comment 9 Marcin Stolarek 2021-01-11 03:42:52 MST
Luke,

As our final decision on Bug 10383 is to make the --whole allocation a default in 20.11.3 I believe we can close this ticket.

If you'd like to further discuss it please don't hesitate to reopen.

cheers,
Marcin
Comment 10 Luke Yeager 2021-01-12 09:33:23 MST
(In reply to Marcin Stolarek from comment #9)
> As our final decision on Bug 10383 is to make the --whole allocation a
> default in 20.11.3 I believe we can close this ticket.
Yes, I agree. Now the only real "transition plan" we need is to explain the new behavior which requires the '--overlap' flag (as discussed on bug#10450). But that's less impactful, easier to detect, and easier to explain. Thanks for looking into it for us!
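For context, the case where --overlap now matters for us is launching a second step alongside one that already holds the resources, e.g. something like (a sketch; the command is hypothetical):

# inside an existing allocation, while the main step is running
srun --overlap -N1 -n1 nvidia-smi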
Comment 11 Luke Yeager 2021-01-26 10:36:16 MST
I have verified that the previous behavior has been restored with 20.11.3 (--whole is now the default again), thanks for the revert! Upgrading now becomes simple.