Hi SchedMD,

At our site, we are grappling with the breaking changes in 20.11 regarding the --exclusive and --whole flags for srun (existing discussion at bug#10383), and trying to figure out a transition plan for our users. Some of our supported applications don't use OpenMPI, so patching all MPIs on the cluster is not only infeasible (as pointed out by others), but also insufficient.

We need a transition plan for two groups of users:

* GroupA: users who can opt in to the new 20.11 behavior right away and who don't need to support older Slurms
* GroupB: users who need their code to work on 20.11 and also on older Slurms, at least for a while

# OptionA

Set SLURM_WHOLE=1 in /etc/environment so that all users get an implicit --whole flag added to srun, which mostly restores the 20.02 behavior. Leave this in place for a period of time, and then remove it. Most user scripts will just keep working on our cluster, but users will not know their scripts are deprecated unless we tell them. We will have to start a communications campaign to get enough users transitioned to the new 20.11 behavior before we remove this global envvar and inevitably break scripts for any users we haven't managed to reach yet.

## OptionA, Phase1

GroupA will need to add 'unset SLURM_WHOLE' at the top of their scripts. GroupB will need to add 'export SLURM_WHOLE=1' at the top of theirs, since other clusters might not set it for them, and since their scripts will break even on our cluster if they don't do this before we remove the workaround (see Phase2).

## OptionA, Phase2

At some point, we'll remove the workaround. GroupA will want to clean up their scripts to remove 'unset SLURM_WHOLE'. GroupB will need to leave 'export SLURM_WHOLE=1' in place until they no longer care to support older Slurms, at which point they can finally remove the envvar and modernize their scripts (probably with a liberal sprinkling of --whole).
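For concreteness, a GroupB script under OptionA, Phase1 would carry the guard itself. This is a job-script fragment, not runnable code: the application and srun flags are placeholders, and we're assuming older Slurms simply ignore an unrecognized SLURM_WHOLE in the environment (worth verifying on the oldest release you support):

```shell
#!/bin/bash
#SBATCH -N2
# GroupB: ask for the pre-20.11 step behavior explicitly, so the
# script works whether or not the cluster sets SLURM_WHOLE globally
export SLURM_WHOLE=1
srun --ntasks-per-node=8 ./app
```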
# OptionB

Leave SLURM_WHOLE=1 set globally forever. At some point, our cluster will be the only one left behind, and it will be a burden for users to have to consider our cluster when writing scripts.

# OptionC

Let it burn. Transition the cluster to 20.11 and don't set SLURM_WHOLE=1 globally. Most or all user scripts will break immediately. Give clear documentation and instructions about how to work around this. Basically, skip OptionA Phase1 and move straight to OptionA Phase2.

# OptionD

Delay the upgrade to 20.11 and start instructing users to future-proof their scripts. Once we feel that enough users have transitioned, do the upgrade and accept the risk of breaking any scripts which have not been updated yet.

As you can see, there is no option which doesn't involve breaking user scripts at some point.

Questions:

1. Do you have any suggestions for how to detect automatically when user scripts have been updated? I was considering a SPANK plugin using spank_getenv() to detect when SLURM_WHOLE is set, but that doesn't really help. For one, we might be setting it for them globally (in which case we don't know whether they've updated anything). For another, they might be intentionally unsetting that envvar to opt in to the new behavior (in which case a deprecation notice would be incorrect). I suppose I could grep for SLURM_OVERLAP in their sbatch scripts, but I don't think I need to explain how that could go wrong. We could try to detect when most cores are under-utilized in an application, guess that the user forgot --whole, and send a warning, but that seems likely to generate a lot of false positives.

2. Are there any internal plans to add additional configuration flags or environment variables related to exclusive/whole/overlap? If so, perhaps you can share them and help us craft a transition plan with a little more automation and/or a little less pain?

3. Are there any plans to revert this breaking change?
If so, we'll presumably set SLURM_OVERLAP=1 globally until we get the new, reverted Slurm and can remove the workaround.

Thanks,
Luke
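PS. Regarding question 1, the crude grep-based scan I alluded to would look something like the below. This is a hypothetical helper, not something we run today: it only sees single-line srun commands, and it will wrongly flag scripts that export SLURM_WHOLE somewhere the pattern doesn't catch.

```shell
#!/bin/sh
# scan_sruns: heuristic scan of one sbatch script for srun lines that
# may still rely on the pre-20.11 whole-allocation default.
scan_sruns() {
    script="$1"
    # print srun lines that mention neither --whole nor SLURM_WHOLE;
    # '|| true' keeps the exit status 0 when nothing is flagged
    grep -n 'srun' "$script" | grep -v -e '--whole' -e 'SLURM_WHOLE' || true
}
```

We'd run this over wherever the site keeps copies of submitted scripts and mail the owners of anything it flags — with all the false-positive caveats from the question above.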
(In reply to Luke Yeager from comment #0)
> Some of our supported applications don't use OpenMPI

There was some confusion about this. I meant to type "MPI". We support applications which don't use any flavor of MPI at all. In particular, I'm thinking of Pytorch. One supported workflow is, essentially:

#SBATCH -N2
srun --ntasks-per-node=1 python multi_node_launch.py

The Python script expects to have access to all resources on the local node - it will launch processes on the node and bind them to CPU/GPU resources as needed. Analogous to orted, I believe. Fortunately, due to the way we handle binding, the job fails early and loudly for us on 20.11 with an error from numactl like 'libnuma: Warning: cpu argument 4,40-44 out of range'.

I share all this just to clarify my point that patching OpenMPI (and/or other MPIs) doesn't fix all applications. Which seems to be the implication made by the title of bug#10453.

For further reading, in case anyone is interested:

* sbatch script: https://github.com/mlcommons/training_results_v0.7/blob/master/NVIDIA/benchmarks/ssd/implementations/pytorch/run.sub#L58-L60
* automatic conversion from Slurm envvars to Pytorch envvars: https://github.com/NVIDIA/enroot/blob/v3.2.0/conf/hooks/extra/50-slurm-pytorch.sh
* Pytorch code which consumes these envvars: https://github.com/pytorch/pytorch/blob/v1.7.1/torch/distributed/rendezvous.py#L140-L183
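For anyone who doesn't want to click through: the translation that enroot hook performs boils down to something like the sketch below. This is a simplified illustration, not the actual hook — RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT are the names torch.distributed's env:// rendezvous reads, but the master-address fallback and port here are placeholders I picked, and resolving rank 0's real hostname from the step nodelist is elided.

```shell
#!/bin/sh
# Sketch: map Slurm's per-task envvars onto the ones Pytorch's
# env:// rendezvous consumes. Fails loudly if run outside a step.
slurm_to_pytorch() {
    export RANK="${SLURM_PROCID:?SLURM_PROCID not set}"
    export WORLD_SIZE="${SLURM_NTASKS:?SLURM_NTASKS not set}"
    export LOCAL_RANK="${SLURM_LOCALID:-0}"
    # rank 0's host should serve as the rendezvous master; the real
    # hook derives it from the allocation, we just default it here
    export MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"
    export MASTER_PORT="${MASTER_PORT:-29500}"
}
```

The point stands either way: none of this machinery uses MPI, so MPI-side patches can't fix it.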
Two points of clarification about the Pytorch example in comment #1:

1. We assume that all partitions are configured with OverSubscribe=EXCLUSIVE, which is why you don't see --exclusive or --cpus-per-task in that sbatch script.

2. I mistakenly put "--ntasks-per-node=1" as the pseudo-code for the sbatch script, which refers to an older version of that sbatch script. The current version is more like "--ntasks-per-node=$ngpus". But, since we don't set --cpus-per-task explicitly, the new behavior still doesn't give us the full allocation of CPUs:

$ srun -l --ntasks-per-node=8 grep Cpus_allowed_list /proc/self/status
0: Cpus_allowed_list: 0-7,40-47
1: Cpus_allowed_list: 0-7,40-47
...

vs. with SLURM_WHOLE=1 set:

$ srun -l --ntasks-per-node=8 grep Cpus_allowed_list /proc/self/status
0: Cpus_allowed_list: 0-79
1: Cpus_allowed_list: 0-79
...

More relevant configuration settings from our cluster[s]:

slurm.conf:SelectType=select/cons_tres
slurm.conf:SelectTypeParameters=CR_CPU
slurm.conf:PartitionName=DEFAULT OverSubscribe=EXCLUSIVE ...
slurm.conf:TaskPlugin=affinity,cgroup
slurm.conf:TaskPluginParam=None
cgroup.conf:ConstrainCores=yes
cgroup.conf:ConstrainRAMSpace=no

And I used the Pytorch example to show that not everyone uses MPI all the time, but the same problem goes for our applications which DO use MPI, since we launch them with srun instead of mpirun/mpiexec (so, the finer-grained OMPI_MCA_plm_slurm_args workaround doesn't help us): https://github.com/mlcommons/training_results_v0.7/blob/master/NVIDIA/benchmarks/resnet/implementations/mxnet/run.sub#L70-L72

Finally, I verified that we have yet another bad option available to us:

OptionE: disable task/cgroup and task/affinity

When both TaskPlugins are disabled, Slurm is unable to restrict resources from our jobs anymore. This fixes the MLPerf scripts linked above, but it causes problems for other workloads which DO rely on srun command-line options like -c and --exclusive.
Plus, it removes the possibility of our ever setting OverSubscribe=NO for any of our partitions in the future.
Luke,

Let me share my thoughts, but honestly the main part of a transition plan for a specific site is on the site administrators, and since the nature of the question is more a configuration advisory than an outage, I'd say severity 4 is more appropriate for this ticket.

From my experience (I've been a site admin for over 10 years), a mixture of all the options is the best approach, so the path I'd go is:

1) Identify applications that should always use --whole starting from 20.11. For those applications I'd add an export of the SLURM_WHOLE environment variable in module files; if the application users are detail-aware, I suggest printing information about that to standard error.

2) Identify users that may be affected by the Slurm upgrade, and let them know about the change and the fact that you're preparing to upgrade to 20.11. If you have many users that should use --whole, maybe set an appropriate MOTD?

---

On the questions...

1) I've started to think that the most common situation leading to issues is the use of srun --ntasks-per-node=1 inside an existing allocation. If this is the case, we can discover it in a CliFilter plugin and either set --whole automatically or just print a message to the end user about the consequences under slurm-20.11. Just having the message printed is what I'd do. I played with the below snippet:

>function slurm_cli_pre_submit(options, pack_offset)
>    -- only warn for steps launched inside an existing allocation
>    local runningJobId = os.getenv("SLURM_JOB_ID")
>    if runningJobId ~= nil and tonumber(options["ntasks-per-node"]) == 1
>    then
>        slurm.log_info("You are using --ntasks-per-node=1 inside an existing allocation; starting from Slurm 20.11 the step will by default be limited to the requested resources. If you still want the step to have access to the whole allocation, check the new --whole option of srun")
>    end
>
>    return slurm.SUCCESS
>end

2) At the moment we don't plan further changes/additions in this area.

3) As above - we don't plan on reverting this behavior. We believe it's a good way to go long term.
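For reference (and as an assumption on my part - please check the slurm.conf man page for your release), wiring the snippet above in should only require saving it as cli_filter.lua in the same directory as slurm.conf and enabling the plugin:

```
# slurm.conf fragment (sketch)
CliFilterPlugins=lua
```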
Let me know what you think, especially about the CliFilter approach.

cheers,
Marcin
PS.

>OptionE: disable task/cgroup and task/affinity

This is something I'd call strongly not recommended. Your environment may behave very differently from the other environments your users know. By disabling the cgroup TaskPlugin you'll lose its control over the memory allocated by applications. The alternative mechanism, JobAcctGatherParams=OverMemoryKill, relies on a periodic check, which means that when a user application allocates memory quickly, the OS OOM killer may start killing selected (not very different from random) processes.
Luke,

Did you have some time to take a look at comments 5 and 6? What are your thoughts?

cheers,
Marcin
Luke,

As our final decision on Bug 10383 is to make the --whole allocation the default in 20.11.3, I believe we can close this ticket. If you'd like to discuss it further, please don't hesitate to reopen.

cheers,
Marcin
(In reply to Marcin Stolarek from comment #9)
> As our final decision on Bug 10383 is to make the --whole allocation a
> default in 20.11.3 I believe we can close this ticket.

Yes, I agree. Now the only real "transition plan" we need is to explain the new behavior which requires the '--overlap' flag (as discussed on bug#10450). But that's less impactful, easier to detect, and easier to explain. Thanks for looking into it for us!
I have verified that the previous behavior has been restored with 20.11.3 (--whole is now the default again), thanks for the revert! Upgrading now becomes simple.