Bug 10450 - Tricky new behavior for srun --job=X related to --exclusive
Summary: Tricky new behavior for srun --job=X related to --exclusive
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.11.1
Hardware: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-12-15 15:29 MST by Luke Yeager
Modified: 2021-01-12 12:45 MST
CC: 6 users

See Also:
Site: NVIDIA (PSLA)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Luke Yeager 2020-12-15 15:29:04 MST
With 20.11's new behavior of defaulting to '--exclusive' for srun, our internal documentation regarding 'srun --jobid=X' is now incorrect.

> # previous behavior, second srun does not block
> (shell_1)$ salloc -N1 --exclusive
> (shell_1)$ srun --pty bash
> (shell_2)$ srun --jobid=X hostname
It is now recommended to use 'srun --whole' to get equivalent behavior for the first jobstep (see bug#10383), but that causes the second jobstep to block, unless the first srun also adds the '--overlap' flag. So the new documentation becomes:

> # new recommendation?
> (shell_1)$ salloc -N1 --exclusive
> (shell_1)$ srun --whole --overlap --pty bash
> (shell_2)$ srun --jobid=X hostname
Is there a better way for users to get back to the old behavior?

Also, for some reason adding --overlap to the second srun doesn't work. I would have expected that adding the flag to either of the sruns would be enough to ensure that neither jobstep is blocked. Seems like there is potential for a race condition here, if two jobsteps are submitted at the same time.

> # second srun is blocked
> (shell_1)$ salloc -N1 --exclusive
> (shell_1)$ srun --whole --pty bash
> (shell_2)$ srun --jobid=X --overlap hostname
Even crazier, if I swap the order of which srun is executed first, then I need the '--overlap' flag added to the second srun (from shell_1), even though that jobstep was submitted second! Is this intended?

> # second srun (NOTE - it's from shell_1 this time!) not blocked
> (shell_1)$ salloc -N1 --exclusive
> (shell_2)$ srun --jobid=X --whole --pty bash
> (shell_1)$ srun --overlap hostname


Note that if you try to reproduce this, you should watch out for bug#10449, too. If the first srun is not run within an salloc, then the '--exclusive' flag will not be applied and the second srun will not be blocked, which further highlights how hard this is to document cleanly.
Comment 1 Luke Yeager 2020-12-15 16:28:51 MST
Looks like I got a little mixed up. '--overlap' is always needed for the second srun.

So scratch these complaints/questions:

* for some reason adding --overlap to the second srun doesn't work
* Seems like there is potential for a race condition here
* Even crazier ...

Remaining questions:

1. Is it expected that 'srun --exclusive --whole' doesn't actually protect the jobstep from sharing resources with a second jobstep which adds the '--overlap' flag?

2. If I were to set 'SLURM_OVERLAP=1' and 'SLURM_WHOLE=1' for all users, is there any way for them to opt back into the new Slurm 20.11 behavior without needing to do 'unset SLURM_*'? Can they override the envvars with any command-line flags?
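For concreteness, that hypothetical site-wide setup would look something like this (the profile.d path is just an example):

> # hypothetical site-wide defaults, e.g. in /etc/profile.d/slurm_defaults.sh
> export SLURM_OVERLAP=1
> export SLURM_WHOLE=1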

Still trying to wrap my head around how best to explain the new behavior to our users. They have been pampered thus far, since all our partitions are exclusive, so they have never needed to add any flags related to resource constraints - all resources have been available to all steps unless they opt in with a CLI flag.
Comment 3 Michael Hinton 2020-12-17 15:05:41 MST
Hi Luke,

(In reply to Luke Yeager from comment #0)
> With 20.11's new behavior of defaulting to '--exclusive' for srun, our
> internal documentation regarding 'srun --jobid=X' is now incorrect.
> 
> > # previous behavior, second srun does not block
> > (shell_1)$ salloc -N1 --exclusive
> > (shell_1)$ srun --pty bash
> > (shell_2)$ srun --jobid=X hostname
> It is now recommended to use 'srun --whole' to get equivalent behavior for
> the first jobstep (see bug#10383), but that causes the second jobstep to
> block, unless the first srun also adds the '--overlap' flag. So the new
> documentation becomes:
> 
> > # new recommendation?
> > (shell_1)$ salloc -N1 --exclusive
> > (shell_1)$ srun --whole --overlap --pty bash
> > (shell_2)$ srun --jobid=X hostname
> Is there a better way for users to get back to the old behavior?
Yes. I think you want to use the new interactive step, available in 20.11. I mentioned this in https://bugs.schedmd.com/show_bug.cgi?id=10449#c4, but I'll reiterate the new recommendation here:

Set this in slurm.conf:

    LaunchParameters=use_interactive_step

Then instruct users to do this:

> (shell_1_login)$ salloc -N1 --exclusive
> (shell_2_node) $ srun hostname

(Note that if all your partitions have OverSubscribe=exclusive, then --exclusive for salloc doesn't do anything extra.)
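For reference, that kind of partition definition looks something like this in slurm.conf (node names illustrative):

    PartitionName=batch Nodes=node[001-064] OverSubscribe=EXCLUSIVE Default=YES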

It's much simpler, and should feel the same as the previous behavior in 20.02. salloc will give you an interactive pty terminal with access to the entire allocation, so you don't need to explicitly create this with `srun --whole --overlap --pty bash`. The terminal environment will have SLURM_JOB_ID set, so you won't need `--jobid=X` for subsequent sruns.

I believe that this should drastically simplify things for users and make most of your issues with --overlap and --exclusive mentioned in this ticket irrelevant.

Thanks,
-Michael
Comment 4 Luke Yeager 2020-12-17 15:14:59 MST
Yes, thanks for pointing out that flag here and in bug#10449. As I mentioned there, it's a cool new flag but ultimately doesn't help me much.

The examples given above are intentionally simplified to make the bug report easy to understand and reproduce. The actual use case that we document for our users is how to use GDB inside a container to debug why an application is hanging, which involves 'srun --jobid=X', 'enroot exec', and 'gdb'. If our users start adding '--whole' to their sruns for 20.11, then they'll also need to start adding '--overlap' to their 'srun --jobid=X' calls. None of that is ameliorated by LaunchParameters=use_interactive_step.

I think the responses boil down to: "Yes, you're going to have to explain --overlap to your users now - deal with it." I can live with that for this particular use-case, which was already tricky and needed documentation. So adding one more flag isn't the end of the world. Plus, this breaking change is pretty loud (the jobstep hangs), unlike the mysteriously missing CPU cores related to --whole reported elsewhere. I'll close this as INFOGIVEN.
Comment 5 Michael Hinton 2020-12-17 15:37:18 MST
(In reply to Luke Yeager from comment #4)
> The examples given below are intentionally simplified to make the bug report
> easy to understand and reproduce. The actual use case that we document for
> our users is how to use GDB inside a container to debug why an application
> is hanging, which involves 'srun --jobid=X', 'enroot exec', and 'gdb'. If
> our users start adding '--whole' to their srun's for 20.11, then they'll
> also need to start adding '--overlap' to their 'srun --jobid=X's. None of
> that is ameliorated by LaunchParameters=use_interactive_step.
Would you be willing to post your more complicated, real-world example? I'm having a hard time seeing why use_interactive_step doesn't help. But if it indeed does not, I'm also curious to see if there are any improvements we could make in Slurm that could help out in a case like this.
Comment 6 Luke Yeager 2020-12-17 16:18:58 MST
See this public documentation:

> # From the login node
> $ srun --jobid=432788 --container-name=myapp findmnt /data
> TARGET SOURCE               FSTYPE OPTIONS
> /data  /dev/nvme2n1p2[/mnt] ext4   rw,relatime,errors=remount-ro
https://github.com/NVIDIA/pyxis/wiki/Usage#--container-name
Typically, our users will be using this approach to drop into the existing container namespaces for a running task (e.g. to attach GDB to a process and debug a hang). It's probably a batch job (so 'use_interactive_step' is totally irrelevant), and they've probably been tailing the logfile, watching carefully while trying to reproduce a known issue. And since we have OverSubscribe=EXCLUSIVE set for all partitions, all cores are already allocated.

This works fine on 20.02 because the '--exclusive' flag isn't set.

To fix it for 20.11, we'd need to add the '--overlap' flag to the documentation. But, since our scripts need to work across many clusters with various Slurm versions, we'll probably need to suggest adding 'export SLURM_OVERLAP=1' before 'srun' as a more portable solution.
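In other words, something like this (reusing the job ID from the example above; this should be harmless on 20.02, where srun simply doesn't read the then-nonexistent variable):

> $ export SLURM_OVERLAP=1
> $ srun --jobid=432788 --container-name=myapp findmnt /data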
Comment 7 Luke Yeager 2020-12-18 07:55:00 MST
(In reply to Luke Yeager from comment #6)
> And since we have OverSubscribe=EXCLUSIVE set for all partitions, all cores
> are already allocated.
I made the wrong point here. I meant to point out that all the cores in the allocation are used by the main job step where the application is running, so it requires --whole on 20.11. That is why I need --overlap for the 'srun --jobid=X'.
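Putting it together, the 20.11 version of the workflow would look roughly like this (job ID, application, and container names are illustrative):

> # inside the batch script: the main step uses every core, hence --whole
> srun --whole ./my_app
> # later, from the login node: the tool step must overlap the main step
> srun --jobid=432788 --overlap --container-name=myapp gdb -p <pid>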
Comment 8 Matt Ezell 2020-12-18 13:16:25 MST
It seems like you have a need for a "main" process as well as a "tool" process. The "main" process consumes all the cores (as far as Slurm is concerned) but the "tool" needs to be able to co-locate with the application to introspect it.

There are a couple options for that.

- Mark all steps as --overlap, so that the tool can be srun with --overlap as well. That has the compatibility issues you mentioned.
- Allow the step to be exclusive, but allow the tool to run without consuming any resources. For an interactive (salloc) job, use_interactive_step could allow the tool process access on the first node. For a batch job, I might recommend setting up ssh hostbased authentication paired with pam_slurm_adopt. Instead of "srun --jobid=xxx gdb proc pid" it would be "ssh $(scontrol show job xxx | grep BatchHost | awk -F'=' '{print $2}') gdb proc pid". Maybe enroot could provide a wrapper to make that easier?
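As an aside, squeue can tighten that one-liner a bit, under the same assumptions (hostbased ssh auth plus pam_slurm_adopt); -h drops the header and %B prints the batch host:

ssh "$(squeue -h -j xxx -o %B)" gdb proc pid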
Comment 9 Luke Yeager 2020-12-18 13:26:42 MST
Hi Matt,

(In reply to Matt Ezell from comment #8)
> - Mark all steps as --overlap
Sure, we could mark all steps with that flag, or we could 'export SLURM_OVERLAP=1'. But I feel that setting it just for the "tool process" is fine, as I said in comment 4. I've been convinced that this particular breaking change in 20.11 is acceptable, at least for our site.

> For an interactive (salloc) job, use_interactive_step could allow the tool
> process access on the first node.
That would work for a single-node application which doesn't need to be launched with srun. But most of our applications are multi-node (so they need srun), and most of them use pyxis to containerize their tasks (which is only accessible through srun, not sbatch/salloc). So this new configuration flag is neat but doesn't really help us much.

> For a batch job, I might recommend setting up ssh hostbased authentication
> paired with pam_slurm_adopt.
Yes, we have that setup and it's an alternate workflow which we also document. But it's more low-level (see the documentation linked below).

> Maybe enroot could provide a wrapper to make that easier?
Like this one? ;)
https://github.com/NVIDIA/enroot/blob/master/doc/cmd/exec.md
Comment 10 Michael Hinton 2020-12-18 14:15:22 MST
(In reply to Luke Yeager from comment #9)
> > For an interactive (salloc) job, use_interactive_step could allow the tool
> > process access on the first node.
> That would work for a single node application which doesn't need to be
> launched with srun. But most of our applications are multi-node (so, needs
> srun), and most of them use pyxis to containerize their tasks (which is only
> accessible through srun, not sbatch/salloc). So this new configuration flag
> is neat but doesn't really help us much.
I've just learned that enabling use_interactive_step actually just makes a bare salloc call srun with the (intentionally) undocumented "--interactive" srun argument. So you could just do it explicitly with roughly the same effect:

$ srun --interactive --preserve-env --pty --exclusive $SHELL
$ scontrol show steps $SLURM_JOB_ID
StepId=7.interactive UserId=1000 StartTime=2020-12-18T14:09:21 TimeLimit=UNLIMITED
   State=RUNNING Partition=debug NodeList=test1
   Nodes=1 CPUs=0 Tasks=1 Name=interactive Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=inspiron:11741
$ grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:      0-3
$ srun grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:      0,2

See https://slurm.schedmd.com/slurm.conf.html#OPT_InteractiveStepOptions for the one documented use of --interactive with srun.
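For completeness, the relevant slurm.conf lines would be the following; the second line just spells out the documented default for InteractiveStepOptions, so it only needs to be set if you want different options:

LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"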

This may or may not be of use to you with pyxis, but I at least wanted to make you aware.

-Michael