Ticket 9186 - Slurm install with different SLURM_PREFIX no longer works inside a job run with a different Slurm install
Summary: Slurm install with different SLURM_PREFIX no longer works inside a job run with a different Slurm install
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-05 18:58 MDT by Chris Samuel (NERSC)
Modified: 2020-07-17 12:14 MDT
CC: 3 users

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Chris Samuel (NERSC) 2020-06-05 18:58:15 MDT
Hi there,

Just started testing Slurm 20.02 on Gerty and have run into a regression when running our ReFrame tests, in this case the test that checks we can submit from inside a job on the Cray XC to the Slurm cluster that manages the external compute nodes (for transfer jobs).

The way we set this up is to have separate Slurm builds: one with the usual /usr/bin/sbatch etc. for the XC part, and another build that has /opt/esslurm/ set as its --prefix and has its own config file.
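
Roughly speaking, the second build is configured something like this (a sketch only, not our exact build recipe):

./configure --prefix=/opt/esslurm --sysconfdir=/etc/esslurm   # separate prefix and config dir
make && make install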

Here's an example of it working on Cori:

csamuel@cori10:~> salloc -q interactive -C haswell
salloc: Pending job allocation 31395060
salloc: job 31395060 queued and waiting for resources
salloc: job 31395060 has been allocated resources
salloc: Granted job allocation 31395060
salloc: Waiting for resource configuration
salloc: Nodes nid00249 are ready for job
csamuel@nid00249:~> module load esslurm
csamuel@nid00249:~> sbatch -q xfer --test --wrap hostname
sbatch: Job 724062 to start at 2020-06-05T16:46:34 using 1 processors on nodes cori01 in partition xfer

Whereas on Gerty (which has the same module file):

csamuel@gert01:~> salloc -q interactive -C haswell
salloc: Granted job allocation 1936622
salloc: Waiting for resource configuration
salloc: Nodes nid00022 are ready for job
csamuel@nid00022:~> module load esslurm
csamuel@nid00022:~> sbatch -q xfer --test --wrap hostname
sbatch: error: Job request does not match any supported policy for gerty

In the second case the giveaway is that the submit filter reports which cluster you're trying to submit to when the job doesn't match a rule, so you can see that for some reason it's still trying to submit to the XC.


Here's the output when it's run with lots of -v's:

csamuel@nid00442:~> sbatch -vvvvvvvvvvvvvv -q xfer --test --wrap hostname
sbatch: debug4: found jobid = 1936627, stepid = 0
sbatch: debug4: found jobid = 1936627, stepid = 4294967295
sbatch: debug:  Leaving stepd_getpw
sbatch: debug3: cli_filter/lua: slurm_lua_loadscript: skipping loading Lua script: /etc/esslurm/cli_filter.lua
sbatch: debug3: cli_filter/lua: slurm_lua_loadscript: skipping loading Lua script: /etc/esslurm/cli_filter.lua
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: qos                 : xfer
sbatch: test-only           : set
sbatch: verbose             : 14
sbatch: wrap                : hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: shifter_slurm.so: init_post_opt = 0
sbatch: debug2: spank: zonesort.so: init_post_opt = 0
sbatch: debug2: spank: sdn_plugin.so: init_post_opt = 0
sbatch: debug2: spank: perf.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
sbatch: debug:  propagating RLIMIT_STACK=18446744073709551615
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating RLIMIT_RSS=126701535232
sbatch: debug:  propagating RLIMIT_NPROC=2048
sbatch: debug:  propagating RLIMIT_NOFILE=4096
sbatch: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/auth_munge.so
sbatch: debug:  Munge authentication plugin loaded
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/select_cons_res.so
sbatch: select/cons_res loaded with argument 50
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/select_cons_tres.so
sbatch: select/cons_tres loaded with argument 50
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/select_cray_aries.so
sbatch: Cray/Aries node selection plugin loaded
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/select_linear.so
sbatch: Linear node selection plugin loaded with argument 50
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /opt/esslurm/lib64/slurm/select_cons_res.so
sbatch: select/cons_res loaded with argument 50
sbatch: debug3: Success.
sbatch: error: Job request does not match any supported policy for gerty
allocation failure: Unspecified error
sbatch: debug2: spank: libAtpSLaunch.so: exit = 0
sbatch: debug2: spank: libslurm_notifier.so: exit = 0
sbatch: debug2: (24788) __del__:840 Unloading slurm_notifier


There's something in the environment with Slurm 20.02 that seems to cause it, though: if I clear the environment and start a new shell then it works correctly (so it's not the install itself).

csamuel@nid00442:~> env - /bin/bash -l
csamuel@nid00442:/global/u2/c/csamuel> module load esslurm
csamuel@nid00442:/global/u2/c/csamuel> sbatch -q xfer --test --wrap hostname
sbatch: Job 973 to start at 2020-06-05T17:55:42 using 1 processors on nodes gert01 in partition xfer
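
A quick way to narrow down which variable is responsible (a sketch, I haven't included the actual diff here) is to dump both environments and compare them:

env | sort > /tmp/env.in-job                        # inside the allocation
env - /bin/bash -lc 'env | sort' > /tmp/env.clean   # clean login shell
diff /tmp/env.in-job /tmp/env.clean | grep '^<'     # variables only present in the job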


Any ideas?

All the best,
Chris
Comment 1 Chris Samuel (NERSC) 2020-06-05 23:41:29 MDT
This is all the esslurm module does:

csamuel@gert01:~> module show esslurm
-------------------------------------------------------------------
/usr/common/software/modulefiles/esslurm:

module-whatis	 NERSC es Slurm binary executable module 
prepend-path	 PATH /opt/esslurm/bin 
prepend-path	 LD_LIBRARY_PATH /opt/esslurm/lib64 
-------------------------------------------------------------------
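
In shell terms that's roughly equivalent to (illustration only):

export PATH=/opt/esslurm/bin:$PATH
export LD_LIBRARY_PATH=/opt/esslurm/lib64:$LD_LIBRARY_PATH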
Comment 3 Marcin Stolarek 2020-06-09 05:36:22 MDT
Chris,

This looks like a side-effect of the config-less approach introduced in Slurm 20.02. Specifically, to make sure that commands use the same cached version of slurm.conf, the SLURM_CONF environment variable gets set in the job environment.
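
You can see this from inside an allocation - for example (a sketch, reusing the salloc/srun options from your examples):

srun -q interactive -C haswell bash -c 'echo "SLURM_CONF=${SLURM_CONF:-<unset>}"'
# Inside a 20.02 job this prints the slurm.conf path exported by Slurm,
# which is why the esslurm sbatch keeps talking to the XC controller.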

Although environment modules are outside of SchedMD's expertise, I think that adding:
>set OLD_SLURM_CONF [getenv SLURM_CONF]
>unsetenv SLURM_CONF $OLD_SLURM_CONF
to the esslurm module file should work for you. The goal is to unset SLURM_CONF when the module is loaded, but restore it to the previous value if someone unloads the module to go back to the local build - e.g. to create a step within an existing allocation on gerty after submitting to xfer.

Let me know if that worked.

cheers,
Marcin
Comment 4 Doug Jacobsen 2020-06-09 07:34:02 MDT
We also have a number of use cases where calling the executable by absolute path (e.g., /opt/esslurm/bin/scontrol) needs to work, so I do think we need to ensure that an executable can find its own configuration without relying on the environment.
Comment 5 Marcin Stolarek 2020-06-09 09:05:26 MDT
Doug,

Just to make sure we're on the same page.

Calling the binary within a clean environment (without the SLURM_CONF variable set) will work as you expect and use slurm.conf from the binary's default location. The issue here is calling a Slurm binary from inside an existing allocation against a different configuration than the one used by the initial salloc/sbatch/srun.

The simple workaround that should work for this case is to use a wrapper that unsets SLURM_CONF (or sets it appropriately) before calling the Slurm command. What do you think?
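
A minimal sketch of such a wrapper (paths taken from this ticket; where the real sbatch binary ends up is up to you, so the .real name below is hypothetical):

#!/bin/bash
# Hypothetical wrapper installed in place of /opt/esslurm/bin/sbatch.
# Drop the SLURM_CONF inherited from the surrounding XC job so this build
# falls back to its own config (or export SLURM_CONF=/etc/esslurm/slurm.conf instead).
unset SLURM_CONF
exec /opt/esslurm/bin/sbatch.real "$@"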

cheers,
Marcin
Comment 6 Chris Samuel (NERSC) 2020-06-09 12:44:06 MDT
Hi Marcin,

Whilst setting up wrappers is technically possible, it is a bit of a hack to work around something that used to work.

What I'm wondering, looking at _establish_config_source() in src/common/read_config.c, is whether this could be fixed by having the check for the file specified by SLURM_CONF come after the check for default_slurm_config_file rather than before it.

Given that (at least to my understanding) under configless operation that file shouldn't exist (or, if it does, it will be a symlink to the downloaded config), that would seem to be the right order of operations anyway?

What do you think?

All the best,
Chris
Comment 7 Marcin Stolarek 2020-06-10 00:21:14 MDT
Chris,

The logic is a little different - SLURM_CONF has always had precedence over the built-in default; it wasn't introduced for config-less operation, it's merely used by it. The new part is that SLURM_CONF now gets set by salloc (and other commands), not its precedence over the default location.

If we ever decide to change this, such a default could only be changed as part of a major release. Looking at it from the perspective of an arbitrary site running Slurm, there are sites whose environment modules simply update SLURM_CONF to achieve similar behavior. Another (maybe the strongest) argument against the change is the "common sense" meaning of a built-in default configuration file, which to me is: use that location if nothing is provided by other mechanisms such as a run-time option or an environment variable.

Another option to consider, though definitely a bigger change to your environment, is a Slurm multi-cluster setup - have you considered that?
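
(For reference, with multi-cluster the external cluster is addressed by name via -M/--clusters - the cluster name below is just an example:)

sbatch -M escluster -q xfer --test --wrap hostname   # submit to the other cluster by name
squeue -M escluster                                  # query its queue from a compute node
# Note: multi-cluster operation requires both clusters to be registered in the same slurmdbd.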

cheers,
Marcin
Comment 8 Chris Samuel (NERSC) 2020-06-12 16:23:55 MDT
Hi Marcin,

Thanks for that - I can understand the reasoning behind it.

Is it possible that SLURM_CONF could only be set when running configless (i.e. when "enable_configless" is set in SlurmctldParameters)?

That would seem like a reasonable compromise: the new behaviour would only occur when requested, leaving existing configurations unaffected by the change.
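
(For what it's worth, whether a cluster is running configless can be checked from the live config, e.g.:)

scontrol show config | grep -i SlurmctldParameters
# "enable_configless" listed here means the controller serves the config to clients.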

All the best,
Chris
Comment 9 Chris Samuel (NERSC) 2020-06-12 21:56:46 MDT
Hi Marcin,

Having poked around the code some more, and tried some patches myself, I've realised the simplest solution is probably to just use a task prolog to unset SLURM_CONF on our Cray XC systems.

csamuel@gert01:~> cat /etc/slurm/taskprolog.sh
#!/bin/bash
# Unset SLURM_CONF on XC to prevent it breaking "module load esslurm"
echo unset SLURM_CONF

results in:

csamuel@gert01:~> srun  -q interactive -C haswell  bash -c 'env | fgrep -c SLURM_CONF'
srun: job 2224234 queued and waiting for resources
srun: job 2224234 has been allocated resources
0
srun: error: nid00058: task 0: Exited with exit code 1
srun: Terminating job step 2224234.0

which is what we want.

Likely the simplest solution all round (no patches, no wrappers, no changes to RPMs and only deployed where we need it via our ansible setup).
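
(For completeness: the script above only takes effect because the XC cluster's slurm.conf points at it - something like the check below, path as in the listing above.)

grep -i '^TaskProlog' /etc/slurm/slurm.conf
# expected: TaskProlog=/etc/slurm/taskprolog.sh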

All the best,
Chris
Comment 13 Marcin Stolarek 2020-06-19 05:23:54 MDT
Chris,

We had a short internal discussion about this case and we think that unsetting/overriding SLURM_CONF in the esslurm environment module is the best approach.

Did you check whether the absolute-path case is used within a job allocation (I asked Doug about that in comment 5)? If it's not, then there won't be any change for those calls.

cheers,
Marcin
Comment 14 Marcin Stolarek 2020-06-26 05:41:36 MDT
Chris,

Just touching base - are you OK with our advice?

cheers,
Marcin
Comment 15 Marcin Stolarek 2020-07-02 08:15:14 MDT
Chris,

Another follow-up: if you don't reply within a week, I'll close the report as "information given".

cheers,
Marcin
Comment 16 Marcin Stolarek 2020-07-09 05:43:10 MDT
Chris,

I'm closing the case as "information given".

Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
Comment 17 Chris Samuel (NERSC) 2020-07-09 10:37:35 MDT
Hi Marcin,

Sorry, I didn't see your previous replies until this was closed.

We can't assume people will load the environment module, so we'll stick with unsetting it in the task prolog; that seems the easiest way to stop it breaking things for us.

All the best,
Chris
Comment 18 Marcin Stolarek 2020-07-15 08:56:56 MDT
Chris,

Understood; however, you have to be aware that we don't run any regression testing with such a disruptive TaskProlog. This may result in issues on your side in the future that may not be easy for us to reproduce without knowing the variable is being unset.

cheers,
Marcin
Comment 19 Chris Samuel (NERSC) 2020-07-17 12:14:04 MDT
(In reply to Marcin Stolarek from comment #18)

> Understood, however, you have to be aware that we're not running any
> regression testing with such a disruptive TaskProlog. This may result in
> some issues on your side in the future that may not be easy to reproduce by
> us without knowing about the variable being unset.

Understood, it's just an unfortunate and unavoidable consequence of the Cray architecture. :-(