Ticket 17108 - Fix regression: `SLURM_NTASKS` is not set in the job environment if `--ntasks-per-node` is specified
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 23.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 17451
Depends on:
Blocks:
 
Reported: 2023-07-04 14:04 MDT by Olivier Fisette
Modified: 2023-11-20 09:52 MST
CC: 7 users

See Also:
Site: Simon Fraser University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name: Cedar
CLE Version:
Version Fixed: 23.02.5, 23.11.0rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Olivier Fisette 2023-07-04 14:04:34 MDT
We recently upgraded one of our clusters to SLURM 23.02.3. In this release, the `SLURM_NTASKS` output environment variable is no longer set when the `--ntasks` option is not specified (e.g. when using only `--nodes`, `--ntasks-per-node`, `--cpus-per-task`). From reading the changelog and Bug 16704, it appears this change was intentional. However, the documentation is not clear as to when this variable is set. In `man sbatch`:

```
SLURM_NTASKS
       Same as -n, --ntasks

SLURM_NTASKS_PER_CORE
       Number  of  tasks  requested per core.  Only set if the --ntasks-per-core
       option is specified.
```

All the other `SLURM_NTASKS_*` options also contain an “Only set if the [...] option is specified” clause. I think it would be clearer if `SLURM_NTASKS` had one, i.e. “Only set if the --ntasks option is set”.

Incidentally, this change broke some user scripts that relied on `SLURM_NTASKS` as an easy way to get the expected total number of tasks in a resource allocation. This seems to be a typical pattern for commercial software that packages its own MPI implementation and does not integrate well with SLURM and `srun`. Is there another way to get the value previously held by `SLURM_NTASKS`? For now, we recommend using arithmetic substitution, e.g. `$(( SLURM_NNODES * SLURM_NTASKS_PER_NODE ))`.
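For illustration, a script-level fallback along these lines seems to do the job (the `#SBATCH` values and the `mpirun` launcher below are only placeholders):

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Fall back to arithmetic when SLURM_NTASKS is not exported (23.02.3 behaviour).
ntasks=${SLURM_NTASKS:-$(( SLURM_NNODES * SLURM_NTASKS_PER_NODE ))}

# Placeholder for a vendor MPI launcher that does not integrate with Slurm/srun.
mpirun -np "$ntasks" ./my_app
```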
Comment 1 Olivier Fisette 2023-07-04 16:01:17 MDT
As a follow-up to my remark about an easy way to get the expected total number of tasks in a resource allocation, perhaps a new variable (e.g. `SLURM_NTASKS_IN_ALLOC`) could be used to hold the value that was previously available? This would not interact with job steps that are allocated a different number of tasks.
Comment 2 Nathan Wielenga 2023-07-06 10:07:51 MDT
Please note that this is for the Cedar cluster at Simon Fraser University, which holds a support contract.
Comment 7 Bill Wichser 2023-07-11 09:52:01 MDT
I'd like to add a "me too". Many of our researchers have relied upon SLURM_NTASKS, and either I missed this removal or it is a bug.
Comment 8 Marshall Garey 2023-07-11 10:04:28 MDT
I'll make sure this is documented.

Note on the change: Although SLURM_NTASKS was automatically set, it was not always correct.

Consider the example from the commit message:
https://github.com/SchedMD/slurm/commit/ef513023ad87a3870bf575efd2329672819c59f0

```
Only send the number of tasks to the batch script if they were
explicitly requested.

i.e.
sbatch -N2 --ntasks-per-node=1 --wrap="srun -N1 -v env | grep SLURM_STEP"

Before this patch the srun would run 2 tasks instead of just 1.

Bug 15690
```

This example shows how it can be wrong for a job step. In addition to this example, the way that SLURM_NTASKS was calculated was a guess based on some job request parameters, but did not use the actual number of tasks in the job allocation. So it's possible that it was not always correct for the job allocation, too.

Adding a new environment variable SLURM_NTASKS_IN_ALLOC is an interesting idea, but we need to discuss it further and also guarantee that it is always correct.

As a workaround, you can use a job_submit plugin as described here:

https://bugs.schedmd.com/show_bug.cgi?id=16278#c2

However, you'll run into the same problems that I just described where SLURM_NTASKS is not always correct.
Comment 16 Marshall Garey 2023-08-01 14:24:49 MDT
We have pushed a documentation fix in commit 3ae60c6b2e. It will be live on the website when 23.02.5 is released.
Comment 17 Marshall Garey 2023-08-01 14:47:37 MDT
Our dev lead has been busy, so we have not yet discussed adding a new environment variable (SLURM_NTASKS_IN_JOB or something like that).

However, I have some additional notes about the behavior before version 23.02:


SLURM_NTASKS was only set when one of the following options was requested:

--ntasks-per-gpu
--ntasks-per-node
--ntasks

If none of these options were requested, then SLURM_NTASKS was not set.

In addition, we set SLURM_NTASKS based on the job request, not the actual job allocation. There are at least two potential problems with that:

(1) The job allocation can have more tasks than the job request (like with --exclusive node allocations).

A simple example:

salloc --exclusive

This job requests one task, but since it requests a whole node it could be allocated any number of cpus. It can run as many tasks as there are cpus, although the default is one.

srun -n1
srun -n2


Or a more complicated example:

salloc --ntasks-per-gpu=1 --gpus-per-node=2 --exclusive

This job requests a node with at least 2 gpus. But it could be allocated a node with more than 2 gpus, and would be given all of those gpus.

srun hostname # Default: 2 tasks
srun --gpus-per-node=4 hostname # If on a node with 4 gpus, this would give 4 tasks



(2) A range of nodes can be requested

salloc -N1-4
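
To make the ambiguity concrete (a hypothetical request, not output from a real job):

```
# With a node range plus a per-node task count, the total task count is
# unknown until the allocation is actually made:
salloc -N1-4 --ntasks-per-node=2   # 2, 4, 6, or 8 tasks depending on nodes granted
```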


So, setting an environment variable with the number of tasks allocated to a job is not as trivial as simply restoring the 22.05 behavior, since SLURM_NTASKS was not always set and was sometimes incorrect when it was set.
Comment 20 Olivier Fisette 2023-08-01 20:06:00 MDT
Thanks for the explanation, Marshall. It helped me understand that the number of tasks cannot be inferred in the general case. Thus, `SLURM_NTASKS_IN_JOB` would be ill-defined in some situations. At best, it would be possible to compute `SLURM_MAX_NTASKS_IN_JOB` unambiguously, but that sounds silly.

I guess the best options are to leave `SLURM_NTASKS` undefined unless `--ntasks` is set (current behaviour), or to set `SLURM_NTASKS` only when it is unambiguous. The latter feels like more trouble than it would be worth.
Comment 21 Marshall Garey 2023-08-02 12:56:20 MDT
Update: We are looking into this request. I'll let you know when I have more information.
Comment 27 Chris Samuel (NERSC) 2023-08-18 12:52:22 MDT
Hi there,

Another "me too" from NERSC. We just upgraded to Slurm 23.02.4 and are getting a bunch of people reporting that their scripts no longer work, including our vendor, who is trying to run some tests.

I'm going to see if I can put a workaround in via the task prolog for the moment to stem the bleeding, but I would definitely like to see something set that users could refer to if SLURM_NPROC and SLURM_NTASKS won't get set in this situation.
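
Roughly what I have in mind for the task prolog (an untested sketch; it assumes SLURM_NNODES and SLURM_NTASKS_PER_NODE are visible to the prolog, and is not an official recommendation):

```
#!/bin/bash
# TaskProlog sketch: lines written to stdout of the form "export NAME=value"
# are added to the task's environment. Only fill in SLURM_NTASKS when it is
# missing and a per-node task count is known.
if [ -z "$SLURM_NTASKS" ] && [ -n "$SLURM_NTASKS_PER_NODE" ] && [ -n "$SLURM_NNODES" ]; then
    echo "export SLURM_NTASKS=$(( SLURM_NNODES * SLURM_NTASKS_PER_NODE ))"
fi
```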

All the best,
Chris
Comment 28 Marshall Garey 2023-08-21 10:42:59 MDT
Chris, another easy workaround is to use the job_submit plugin as described here:

https://bugs.schedmd.com/show_bug.cgi?id=16278#c2
Comment 36 Marshall Garey 2023-08-24 13:35:28 MDT
*** Ticket 17451 has been marked as a duplicate of this ticket. ***
Comment 41 Marshall Garey 2023-08-30 14:57:03 MDT
Hi folks,

We have decided to revert the change in 23.02: In 23.02.5, SLURM_NTASKS will be set in the job's environment if you request --ntasks-per-node. I am updating the title of this bug. We are also updating the documentation to say that SLURM_NTASKS will be set in the job's environment if any of the following options are requested:

--ntasks
--ntasks-per-node
--ntasks-per-gpu

If none of these options are requested, then SLURM_NTASKS is not set.

SLURM_NTASKS is set correctly if a node range is requested along with --ntasks-per-node. SLURM_NTASKS will be set to however many tasks are in the job allocation, which just depends on the number of nodes allocated to the job.

All of this should restore the 22.05 behavior for SLURM_NTASKS.
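
As a quick sanity check once 23.02.5 is installed (the job size and wrapped command here are arbitrary):

```
# --ntasks-per-node without --ntasks should again export SLURM_NTASKS
# (here 2 nodes x 4 tasks per node = 8).
sbatch -N2 --ntasks-per-node=4 --wrap='echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"'
```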

One thing to note if you use a job_submit plugin: num_tasks will not be set if --ntasks-per-node is requested.


I will update you when we have pushed changes upstream.
Comment 46 Marshall Garey 2023-08-31 16:17:17 MDT
We have pushed fixes upstream ahead of 23.02.5. See commits f3b93ea3e7..cc6f49d1f4

(In reply to Marshall Garey from comment #41)

> We are also updating the documentation to
> say that SLURM_NTASKS will be set in the job's environment if any of the
> following options are requested:
> 
> --ntasks
> --ntasks-per-node
> --ntasks-per-gpu
> 
> If none of these options are requested, then SLURM_NTASKS is not set.

Correction: I found a few edge cases where SLURM_NTASKS is still set. We opted not to document them because they are rare. For now, we do not plan on changing the behavior of these edge cases. We are wary of making further changes to how SLURM_NTASKS is set.

I'm closing this as fixed in 23.02.5.