Ticket 19335 - srun: fatal: cpus_per_task set by two different environment variables
Summary: srun: fatal: cpus_per_task set by two different environment variables
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 23.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Carlos Tripiana Montes
Reported: 2024-03-15 19:19 MDT by Kilian Cavalotti
Modified: 2024-05-15 10:08 MDT

Site: Stanford


Description Kilian Cavalotti 2024-03-15 19:19:56 MDT
Hi SchedMD!

We got a few users reporting a new error related to SLURM_TRES_PER_TASK, which is sometimes set to a different value than SLURM_CPUS_PER_TASK. That makes `srun` complain and fail with the following message:

"""
Error Message: srun: fatal: cpus_per_task set by two different environment variables SLURM_CPUS_PER_TASK=2 != SLURM_TRES_PER_TASK=cpu:1
"""

I think that mostly happens for jobs submitted with `--mem-per-cpu` to partitions with DefMemPerCPU and MaxMemPerCPU, when the number of allocated CPUs is adjusted to accommodate the job memory request.

It also apparently only happens with `sbatch`, but not with `salloc` or `srun` directly.

Here's an example, with a job requesting 1 CPU per task and 12GB per CPU on a partition with MaxMemPerCPU=8000:
```
$ sbatch -p part --mem-per-cpu=12G -n 1 -c 1 --wrap='echo $SLURM_CPUS_PER_TASK $SLURM_TRES_PER_TASK; srun hostname'
Submitted batch job 43044262
$ cat slurm-43044262.out
2 cpu:1
srun: fatal: cpus_per_task set by two different environment variables SLURM_CPUS_PER_TASK=2 != SLURM_TRES_PER_TASK=cpu:1
```

No issue with salloc, the two values match:
```
$ salloc -p part --mem-per-cpu=12G -n 1 -c 1 bash -c 'echo $SLURM_CPUS_PER_TASK $SLURM_TRES_PER_TASK; srun hostname'
salloc: Pending job allocation 43044145
salloc: job 43044145 queued and waiting for resources
salloc: job 43044145 has been allocated resources
salloc: Granted job allocation 43044145
salloc: Waiting for resource configuration
salloc: Nodes sh02-03n46 are ready for job
1 cpu:1
sh02-03n46.int
salloc: Relinquishing job allocation 43044145
```

And no issue with a direct srun either:
```
$ srun -p part --mem-per-cpu=12G -n 1 -c 1 bash -c 'echo $SLURM_CPUS_PER_TASK $SLURM_TRES_PER_TASK'
srun: job 43044359 queued and waiting for resources
srun: job 43044359 has been allocated resources
1 cpu:1
```


Here's the relevant partition config:
```
$ scontrol show partition part | grep MemPer
   DefMemPerCPU=7958 MaxMemPerCPU=8000
```
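
In the meantime, we've been suggesting a workaround to affected users (not fully verified on our side): unsetting one of the two conflicting variables in the batch script before calling srun, so that srun only has a single source for cpus_per_task. Roughly:
```
# Workaround sketch, not fully verified: keep the value sbatch adjusted
# (SLURM_CPUS_PER_TASK=2) and drop the TRES copy before srun runs.
$ sbatch -p part --mem-per-cpu=12G -n 1 -c 1 \
    --wrap='unset SLURM_TRES_PER_TASK; srun hostname'
```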

Any idea why `sbatch` has a different behavior, and where the problem comes from?

Thanks!
--
Kilian
Comment 1 Carlos Tripiana Montes 2024-03-21 07:39:50 MDT
Hey Kilian,

I am waiting for a colleague to finish some work on the x_per_y adjustment logic, to see if this still happens after that fix.

I'll review this bug once the other is officially fixed.

Thank you,
Carlos
Comment 2 Kilian Cavalotti 2024-04-12 09:51:16 MDT
Hi Carlos!

(In reply to Carlos Tripiana Montes from comment #1)
> I am waiting a colleague to finish some work on the x_per_y adjustment
> logic, to see if this happens after that fix.
> 
> I'll review this bug once the other is officially fixed.

Just checking to see if there's been any progress on the x_per_y work (is there a bug number for this?). We keep getting more users reporting the issue... :\

Thanks!
--
Kilian
Comment 3 Carlos Tripiana Montes 2024-04-30 09:13:42 MDT
Sorry for the delay Kilian,

I'm going to devote some time to this one now, as we finally got the other bug closed, but it didn't end up involving any TRES logic fix.

I'd say the bad value in your case is SLURM_CPUS_PER_TASK=2, so the TRES logic seems smarter...

Well, I need to double-check whether this can be reproduced in master, because we did change some bits of the logic. See:

* bd70f4df7a NEWS for the ntasks precedence fix
* 7b00991bf9 srun - Removed "can't honor" warning on job allocations
* 57c1685f16 salloc - Disable calculation of ntasks if set by the user in the cli
* 9cfff7276e srun - Disable recalculation of ntasks if set by the user in the cli
* ef04c57c9d Add ntasks_opt_set parameter for explicitly set ntasks
Comment 4 Carlos Tripiana Montes 2024-04-30 09:31:26 MDT
I've reproduced your issue and I'll track it down ASAP. I'll keep you posted.
Comment 7 Kilian Cavalotti 2024-04-30 10:46:18 MDT
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #4)
> I've reproduced your issue and I'll track it down ASAP. I'll keep you posted.

Thanks for the update!

Cheers,
--
Kilian
Comment 9 Carlos Tripiana Montes 2024-04-30 10:54:10 MDT
So...

The issue itself seems to be sbatch respecting the docs. From --mem-per-cpu:

"Note that if the job's --mem-per-cpu value exceeds the configured MaxMemPerCPU, then the user's limit will be treated as a memory limit per task; --mem-per-cpu will be reduced to a value no larger than MaxMemPerCPU; --cpus-per-task will be set and the value of --cpus-per-task multiplied by the new --mem-per-cpu value will equal the original --mem-per-cpu value specified by the user".

This paragraph is replicated in the sbatch, salloc, and srun docs, but only sbatch actually behaves this way.

There's also the issue that the TRES logic doesn't respect this behaviour in any case.

We may also want to simply reject any job that requests mem-per-cpu > MaxMemPerCPU.

So we first need to decide what to fix/change, do it, and update the docs accordingly. I'll open an internal chat with the team to decide the best course of action.

Cheers,
Carlos.
Comment 10 Kilian Cavalotti 2024-04-30 12:52:50 MDT
(In reply to Carlos Tripiana Montes from comment #9)
> So...
> 
> The issue itself seems to be sbatch respecting the docs. From --mem-per-cpu:
> 
> "Note that if the job's --mem-per-cpu value exceeds the configured
> MaxMemPerCPU, then the user's limit will be treated as a memory limit per
> task; --mem-per-cpu will be reduced to a value no larger than MaxMemPerCPU;
> --cpus-per-task will be set and the value of --cpus-per-task multiplied by
> the new --mem-per-cpu value will equal the original --mem-per-cpu value
> specified by the user".
> 
> This paragraph is replicated in the sbatch, salloc, and srun docs, but only
> sbatch actually behaves this way.
> 
> There's also the issue that the TRES logic doesn't respect this behaviour
> in any case.
> 
> We may also want to simply reject any job that requests
> mem-per-cpu > MaxMemPerCPU.
> 
> So we first need to decide what to fix/change, do it, and update the docs
> accordingly. I'll open an internal chat with the team to decide the best
> course of action.

Thanks Carlos!

Rejecting jobs requesting mem-per-cpu>MaxMemPerCPU is certainly an option, although it may get complicated for jobs submitted to multiple partitions with different MaxMemPerCPU values.

Intuitively, I'd expect things to work like they have historically, where the number of CPUs allocated to the job is dynamically adjusted to satisfy MaxMemPerCPU.

The same way that, today, if I request a job with --mem=16GB on a partition with MaxMemPerCPU=8GB, the scheduler automatically allocates 2 CPUs to the job.
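
For reference, that's easy to observe with something like this (partition name and sizes are just illustrative):
```
# With MaxMemPerCPU=8000, a 16G per-node request should come back with 2 CPUs.
$ sbatch -p part --mem=16G -n 1 --wrap='echo allocated CPUs: $SLURM_CPUS_ON_NODE'
```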

Cheers,
--
Kilian
Comment 11 Carlos Tripiana Montes 2024-05-01 01:24:47 MDT
> Rejecting jobs requesting mem-per-cpu>MaxMemPerCPU is certainly an option,
> although it may get complicated for jobs submitted to multiple partitions
> with different MaxMemPerCPU values.

This sounds like a reason to respect the documented/current behaviour, although it means that *only* sbatch is doing so nowadays... We then need to spot and fix salloc, srun, and the TRES logic. Or so it seems to be the case.

I'll let you know how this evolves.
Comment 12 Carlos Tripiana Montes 2024-05-01 01:32:01 MDT
But... another option is to just drop, at submission time, any partition whose MaxMemPerCPU is smaller than the requested mem-per-cpu from the job's partition list, and preserve the valid ones. We may want to inform the user about this too, just in case.
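
On the submit side, that kind of filtering would look roughly like this (per-CPU request hard-coded to 12288 MiB just for illustration):
```
# List partitions whose MaxMemPerCPU can accommodate a given per-CPU request.
$ scontrol -o show partition | awk -v req=12288 '{
    name = ""; max = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^PartitionName=/) { split($i, a, "="); name = a[2] }
      if ($i ~ /^MaxMemPerCPU=/)  { split($i, b, "="); max  = b[2] }
    }
    if (max == "UNLIMITED" || max + 0 >= req) print name
  }'
```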