Summary: | --cpu_bind=map_cpu:<cpuid> is silently ignored (info is displayed only in slurmd logs, so the end user is not aware of it) | ||
---|---|---|---|
Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
Component: | Other | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | acgmecaselog.com |
Version: | 20.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Stanford | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 22.05pre1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Ticket Depends on: | 11247 | ||
Ticket Blocks: |
Description
Kilian Cavalotti
2021-03-25 19:35:04 MDT
Marcin Stolarek

Kilian,

Did you allocate the whole node? As documented:

> map_cpu:<list>
> [...]
> Not supported unless the entire node is allocated to the job.

In either case (non-whole-node allocation, or map_cpu/mask_cpu pointing to CPUs outside of the available range) you should see an appropriate info-level message coming from slurmstepd (where task binding really happens), either (src/plugins/task/affinity/dist_tasks.c):

> 413         info("entire node must be allocated, "
> 414              "disabling affinity");

or:

> 295         info("Ignoring user CPU binding outside of job "
> 296              "step allocation");

Unfortunately, we don't send this message to the end user today, which is a limitation of the architecture (sending log messages from slurmstepd to srun). I'll take a look to check whether it's something we can improve, but it sounds like it falls into the enhancement area.

I hope that makes it clearer for you,
Marcin

Kilian Cavalotti

Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> Did you allocate the whole node? As documented:
> > map_cpu:<list>
> > [...]
> > Not supported unless the entire node is allocated to the job.

Yes, that's with a whole-node allocation.
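As an aside, the map_cpu:<list> semantics quoted above can be sketched in a few lines: per the srun documentation, list entries give the CPU ID for task 0, task 1, and so on, and the list is reused cyclically when there are more tasks than entries. This is an illustrative sketch, not Slurm source code:

```python
# Sketch (not Slurm source): how --cpu_bind=map_cpu:<list> assigns CPU IDs
# to task ranks. Per the srun documentation, if the number of tasks exceeds
# the number of list entries, the list is reused from the beginning.

def map_cpu_binding(cpu_list, ntasks):
    """Return the CPU ID each task rank would be bound to."""
    return [cpu_list[rank % len(cpu_list)] for rank in range(ntasks)]

# e.g. --cpu_bind=map_cpu:0,2,4 with 5 tasks:
print(map_cpu_binding([0, 2, 4], 5))  # [0, 2, 4, 0, 2]
```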
Here's a complete transcript:

$ salloc -p test -N 1 --exclusive
salloc: Pending job allocation 21306623
salloc: job 21306623 queued and waiting for resources
salloc: job 21306623 has been allocated resources
salloc: Granted job allocation 21306623
salloc: Waiting for resource configuration
salloc: Nodes sh02-01n60 are ready for job

$ srun lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               3129.052
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4788.75
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
[...]

So, we have 20 CPUs to play with. Requesting CPU id 99 generates a non-binding mask, and results in a random CPU being used:

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task  0  0 [48443]: mask 0xfffff set
CPU: 3 (pid: 48443)

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task  0  0 [48477]: mask 0xfffff set
CPU: 5 (pid: 48477)

> In either case (non-whole-node allocation, or map_cpu/mask_cpu pointing to
> CPUs outside of the available range) you should see an appropriate info-level
> message coming from slurmstepd (where task binding really happens), either
> (src/plugins/task/affinity/dist_tasks.c):
> > 413         info("entire node must be allocated, "
> > 414              "disabling affinity");
> or:
> > 295         info("Ignoring user CPU binding outside of job "
> > 296              "step allocation");

Indeed, from the test above, I get this:

Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: _validate_map: Ignoring user CPU binding outside
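The "mask 0xfffff set" line in the verbose output above is the key detail: 0xfffff has the low 20 bits set, i.e. it covers CPUs 0-19, the entire node, which is why the task lands on an arbitrary CPU instead of the requested CPU 99. A small sketch to decode such an affinity mask:

```python
# Decode a CPU affinity mask like the "mask 0xfffff set" reported by
# --cpu_bind=verbose: each set bit selects one CPU ID.

def mask_to_cpus(mask):
    """Return the list of CPU IDs whose bit is set in the affinity mask."""
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

# 0xfffff = 20 low bits set, i.e. CPUs 0 through 19 (the whole node):
print(mask_to_cpus(0xFFFFF))
```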
of job step allocation
Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: lllp_distribution: JobId=21306623 manual binding: verbose,mask_cpu,one_thread

> Unfortunately, we don't send this message to the end user today, which is a
> limitation of the architecture (sending log messages from slurmstepd to
> srun). I'll take a look to check whether it's something we can improve, but
> it sounds like it falls into the enhancement area.

I see, but I'd classify this more in the defect area :) The person who cares most about this message is the user who submitted the job: she is the one most affected by the random binding of her tasks, and she never gets to see the warning. On the other hand, the only person who can see the message is the sysadmin, who has no direct interest in knowing that affinity was disabled for that job. So I think the message is going in the wrong direction: it really should be presented to the user, not logged for the sysadmin.

One could even argue that because the requested binding (to a CPU outside the allocation) cannot be granted, the step should not run at all and should be rejected. If a user requests 64 CPUs on a 32-CPU node, the job is rejected; so if she requests a CPU binding that can't be satisfied, maybe it shouldn't go through either (rather than being executed with a different binding than the one requested).

What do you think?

Thanks,
--
Kilian

Marcin Stolarek

Kilian,

The discussed behavior got changed on the master branch[1] (Slurm 22.05 to be). In case of a CPU binding failure, the task launch will be rejected with an appropriate error message.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/85af3bcb8c1fa2e9939263d3f5ef3f3625a5997c

Kilian Cavalotti

Great, thanks Marcin!

Cheers,
--
Kilian
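The behavior change described above (reject the launch instead of silently falling back to a full-node mask) can be sketched as a simple validation step. This is an assumed illustration of the logic, not the code from the linked commit:

```python
# Sketch (assumed logic, not the actual Slurm commit): validate a map_cpu
# request against the job's allocated CPUs. Before the fix, an out-of-range
# request was silently replaced by a full-node mask; after it, the task
# launch is rejected with an error.

def validate_map_cpu(requested, allocated):
    """Return the requested CPU IDs, or raise if any fall outside the allocation."""
    bad = [cpu for cpu in requested if cpu not in allocated]
    if bad:
        raise ValueError(f"CPU binding outside of job step allocation: {bad}")
    return requested

allocated = set(range(20))                 # the 20-CPU node from the transcript
print(validate_map_cpu([3, 5], allocated))  # accepted: [3, 5]
try:
    validate_map_cpu([99], allocated)       # map_cpu:99 on a 20-CPU node
except ValueError as err:
    print(err)                              # rejected with an error message
```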