Ticket 11227 - --cpu_bind=map_cpu:<cpuid> is silently ignored (info is displayed only in slurmd logs, so end-user is not aware of that)
Summary: --cpu_bind=map_cpu:<cpuid> is silently ignored (info is displayed only in slurmd logs, so end-user is not aware of that)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.5
Hardware: Linux
OS: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on: 11247
Blocks:
Reported: 2021-03-25 19:35 MDT by Kilian Cavalotti
Modified: 2021-12-03 09:36 MST

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kilian Cavalotti 2021-03-25 19:35:04 MDT
Hi SchedMD,

It looks like submitting a job with --cpu_bind=map_cpu:<cpuid> generates a CPU mask of 0xffffffffffffffffffffffffffffffff (i.e. no binding) when <cpuid> doesn't exist (e.g. when it's larger than the number of CPUs on the node).

For instance, requesting CPU 132 on a 128-core machine:

$ srun -n 1 --cpu_bind=verbose,map_cpu:132 bash -c 'printf "CPU: %s (pid: %s)\n"  $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh03-14n03, task  0  0 [225788]: mask 0xffffffffffffffffffffffffffffffff set
CPU: 113 (pid: 225788)

$ srun -n 1 --cpu_bind=verbose,map_cpu:132 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh03-14n03, task  0  0 [229029]: mask 0xffffffffffffffffffffffffffffffff set
CPU: 93 (pid: 229029)
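
For what it's worth, the absence of binding can also be confirmed from inside the step with plain /proc tools (a standard Linux check, nothing Slurm-specific): when no binding is applied, Cpus_allowed_list spans every CPU on the node.

$ srun -n 1 --cpu_bind=verbose,map_cpu:132 bash -c 'grep Cpus_allowed_list /proc/self/status'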


Shouldn't that generate an error instead of running with no binding at all?

Cheers,
--
Kilian
Comment 2 Marcin Stolarek 2021-03-30 06:24:40 MDT
Kilian,

Did you allocate the whole node?
As documented:
>map_cpu:<list>
> [...]
>Not supported unless the entire node is allocated to the job.

In either case (a non-whole-node allocation, or a map_cpu/mask_cpu pointing to CPUs outside of the available range) you should see an appropriate info-level message coming from slurmstepd (where task binding really happens):

either(src/plugins/task/affinity/dist_tasks.c):
> 413                         info("entire node must be allocated, "                   
> 414                              "disabling affinity");   

or:
> 295                 info("Ignoring user CPU binding outside of job "                 
> 296                      "step allocation");
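
On the compute node those messages end up in the slurmd/slurmstepd log, so a quick way to check for them is to grep the log there (the log path is site-specific, /var/log/slurm/slurmd.log below is just an example):

$ grep -E 'entire node must be allocated|Ignoring user CPU binding' /var/log/slurm/slurmd.log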

Unfortunately, we don't forward this message to the end-user today, which is a kind of architectural limitation (log messages from slurmstepd are not sent back to srun). I'll take a look to check whether it's something we can improve, but it sounds like it falls into the enhancement area.

I hope that makes things clearer for you,
Marcin
Comment 3 Kilian Cavalotti 2021-03-30 10:05:36 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #2)
> Did you allocate the whole node?
> As documented:
> >map_cpu:<list>
> > [...]
> >Not supported unless the entire node is allocated to the job.

Yes, that's with a whole-node allocation. Here's a complete transcript:

$ salloc -p test -N 1 --exclusive
salloc: Pending job allocation 21306623
salloc: job 21306623 queued and waiting for resources
salloc: job 21306623 has been allocated resources
salloc: Granted job allocation 21306623
salloc: Waiting for resource configuration
salloc: Nodes sh02-01n60 are ready for job

$ srun lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               3129.052
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4788.75
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
[...]

So, we have 20 CPUs to play with. Requesting CPU id 99 generates a non-binding mask, and results in a random CPU being used:

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n"  $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task  0  0 [48443]: mask 0xfffff set
CPU: 3 (pid: 48443)

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n"  $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task  0  0 [48477]: mask 0xfffff set
CPU: 5 (pid: 48477)
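
As a quick sanity check, 0xfffff is exactly the full 20-bit mask of that node (i.e. all 20 CPUs), so it's effectively no binding at all; plain shell arithmetic confirms it:

$ echo $(( 0xfffff == (1 << 20) - 1 ))
1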


> In either case (a non-whole-node allocation, or a map_cpu/mask_cpu pointing to
> CPUs outside of the available range) you should see an appropriate info-level
> message coming from slurmstepd (where task binding really happens):
> 
> either(src/plugins/task/affinity/dist_tasks.c):
> > 413                         info("entire node must be allocated, "                   
> > 414                              "disabling affinity");   
> 
> or:
> > 295                 info("Ignoring user CPU binding outside of job "                 
> > 296                      "step allocation");

Indeed, from the test above, I get this:

Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: _validate_map: Ignoring user CPU binding outside of job step allocation
Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: lllp_distribution: JobId=21306623 manual binding: verbose,mask_cpu,one_thread

> Unfortunately, we don't forward this message to the end-user today, which is a
> kind of architectural limitation (log messages from slurmstepd are not sent
> back to srun). I'll take a look to check whether it's something we can
> improve, but it sounds like it falls into the enhancement area.

I see, but I'd classify this more in the defect area :)

The person who will care the most about this message is the user who submits the job: she will be the one most affected by the random binding of her tasks, and she won't get access to the warning message.
On the other hand, the only person who can see the message is the sysadmin, who has no direct interest in knowing that affinity was disabled for that job.

So I think there's a problem with the direction of that message: it really should be presented to the user, not logged for the sysadmin.

One could even argue that because the requested binding (to a CPU outside the allocation) cannot be granted, the step should not run at all and should be rejected. I mean, if a user requests 64 CPUs on a 32-CPU node, the job will be rejected, so if she requests a CPU binding that can't be satisfied, maybe it shouldn't go through either (rather than being executed with a different binding than the one that was requested).

What do you think?

Thanks,
--
Kilian
Comment 23 Marcin Stolarek 2021-12-03 09:21:40 MST
Kilian,

The discussed behavior has been changed on the master branch[1] (the upcoming Slurm 22.05). In case of a CPU binding failure, task launch will now be rejected with an appropriate error message.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/85af3bcb8c1fa2e9939263d3f5ef3f3625a5997c
Comment 24 Kilian Cavalotti 2021-12-03 09:36:48 MST
Great, thanks Marcin!

Cheers,
--
Kilian