Hi SchedMD,

It looks like submitting a job with --cpu_bind=map_cpu:<cpuid> generates a CPU mask of 0xffffffffffffffffffffffffffffffff (i.e. no binding at all) when <cpuid> doesn't exist, for instance when it's larger than the number of CPUs on the node. Here, requesting CPU 132 on a 128-core machine:

$ srun -n 1 --cpu_bind=verbose,map_cpu:132 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh03-14n03, task 0 0 [225788]: mask 0xffffffffffffffffffffffffffffffff set
CPU: 113 (pid: 225788)

$ srun -n 1 --cpu_bind=verbose,map_cpu:132 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh03-14n03, task 0 0 [229029]: mask 0xffffffffffffffffffffffffffffffff set
CPU: 93 (pid: 229029)

Shouldn't that generate an error instead of running with no binding at all?

Cheers,
--
Kilian
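As an aside, a quick Slurm-independent way to inspect a task's effective binding (alongside `ps -o psr`) is the kernel's /proc interface; a minimal sketch, assuming a Linux node:

```shell
# Print the affinity mask of the current process as both a hex mask
# and a human-readable CPU list (Linux-only /proc interface).
grep '^Cpus_allowed' /proc/self/status
# Cpus_allowed:      hex mask; all-ones across the node means "no binding"
# Cpus_allowed_list: the same information as a CPU range, e.g. 0-19
```

Run inside a step (e.g. `srun -n 1 --cpu_bind=map_cpu:0 grep '^Cpus_allowed' /proc/self/status`), this shows the mask the step actually ended up with.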
Kilian,

Did you allocate the whole node? As documented:

>map_cpu:<list>
> [...]
>Not supported unless the entire node is allocated to the job.

In either case (a non-whole-node allocation, or map_cpu/mask_cpu pointing to CPUs outside of the available range) you should see an appropriate info-level message coming from slurmstepd (where task binding actually happens):

either (src/plugins/task/affinity/dist_tasks.c):
> 413 info("entire node must be allocated, "
> 414      "disabling affinity");

or:
> 295 info("Ignoring user CPU binding outside of job "
> 296      "step allocation");

Unfortunately, we don't send these messages to the end user today, which is an architectural limitation (forwarding log messages from slurmstepd to srun). I'll take a look to check whether that's something we can improve, but it sounds like it falls into the enhancement area.

I hope that makes things clearer,
Marcin
Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> Did you allocate the whole node?
> As documented:
> >map_cpu:<list>
> > [...]
> >Not supported unless the entire node is allocated to the job.

Yes, that's with a whole-node allocation. Here's a complete transcript:

$ salloc -p test -N 1 --exclusive
salloc: Pending job allocation 21306623
salloc: job 21306623 queued and waiting for resources
salloc: job 21306623 has been allocated resources
salloc: Granted job allocation 21306623
salloc: Waiting for resource configuration
salloc: Nodes sh02-01n60 are ready for job

$ srun lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               3129.052
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4788.75
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
[...]

So, we have 20 CPUs to play with.
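For reference, the "bind to everything" fall-back mask on a node with 20 online CPUs is simply the value with the low 20 bits set; a small sanity check using plain shell arithmetic, nothing Slurm-specific:

```shell
# The full-node affinity mask for 20 online CPUs: low 20 bits set.
printf 'full-node mask: 0x%x\n' $(( (1 << 20) - 1 ))
# prints: full-node mask: 0xfffff
```

So whenever verbose output reports mask 0xfffff on this node, the step is effectively unbound within the node.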
Requesting CPU id 99 generates a non-binding mask and results in a random CPU being used:

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task 0 0 [48443]: mask 0xfffff set
CPU: 3 (pid: 48443)

$ srun -n 1 --cpu_bind=verbose,map_cpu:99 bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
cpu-bind=MASK - sh02-01n60, task 0 0 [48477]: mask 0xfffff set
CPU: 5 (pid: 48477)

> Either case non-whole node allocated or map_cpu/mask_cpu pointing to CPUs
> outside of the available range you should see appropriate info level message
> coming from slurmstepd (where task binding really happens):
>
> either(src/plugins/task/affinity/dist_tasks.c):
> > 413 info("entire node must be allocated, "
> > 414 "disabling affinity");
> or:
> > 295 info("Ignoring user CPU binding outside of job "
> > 296 "step allocation");

Indeed, from the test above, I get this in the node's logs:

Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: _validate_map: Ignoring user CPU binding outside of job step allocation
Mar 30 08:29:44 sh02-01n60.int slurmd[55645]: task/affinity: lllp_distribution: JobId=21306623 manual binding: verbose,mask_cpu,one_thread

> Unfortunately, we don't send this message to the end-user today, which is a
> kind of limitation of architecture (sending log messages from slurmstepd to
> srun). I'll take a look to check if it's something we can improve, but it
> sounds like falling into the enhancement area.

I see, but I'd classify this more in the defect area :) The person who will care the most about this message is the user who submits the job: she will be the one most affected by the random binding of her tasks, and she won't get access to the warning message. On the other hand, the only person who can see the message is the sysadmin, who has no direct interest in knowing that affinity was disabled for that job.
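Since the warning never reaches the user, one way a user can detect the silent fall-back today is to count the set bits in the step's own affinity mask; a hedged sketch using only POSIX shell and awk (no Slurm-specific interface involved):

```shell
# Count the set bits in the current process's Cpus_allowed mask; with a
# working one-CPU binding this prints 1, with the fall-back mask it
# prints the node's full CPU count. Linux /proc + awk, nothing else.
awk '/^Cpus_allowed:/ {
    gsub(",", "", $2)                    # strip the 32-bit group commas
    n = 0
    for (i = 1; i <= length($2); i++) {  # one hex digit at a time
        d = index("0123456789abcdef", tolower(substr($2, i, 1))) - 1
        while (d > 0) { n += d % 2; d = int(d / 2) }
    }
    print "allowed CPUs:", n
}' /proc/self/status
```

Run under `srun -n 1 --cpu_bind=map_cpu:<id> ...`, a result equal to the node's CPU count instead of 1 reveals that the requested binding was silently dropped.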
So I think there's a problem with the direction of that message: it really should be presented to the user, not logged for the sysadmin.

One could even argue that because the requested binding (to a CPU outside the allocation) cannot be granted, the step should not run at all and be rejected. I mean, if a user requests 64 CPUs on a 32-CPU node, the job will be rejected; so if she requests a CPU binding that can't be satisfied, maybe it shouldn't go through either, rather than being executed with a different binding than the one that was requested.

What do you think?

Thanks,
--
Kilian
Kilian,

The behavior discussed here has been changed on the master branch [1] (to become Slurm 22.05). In case of a CPU-binding failure, the task launch will now be rejected with an appropriate error message.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/85af3bcb8c1fa2e9939263d3f5ef3f3625a5997c
Great, thanks Marcin!

Cheers,
--
Kilian