A customer discovered this issue and I reproduced it on an in-house system. I've set the severity here to match our customer bug's severity.

The basic configuration is:

SelectType=select/cray
SelectTypeParameters=CR_CPU_Memory,other_cons_res,CR_ONE_TASK_PER_CORE
TaskPlugin=task/cgroup,task/cray

And in cgroup.conf:

TaskAffinity=yes

On a node with Sockets=1 CoresPerSocket=8 ThreadsPerCore=2:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=v,mask_cpu:0xf,0xf0,0xf00,0xf000 --slurmd-debug=1 grep Cpus_allowed /proc/self/status
slurmstepd: Created file /var/opt/cray/alps/spool/status11762
slurmstepd: task/cgroup: task[0] is requesting core level binding
slurmstepd: task/cgroup: task[1] is requesting core level binding
slurmstepd: task/cgroup: task[2] is requesting core level binding
slurmstepd: task/cgroup: task[2] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[1] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[0] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[3] is requesting core level binding
slurmstepd: task/cgroup: task[3] not enough Core objects, disabling affinity
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15

Without --slurmd-debug=1 or looking at /proc/self/status, there's no indication that the affinity setting was rejected.
In addition, this exact same setup works for task/affinity:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=v,mask_cpu:0xf,0xf0,0xf00,0xf000 grep Cpus_allowed /proc/self/status
cpu_bind_cores=MASK - nid00032, task 0 0 [16100]: mask 0xf set
Cpus_allowed:   000f
Cpus_allowed_list:      0-3
cpu_bind_cores=MASK - nid00032, task 1 1 [16101]: mask 0xf0 set
Cpus_allowed:   00f0
Cpus_allowed_list:      4-7
cpu_bind_cores=MASK - nid00032, task 2 2 [16102]: mask 0xf00 set
Cpus_allowed:   0f00
Cpus_allowed_list:      8-11
cpu_bind_cores=MASK - nid00032, task 3 3 [16103]: mask 0xf000 set
Cpus_allowed:   f000
Cpus_allowed_list:      12-15

* An indication should be made when a CPU binding is rejected, ideally with the reason why. Perhaps only when --cpu_bind=verbose is used.
* The customer has asked: "if I add --ntasks-per-core=2 (aprun equivalent to -j2) to override CR_ONE_TASK_PER_CORE and specify --cpu_bind=... (without "threads") should that now switch to thread binding instead of core binding?"
* I'm also curious as to why there's a difference between task/cgroup and task/affinity CPU binding here.
David, if they change their srun line to ask for threads:

--cpu_bind=threads,v,mask_cpu:0xf,0xf0,0xf00,0xf000

they should get what they expect. This issue was discovered last year and we haven't finalized how it should be handled. The patch below will also fix this scenario, making the cgroup plugin behave more like the affinity plugin.

Some history of the task/cgroup plugin: the main goal of its affinity support is to provide a "soft" binding (using sched_setaffinity logic) of the tasks inside a "hard" confinement on the allocated cores (using cpuset). The "soft" binding lets users do whatever they want by simply calling sched_setaffinity again in their tasks. This is an improvement over the task/affinity plugin, which used cpuset for both confinement and affinity, forbidding any alteration of the chosen affinity by the user once the job was started.

We aren't putting this patch into the code just yet, but it will do what the user expects and apply thread binding without --cpu_bind=threads. I don't know how to notify the user outside of failing the job. If that is expected then we could go that route; what do you think the user would want?

Thanks

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 7cb5044..5fbce0b 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1207,11 +1207,11 @@ extern int task_cgroup_cpuset_set_task_affinity(stepd_step_rec_t *job)
 		hwtype = HWLOC_OBJ_PU;
 		nobj = npus;
 	}
-	if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
+	else if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
 		hwtype = HWLOC_OBJ_CORE;
 		nobj = ncores;
 	}
-	if (nsockets >= jntasks &&
+	else if (nsockets >= jntasks &&
 	    bind_type & CPU_BIND_TO_SOCKETS) {
 		hwtype = socket_or_node;
 		nobj = nsockets;
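The behavioural difference this one-word patch makes can be shown with a standalone sketch (hypothetical condition flags standing in for the npus/ncores/nsockets tests and CPU_BIND_TO_* checks; not the actual Slurm code): with independent `if` statements, a task that already qualified for thread (PU) level binding is overwritten by the later core-level test, while an `else if` chain keeps the first level that matched.

```c
/* Granularity levels, loosely mirroring HWLOC_OBJ_PU/CORE and the
 * socket case in the real plugin. Each *_cond argument stands in for
 * the corresponding object-count test and bind-flag check. */
enum level { LEVEL_PU, LEVEL_CORE, LEVEL_SOCKET };

/* Before the patch: independent ifs, so a later, coarser match
 * overwrites an earlier, finer one. */
static enum level pick_with_if(int pu_cond, int core_cond, int socket_cond)
{
    enum level hw = LEVEL_PU;
    if (pu_cond)
        hw = LEVEL_PU;
    if (core_cond)
        hw = LEVEL_CORE;
    if (socket_cond)
        hw = LEVEL_SOCKET;
    return hw;
}

/* After the patch: else-if chain, so the first matching level sticks. */
static enum level pick_with_else_if(int pu_cond, int core_cond, int socket_cond)
{
    enum level hw = LEVEL_PU;
    if (pu_cond)
        hw = LEVEL_PU;
    else if (core_cond)
        hw = LEVEL_CORE;
    else if (socket_cond)
        hw = LEVEL_SOCKET;
    return hw;
}
```

With both the thread-level and core-level conditions true (the --ntasks-per-core=2 case), the unpatched logic ends up at core level, the path that then reports "not enough Core objects", while the patched logic stays at thread level.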
I don't know if this should be considered a severity 2 bug based on the definition of severity 2 - "A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system." I understand why you want to keep it consistent with what you have in your bug tracker, but please consider following the description of severity levels we use.
I'll relay your comments to the customer and get their feedback; thanks.
(In reply to Danny Auble from comment #1)
> I don't know how to notify the user outside of failing the job. If that is
> expected then we could go that route, what do you think would be what the
> user wants?

I was expecting something like putting the "slurmstepd: task/cgroup: task[2] not enough Core objects, disabling affinity" message (or something similar) in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've asked the customer what they were expecting.
(In reply to David Gloe from comment #4)
> (In reply to Danny Auble from comment #1)
> > I don't know how to notify the user outside of failing the job. If that is
> > expected then we could go that route, what do you think would be what the
> > user wants?
>
> I was expecting something like putting the "slurmstepd: task/cgroup: task[2]
> not enough Core objects, disabling affinity" message (or something similar)
> in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've
> asked the customer what they were expecting.

That is there, from srun...

slurmstepd-snowflake1: task/cgroup: task[0] not enough Core objects, disabling affinity
Hmm, I'm not getting that:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=verbose,mask_cpu:0xf,0xf0,0xf00,0xf000 grep Cpus_allowed /proc/self/status
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
It does not appear you have TaskAffinity=yes in your cgroup.conf. Could you check and verify?
dgloe@opal-p2:~> grep TaskAffinity /etc/opt/slurm/cgroup.conf
TaskAffinity=yes

Those messages appear when I use --slurmd-debug=1 but not otherwise.
Yes, you are correct, the --slurmd-debug=1 is needed (sorry I missed that before). I think it is needed to set up the communication back to srun. I'll see if it can happen some other way, but I don't think so.
Created attachment 741 [details]
patch to provide TaskAffinity=hard to cgroup.conf file

David, please test this patch, which should get you the desired task/affinity behaviour in task/cgroup. After applying the patch, just change your cgroup.conf file to have TaskAffinity=hard instead of TaskAffinity=yes.

Let me know if you find any corner cases where this doesn't fix the issue. If all is well we will put this into 14.03.1.
This has been committed (2a72aa51f4672eae21d7041e4cbc8a50edc81391); if you find issues with this, please reopen.