Summary: | task/cgroup silently ignores some --cpu_bind requests | ||
---|---|---|---|
Product: | Slurm | Reporter: | David Gloe <david.gloe> |
Component: | Other | Assignee: | Danny Auble <da> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | da |
Version: | 14.03.0 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | CRAY | Version Fixed: | 14.03.1 |
Attachments: | patch to provide TaskAffinity=hard to cgroup.conf file |
Description
David Gloe
2014-04-09 07:56:33 MDT
David, if they change their srun line to ask for threads

--cpu_bind=threads,v,mask_cpu:0xf,0xf0,0xf00,0xf000

they should get what they expect. This issue was discovered last year and we haven't finalized how it should be handled. The patch below will also fix this scenario, making the cgroup plugin behave more like the affinity plugin.

Some history of the task/cgroup plugin:

The main goal of the affinity support of the task/cgroup plugin is to propose a "soft" binding (using sched_setaffinity logic) of the tasks inside a "hard" confinement on the allocated cores (using cpuset). The "soft" binding lets users do whatever they want by simply calling sched_setaffinity again in their tasks. This is an improvement over the task/affinity plugin, which used cpuset for both confinement and affinity, forbidding the user from altering the chosen affinity once the job was started.

We aren't putting this patch into the code just yet, but it will do what the user expects and apply thread binding without --cpu_bind=threads.

I don't know how to notify the user outside of failing the job. If that is expected then we could go that route. What do you think the user would want?

Thanks

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 7cb5044..5fbce0b 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1207,11 +1207,11 @@ extern int task_cgroup_cpuset_set_task_affinity(stepd_step_rec_t *job)
 		hwtype = HWLOC_OBJ_PU;
 		nobj = npus;
 	}
-	if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
+	else if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
 		hwtype = HWLOC_OBJ_CORE;
 		nobj = ncores;
 	}
-	if (nsockets >= jntasks &&
+	else if (nsockets >= jntasks &&
 	    bind_type & CPU_BIND_TO_SOCKETS) {
 		hwtype = socket_or_node;
 		nobj = nsockets;

I don't know if this should be considered a severity 2 bug based on the definition of severity 2: "A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system." I understand why you want to keep it consistent with what you have in your bug tracker, but please consider following the description of severity levels we use.

I'll relay your comments to the customer and get their feedback; thanks.

(In reply to Danny Auble from comment #1)
> I don't know how to notify the user outside of failing the job. If that is
> expected then we could go that route, what do you think would be what the
> user wants?

I was expecting something like putting the "slurmstepd: task/cgroup: task[2] not enough Core objects, disabling affinity" message (or something similar) in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've asked the customer what they were expecting.

(In reply to David Gloe from comment #4)
> (In reply to Danny Auble from comment #1)
> > I don't know how to notify the user outside of failing the job. If that is
> > expected then we could go that route, what do you think would be what the
> > user wants?
>
> I was expecting something like putting the "slurmstepd: task/cgroup: task[2]
> not enough Core objects, disabling affinity" message (or something similar)
> in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've
> asked the customer what they were expecting.

That is there, from srun...

slurmstepd-snowflake1: task/cgroup: task[0] not enough Core objects, disabling affinity

Hmm, I'm not getting that:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=verbose,mask_cpu:0xf,0xf0,0xf00,0xf000 grep Cpus_allowed /proc/self/status
Cpus_allowed: ffff
Cpus_allowed_list: 0-15
Cpus_allowed: ffff
Cpus_allowed_list: 0-15
Cpus_allowed: ffff
Cpus_allowed_list: 0-15
Cpus_allowed: ffff
Cpus_allowed_list: 0-15

It does not appear you have TaskAffinity=yes in your cgroup.conf. Could you check and verify?

dgloe@opal-p2:~> grep TaskAffinity /etc/opt/slurm/cgroup.conf
TaskAffinity=yes

Those messages appear when I use --slurmd-debug=1 but not otherwise.

Yes, you are correct, the --slurmd-debug=1 is needed (sorry I missed that before). I think it is needed to set up the communication back to srun. I'll see if it can happen some other way, but I don't think so.

Created attachment 741
patch to provide TaskAffinity=hard to cgroup.conf file
David, please test this patch which should get you the desired task/affinity behaviour in task/cgroup.
After applying the patch just change your cgroup.conf file to have
TaskAffinity=hard
instead of
TaskAffinity=yes
Let me know if you find any corner cases where this doesn't fix the issue.
If all is well we will put this into 14.03.1
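
When hunting for corner cases, one simple check is whether each task's Cpus_allowed_list agrees with the mask passed to mask_cpu. The following is only a minimal sketch in C, assuming masks fit in 64 bits and that the kernel prints CPU lists as comma-separated ranges; mask_to_cpu_list and binding_matches are hypothetical helper names and are not part of Slurm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Slurm): render a CPU bitmask as a
 * Linux-style CPU list such as "0-3,8". Returns 0 on success, or -1 if
 * the output buffer is too small. */
int mask_to_cpu_list(unsigned long long mask, char *buf, size_t len)
{
    size_t used = 0;
    int cpu = 0;

    assert(buf != NULL);
    if (len == 0)
        return -1;
    buf[0] = '\0';

    while (cpu < 64) {
        if (!((mask >> cpu) & 1ULL)) {
            cpu++;
            continue;
        }
        /* Start of a run of set bits; advance cpu to the end of the run. */
        int start = cpu;
        while (cpu + 1 < 64 && ((mask >> (cpu + 1)) & 1ULL))
            cpu++;

        int n;
        if (start == cpu)
            n = snprintf(buf + used, len - used, "%s%d",
                         used ? "," : "", start);
        else
            n = snprintf(buf + used, len - used, "%s%d-%d",
                         used ? "," : "", start, cpu);
        if (n < 0 || (size_t)n >= len - used)
            return -1;  /* output would have been truncated */
        used += (size_t)n;
        cpu++;
    }
    return 0;
}

/* Returns 1 when the reported Cpus_allowed_list string matches the list
 * implied by the requested mask, 0 otherwise. */
int binding_matches(unsigned long long requested_mask,
                    const char *reported_list)
{
    char expected[256];

    assert(reported_list != NULL);
    if (mask_to_cpu_list(requested_mask, expected, sizeof(expected)) != 0)
        return 0;
    return strcmp(expected, reported_list) == 0;
}
```

For the masks used earlier in this report, 0xf, 0xf0, 0xf00 and 0xf000 correspond to the lists 0-3, 4-7, 8-11 and 12-15, whereas the run shown above reported 0-15 for every task.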
This has been committed (2a72aa51f4672eae21d7041e4cbc8a50edc81391). If you find issues with this, please reopen.
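
For reference, the effect of the diff proposed in the first reply (turning independent if statements into an else-if chain) can be sketched in isolation. This is a simplified model, not the plugin code: the enum values, struct topo, and both function names are hypothetical, and the first condition (the one guarding the PU case) is an assumption, since the diff only shows the body of that branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the plugin's bind flags and hwloc object types. */
enum { BIND_THREADS = 1, BIND_CORES = 2, BIND_SOCKETS = 4 };
typedef enum { OBJ_NONE, OBJ_PU, OBJ_CORE, OBJ_SOCKET } obj_t;

struct topo {
    int npus, ncores, nsockets; /* objects available on the node */
    int jnpus, jntasks;         /* CPUs and tasks requested by the step */
};

/* Shape before the diff: independent ifs, so a later match overrides an
 * earlier one and the coarsest matching object type wins. */
obj_t pick_independent(const struct topo *t, int bind_type)
{
    obj_t hw = OBJ_NONE;

    assert(t != NULL);
    /* ASSUMPTION: this condition is not visible in the diff. */
    if (t->npus >= t->jnpus || (bind_type & BIND_THREADS))
        hw = OBJ_PU;
    if (t->ncores >= t->jnpus || (bind_type & BIND_CORES))
        hw = OBJ_CORE;
    if (t->nsockets >= t->jntasks && (bind_type & BIND_SOCKETS))
        hw = OBJ_SOCKET;
    return hw;
}

/* Shape after the diff: an else-if chain, so the first match wins and the
 * finest matching object type is kept. */
obj_t pick_chained(const struct topo *t, int bind_type)
{
    obj_t hw = OBJ_NONE;

    assert(t != NULL);
    /* ASSUMPTION: same unseen first condition as above. */
    if (t->npus >= t->jnpus || (bind_type & BIND_THREADS))
        hw = OBJ_PU;
    else if (t->ncores >= t->jnpus || (bind_type & BIND_CORES))
        hw = OBJ_CORE;
    else if (t->nsockets >= t->jntasks && (bind_type & BIND_SOCKETS))
        hw = OBJ_SOCKET;
    return hw;
}
```

Under these assumptions, a step that fits on the available PUs but also carries a core-binding flag ends up on the Core type with the independent ifs and on the PU type with the chain, which is consistent with the remark that the diff applies thread binding without --cpu_bind=threads.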