A customer discovered this issue and I reproduced it on an in-house system. I've set the severity here to match our customer bug's severity.

The basic configuration is:

SelectType=select/cray
SelectTypeParameters=CR_CPU_Memory,other_cons_res,CR_ONE_TASK_PER_CORE
TaskPlugin=task/cgroup,task/cray

And in cgroup.conf:

TaskAffinity=yes

On a node with Sockets=1 CoresPerSocket=8 ThreadsPerCore=2:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=v,mask_cpu:0xf,0xf0,0xf00,0xf000 --slurmd-debug=1 grep Cpus_allowed /proc/self/status
slurmstepd: Created file /var/opt/cray/alps/spool/status11762
slurmstepd: task/cgroup: task[0] is requesting core level binding
slurmstepd: task/cgroup: task[1] is requesting core level binding
slurmstepd: task/cgroup: task[2] is requesting core level binding
slurmstepd: task/cgroup: task[2] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[1] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[0] not enough Core objects, disabling affinity
slurmstepd: task/cgroup: task[3] is requesting core level binding
slurmstepd: task/cgroup: task[3] not enough Core objects, disabling affinity
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15

Without --slurmd-debug=1 or looking at /proc/self/status, there's no indication that the affinity setting was rejected.
In addition, this exact same setup works for task/affinity:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=v,mask_cpu:0xf,0xf0,0xf00,0xf000 grep Cpus_allowed /proc/self/status
cpu_bind_cores=MASK - nid00032, task 0 0 [16100]: mask 0xf set
Cpus_allowed:   000f
Cpus_allowed_list:      0-3
cpu_bind_cores=MASK - nid00032, task 1 1 [16101]: mask 0xf0 set
Cpus_allowed:   00f0
Cpus_allowed_list:      4-7
cpu_bind_cores=MASK - nid00032, task 2 2 [16102]: mask 0xf00 set
Cpus_allowed:   0f00
Cpus_allowed_list:      8-11
cpu_bind_cores=MASK - nid00032, task 3 3 [16103]: mask 0xf000 set
Cpus_allowed:   f000
Cpus_allowed_list:      12-15

* An indication should be made when a CPU binding is rejected, ideally with the reason why. Perhaps only when --cpu_bind=verbose is used.
* The customer has asked: "if I add --ntasks-per-core=2 (aprun equivalent to -j2) to override CR_ONE_TASK_PER_CORE and specify --cpu_bind=... (without "threads") should that now switch to thread binding instead of core binding?"
* I'm also curious as to why there's a difference between task/cgroup and task/affinity CPU binding here.
David, if they change their srun line to ask for threads:

--cpu_bind=threads,v,mask_cpu:0xf,0xf0,0xf00,0xf000

they should get what they expect. This issue was discovered last year and we haven't finalized how it should be handled. The patch below will also fix this scenario, making the cgroup plugin behave more like the affinity plugin.

Some history of the task/cgroup plugin: the main goal of its affinity support is to provide a "soft" binding (using sched_setaffinity logic) of the tasks inside a "hard" confinement on the allocated cores (using cpuset). The "soft" binding lets users do whatever they want by simply calling sched_setaffinity again in their tasks. This is an improvement over the task/affinity plugin, which used cpuset for both confinement and affinity, forbidding any alteration of the chosen affinity by the user once the job was started.

We aren't putting this patch into the code just yet, but it will do what the user expects and apply thread binding without --cpu_bind=threads. I don't know how to notify the user outside of failing the job. If that is expected then we could go that route; what do you think the user would want?

Thanks

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 7cb5044..5fbce0b 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1207,11 +1207,11 @@ extern int task_cgroup_cpuset_set_task_affinity(stepd_step_rec_t *job)
 		hwtype = HWLOC_OBJ_PU;
 		nobj = npus;
 	}
-	if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
+	else if (ncores >= jnpus || bind_type & CPU_BIND_TO_CORES) {
 		hwtype = HWLOC_OBJ_CORE;
 		nobj = ncores;
 	}
-	if (nsockets >= jntasks &&
+	else if (nsockets >= jntasks &&
 	    bind_type & CPU_BIND_TO_SOCKETS) {
 		hwtype = socket_or_node;
 		nobj = nsockets;
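The behavioural difference this one-word patch makes can be shown with a standalone sketch (hypothetical condition flags standing in for the npus/ncores/nsockets tests and CPU_BIND_TO_* checks; not the actual Slurm code): with independent `if` statements, a task that already qualified for thread (PU) level binding is overwritten by the later core-level test, while an `else if` chain keeps the first level that matched.

```c
/* Granularity levels, loosely mirroring HWLOC_OBJ_PU/CORE and the
 * socket case in the real plugin. Each *_cond argument stands in for
 * the corresponding object-count test and bind-flag check. */
enum level { LEVEL_PU, LEVEL_CORE, LEVEL_SOCKET };

/* Before the patch: independent ifs, so a later, coarser match
 * overwrites an earlier, finer one. */
static enum level pick_with_if(int pu_cond, int core_cond, int socket_cond)
{
    enum level hw = LEVEL_PU;
    if (pu_cond)
        hw = LEVEL_PU;
    if (core_cond)
        hw = LEVEL_CORE;
    if (socket_cond)
        hw = LEVEL_SOCKET;
    return hw;
}

/* After the patch: else-if chain, so the first matching level sticks. */
static enum level pick_with_else_if(int pu_cond, int core_cond, int socket_cond)
{
    enum level hw = LEVEL_PU;
    if (pu_cond)
        hw = LEVEL_PU;
    else if (core_cond)
        hw = LEVEL_CORE;
    else if (socket_cond)
        hw = LEVEL_SOCKET;
    return hw;
}
```

With both the thread-level and core-level conditions true (the --ntasks-per-core=2 case), the unpatched logic ends up at core level, the path that then reports "not enough Core objects", while the patched logic stays at thread level.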
I don't know if this should be considered a severity 2 bug based on the definition of severity 2 - "A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system." I understand why you want to keep it consistent with what you have in your bug tracker, but please consider following the description of severity levels we use.
I'll relay your comments to the customer and get their feedback; thanks.
(In reply to Danny Auble from comment #1)
> I don't know how to notify the user outside of failing the job. If that is
> expected then we could go that route, what do you think would be what the
> user wants?

I was expecting something like putting the "slurmstepd: task/cgroup: task[2] not enough Core objects, disabling affinity" message (or something similar) in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've asked the customer what they were expecting.
(In reply to David Gloe from comment #4)
> (In reply to Danny Auble from comment #1)
> > I don't know how to notify the user outside of failing the job. If that is
> > expected then we could go that route, what do you think would be what the
> > user wants?
>
> I was expecting something like putting the "slurmstepd: task/cgroup: task[2]
> not enough Core objects, disabling affinity" message (or something similar)
> in the --cpu_bind=verbose output rather than just slurmstepd debug, but I've
> asked the customer what they were expecting.

That is there, from srun...

slurmstepd-snowflake1: task/cgroup: task[0] not enough Core objects, disabling affinity
Hmm, I'm not getting that:

dgloe@opal-p2:~> srun -n 4 -c 4 -w nid00032 --ntasks-per-core=2 --cpu_bind=verbose,mask_cpu:0xf,0xf0,0xf00,0xf000 grep Cpus_allowed /proc/self/status
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
Cpus_allowed:   ffff
Cpus_allowed_list:      0-15
It does not appear you have TaskAffinity=yes in your cgroup.conf. Could you check and verify?
dgloe@opal-p2:~> grep TaskAffinity /etc/opt/slurm/cgroup.conf
TaskAffinity=yes

Those messages appear when I use --slurmd-debug=1 but not otherwise.
Yes, you are correct, the --slurmd-debug=1 is needed (sorry I missed that before). I think it is needed to set up the communication back to srun. I'll see if it can happen some other way, but I don't think so.
Created attachment 741 [details]
patch to provide TaskAffinity=hard to cgroup.conf file

David, please test this patch, which should get you the desired task/affinity behaviour in task/cgroup. After applying the patch, just change your cgroup.conf file to have TaskAffinity=hard instead of TaskAffinity=yes.

Let me know if you find any corner cases where this doesn't fix the issue. If all is well we will put this into 14.03.1.
This has been committed (2a72aa51f4672eae21d7041e4cbc8a50edc81391); if you find issues with this, please reopen.