Ticket 2236

Summary: stepd_add_extern_pid (called from pam_slurm_adopt) doesn't adopt into the cpuset cgroup?
Product: Slurm
Reporter: Chris Samuel <samuel>
Component: slurmstepd
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: da
Version: 15.08.4   
Hardware: Linux   
OS: Linux   
Site: VLSCI
Version Fixed: 15.08.5, 16.05.0-pre1

Description Chris Samuel 2015-12-09 16:37:12 MST
Hi there,

We've had a chance to do our upgrade to 15.08.4 now (hooray!), and I've found some odd behaviour which I suspect is a Slurm bug.

Here's an interactive job which works as expected:

[samuel@barcoo001 ~]$ cat /proc/$$/cpuset
/slurm/uid_500/job_4993544/step_0

[samuel@barcoo001 ~]$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_4993544/step_0/task_0
3:memory:/slurm/uid_500/job_4993544/step_0
2:cpuset:/slurm/uid_500/job_4993544/step_0
1:freezer:/slurm/uid_500/job_4993544/step_0

[samuel@barcoo001 ~]$ lstopo | fgrep Core | wc -l
1


All good.
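
The core count from lstopo can be cross-checked without hwloc by reading the Cpus_allowed_list field of /proc/<pid>/status, which holds the CPU affinity list of a process in a form such as "0-3,8". A minimal sketch follows, assuming a POSIX shell; count_cpus is a hypothetical helper name and is not part of Slurm or hwloc.

```shell
# Count the CPUs named in a kernel-style CPU list such as "0-3,8".
# count_cpus is a hypothetical helper, not part of Slurm or hwloc.
# The function body runs in a subshell so the IFS change does not leak.
count_cpus() (
    IFS=,
    total=0
    for range in $1; do
        start=${range%-*}   # text before the dash (or the whole item)
        end=${range#*-}     # text after the dash (or the whole item)
        total=$(( total + end - start + 1 ))
    done
    echo "$total"
)
```

To apply it to the current shell, something like count_cpus "$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/$$/status)" should work. Inside a correctly constrained one-core job it would be expected to print 1.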

When I SSH into that node I get *almost* the same thanks to pam_slurm_adopt, except that my CPUs are not constrained:

[samuel@barcoo001 ~]$ cat /proc/$$/cpuset
/

[samuel@barcoo001 ~]$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_4993544/step_extern/task_0
3:memory:/slurm/uid_500/job_4993544/step_extern/task_0
2:cpuset:/
1:freezer:/slurm/uid_500/job_4993544/step_extern

[samuel@barcoo001 ~]$ lstopo | fgrep Core | wc -l
16
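
A check like the one above can be automated by scanning the cgroup file for controllers whose path is not under /slurm/. This is only a sketch, assuming the cgroup v1 "id:controller:path" line format shown in the transcripts; unconfined_controllers is a hypothetical helper name.

```shell
# Print every controller in a /proc/<pid>/cgroup file (cgroup v1 format,
# "id:controller:path") whose path does not start with /slurm/.
# unconfined_controllers is a hypothetical helper, not a Slurm tool.
unconfined_controllers() {
    awk -F: '$3 !~ /^\/slurm\// { print $2 }' "$1"
}
```

Run as unconfined_controllers /proc/$$/cgroup, this would print cpuset for the SSH session shown above and nothing for the interactive job step.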


Looking at the code for pam_slurm_adopt, it appears that it just tells Slurm to adopt the process into the job through stepd_add_extern_pid() and the rest is up to Slurm. Is that correct?

This is RHEL 6.7 x86-64 with latest updates (except for libpng which came out today).

cheers!
Chris
Comment 1 Tim Wickberg 2015-12-10 04:31:48 MST
The bad news: This is definitely a bug as you've noticed.

The good news: A fix is already in for 15.08.5 (due out in the next week or so).

There's a slew of commits related to pam_slurm_adopt that address this and some other issues discovered with the plugin - f52888d, 53a7c34, among others.

cheers,
- Tim
Comment 2 Chris Samuel 2015-12-10 10:21:34 MST
(In reply to Tim Wickberg from comment #1)

> The bad news: This is definitely a bug as you've noticed.
> 
> The good news: A fix is already in for 15.08.5 (due out in the next week or
> so).

Yay.

> There's a slew of commits related to pam_slurm_adopt that address this and
> some other issues discovered with the plugin - f52888d, 53a7c34, among
> others.

We tried to cherry-pick them, but there were too many interdependencies with other commits.

We'll wait for 15.08.5 to appear and try that.

Thanks!
Chris
Comment 3 Danny Auble 2015-12-10 10:22:25 MST
Just released ;)
Comment 4 Chris Samuel 2015-12-10 10:33:18 MST
(In reply to Danny Auble from comment #3)

> Just released ;)

You rock. :-)

Happy seasonal festival of your choice!
Comment 5 Chris Samuel 2015-12-10 11:34:37 MST
Confirming it works in 15.08.5