Bug 2236 - stepd_add_extern_pid (called from pam_slurm_adopt) doesn't adopt into the cpuset cgroup?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 15.08.4
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2015-12-09 16:37 MST by Chris Samuel
Modified: 2015-12-10 11:34 MST
Site: VLSCI
Version Fixed: 15.08.5 16.05.0-pre1


Description Chris Samuel 2015-12-09 16:37:12 MST
Hi there,

We've had a chance to do our upgrade to 15.08.4 now (hooray!), and I've found some odd behaviour which I suspect is a Slurm bug.

Here's an interactive job which works as expected:

[samuel@barcoo001 ~]$ cat /proc/$$/cpuset
/slurm/uid_500/job_4993544/step_0

[samuel@barcoo001 ~]$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_4993544/step_0/task_0
3:memory:/slurm/uid_500/job_4993544/step_0
2:cpuset:/slurm/uid_500/job_4993544/step_0
1:freezer:/slurm/uid_500/job_4993544/step_0

[samuel@barcoo001 ~]$ lstopo | fgrep Core | wc -l
1


All good.

When I SSH into that node I get *almost* the same thanks to pam_slurm_adopt, except my CPUs are not constrained:

[samuel@barcoo001 ~]$ cat /proc/$$/cpuset
/

[samuel@barcoo001 ~]$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_4993544/step_extern/task_0
3:memory:/slurm/uid_500/job_4993544/step_extern/task_0
2:cpuset:/
1:freezer:/slurm/uid_500/job_4993544/step_extern

[samuel@barcoo001 ~]$ lstopo | fgrep Core | wc -l
16
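
The comparison above can also be automated. The following is a minimal sketch, not part of Slurm: it parses the cgroup v1 format shown in /proc/$$/cgroup (hierarchy-ID:controller-list:path) and lists the controllers whose path is not under the Slurm hierarchy. The function names are hypothetical, and the "/slurm/" prefix is an assumption taken from the outputs above.

```python
def parse_cgroup(text):
    """Map each cgroup v1 controller name to its cgroup path."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:path"; split at most
        # twice so that a ':' inside the path is preserved.
        _, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            if name:
                paths[name] = path
    return paths


def unadopted_controllers(text, prefix="/slurm/"):
    """Return the controllers whose path is not under `prefix` (assumed)."""
    return sorted(
        name for name, path in parse_cgroup(text).items()
        if not path.startswith(prefix)
    )


# Output copied from the SSH session in this report.
SSH_SESSION = """\
4:cpuacct:/slurm/uid_500/job_4993544/step_extern/task_0
3:memory:/slurm/uid_500/job_4993544/step_extern/task_0
2:cpuset:/
1:freezer:/slurm/uid_500/job_4993544/step_extern
"""

print(unadopted_controllers(SSH_SESSION))  # → ['cpuset']
```

On a node, the same check could be run against the contents of /proc/<pid>/cgroup for the shell started by SSH. An empty list would mean every listed controller is inside the job's hierarchy.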


Looking at the code for pam_slurm_adopt, it appears that it just tells Slurm to adopt the process through stepd_add_extern_pid() and the rest is up to Slurm. Is that correct?

This is RHEL 6.7 x86-64 with latest updates (except for libpng which came out today).

cheers!
Chris
Comment 1 Tim Wickberg 2015-12-10 04:31:48 MST
The bad news: This is definitely a bug as you've noticed.

The good news: A fix is already in for 15.08.5 (due out in the next week or so).

There's a slew of commits related to pam_slurm_adopt that address this and some other issues discovered with the plugin - f52888d, 53a7c34, among others.

cheers,
- Tim
Comment 2 Chris Samuel 2015-12-10 10:21:34 MST
(In reply to Tim Wickberg from comment #1)

> The bad news: This is definitely a bug as you've noticed.
> 
> The good news: A fix is already in for 15.08.5 (due out in the next week or
> so).

Yay.

> There's a slew of commits related to pam_slurm_adopt that address this and
> some other issues discovered with the plugin - f52888d, 53a7c34, among
> others.

We tried to cherry-pick them, but there were too many interdependencies with other commits.

We'll wait for 15.08.5 to appear and try that.

Thanks!
Chris
Comment 3 Danny Auble 2015-12-10 10:22:25 MST
Just released ;)
Comment 4 Chris Samuel 2015-12-10 10:33:18 MST
(In reply to Danny Auble from comment #3)

> Just released ;)

You rock. :-)

Happy seasonal festival of your choice!
Comment 5 Chris Samuel 2015-12-10 11:34:37 MST
Confirming it works in 15.08.5.