Bug 2097

Summary:	extern step accounting
Product:	Slurm	Reporter:	Ryan Cox <ryan_cox>
Component:	Accounting	Assignee:	Danny Auble <da>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	15.08.3
Hardware:	Linux
OS:	Linux
Site:	BYU - Brigham Young University	Alineos Sites:	---
Atos/Eviden Sites:	---	Confidential Site:	---
Coreweave sites:	---	Cray Sites:	---
DS9 clusters:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Linux Distro:	---
Machine Name:		CLE Version:
Version Fixed:	15.08.5 16.05.0-pre1	Target Release:	---
DevPrio:	---	Emory-Cloud Sites:	---
Attachments:	uncleaned patch for having the stepd adopt the process instead of the pam module Patch to make _try_rpc adopt again Patch to cleanup some of the pam_slurm_adopt code Better patch to clean up some of the pam_slurm_adopt code

Description Ryan Cox 2015-11-03 07:42:01 MST

I'm not seeing any accounting for the extern step through sacct or in the mysql database.  Bug 1593 comment 9 makes me think that it is meant to be there.

Comment 1 David Bigagli 2015-11-04 03:26:45 MST

Hi,
   we got this one too and looking at it.

David

Comment 2 Moe Jette 2015-11-05 03:13:22 MST

There are some components in the logic missing. Working on it now.

Comment 3 Moe Jette 2015-11-05 03:54:10 MST

Here is the relevant bit of logic that is probably responsible for no accounting information. If the "external" step accumulates no CPU time use, it does not generate an accounting record right now. Removing the test for zero time use would change that. This is in src/slurmd/slurmstepd/mgr.c:


/* If accounting by the job is minimal (i.e. just our "sleep"), then don't
 * send accounting information to slurmctld */
static bool _minimal_acctg(stepd_step_rec_t *job)
{
	if (!job->jobacct)			/* No accounting data */
		return true;
	if ((job->jobacct->sys_cpu_sec == 0) &&	/* No measurable usage */
	    (job->jobacct->user_cpu_sec == 0))
		return true;

	return false;
}

/*
 * Send a single step completion message, which represents a single range
 * of complete job step nodes.
 */
/* caller is holding step_complete.lock */
static void
_one_step_complete_msg(stepd_step_rec_t *job, int first, int last)
{
	...
	if ((job->stepid == SLURM_EXTERN_CONT) && _minimal_acctg(job))
		return;

Comment 4 Ryan Cox 2015-11-05 08:15:36 MST

I like that approach but I don't think that it is causing the problem.

bash-4.1$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8766977/cpuacct.usage
85473749281
bash-4.1$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8766977/cpuacct.usage_percpu 
0 0 0 85476173929 0 0 0 0 0 0 0 0 
bash-4.1$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8766977/cpuacct.stat 
user 8347
system 191
bash-4.1$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8766977/step_extern/cpuacct.stat 
user 8347
system 191

I'm still not seeing anything in sacct for the extern step, even though there is definitely usage.  I may try printing some debug information inside that function but I might not get around to it today.

Comment 5 Ryan Cox 2015-11-05 10:06:04 MST

I added debug statements and see that it is exiting in the "No measurable usage" block for the extern step.  This was the usage right before I exited:
$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8770410/step_extern/cpuacct.usage
268353773218

Comment 6 Moe Jette 2015-11-05 13:13:57 MST

David, I need to transfer this back to you. You'll need to configure a Prolog script plus set PrologFlags=alloc,contain. The launch of the process which just sleeps for the lifetime of the job is in src/slurmd/slurmstepd/mgr.c. See _spawn_job_container(), which has logic very similar to _fork_all_tasks() in the same file. It will get launched when you just run "salloc bash". Ryan is adding processes the the cgroup and tracking the resource use. I would recommend that you just start by spawning something that consumes resources (rather than "sleep") and see that the accounting of resources gets stored in the database.

Comment 7 Ryan Cox 2015-11-06 02:02:29 MST

I just tried changing the sleep execl to call "dd if=/dev/urandom of=/dev/null".  That usage did show up correctly in sacct.  I then changed it back to sleep then tried again inside a different job.  This time I logged in via ssh, verified I'm in the right cgroup, then ran the same dd command for about two minutes.  This did not get accounted for as it said there was zero usage, though the cpuacct cgroup clearly showed lots of usage:

$ cat /proc/self/cgroup 
4:cpuacct:/slurm/uid_5627/job_8776097/step_extern
3:memory:/slurm/uid_5627/job_8776097/step_extern
2:cpuset:/slurm/uid_5627/job_8776097/step_extern
1:freezer:/slurm/uid_5627/job_8776097/step_extern
$ time dd if=/dev/urandom of=/dev/null bs=1M count=10000
^C923+0 records in
922+0 records out
966787072 bytes (967 MB) copied, 109.105 s, 8.9 MB/s

real	1m49.107s
user	0m0.000s
sys	1m49.103s

$ cat /cgroup/cpuacct/slurm/uid_$UID/job_8776097/step_extern/cpuacct.usage
109264004189

Comment 8 Danny Auble 2015-11-09 11:22:46 MST

Ryan commit 129b5c332d4f6 should fix this for you.  

Commits 6e9b0bfd2 and 74a7c5c71e will also get sstat working for you.

Let me know if you still see trouble on the accounting front after this.  If not we can probably close this one.

Comment 9 Ryan Cox 2015-11-10 02:01:16 MST

Great!  Can I test it by just updating the slurmd on a few test nodes or do I need to update slurmdbd and slurmctld too?  If I need to update everything, should it be safe to run everything up to commit 129b5c332d4f6?

Comment 10 Danny Auble 2015-11-10 02:16:31 MST

You will need the slurmctld and the slurmstepd updated I believe then you should be set. I don't believe a restart of the slurmd's are needed nor the slurmdbd. 

On November 10, 2015 8:01:16 AM PST, bugs@schedmd.com wrote:
>http://bugs.schedmd.com/show_bug.cgi?id=2097
>
>--- Comment #9 from Ryan Cox <ryan_cox@byu.edu> ---
>Great!  Can I test it by just updating the slurmd on a few test nodes
>or do I
>need to update slurmdbd and slurmctld too?  If I need to update
>everything,
>should it be safe to run everything up to commit 129b5c332d4f6?
>
>-- 
>You are receiving this mail because:
>You are the assignee for the bug.

Comment 11 Ryan Cox 2015-11-10 02:18:39 MST

Should it be safe to run everything at commit 129b5c332d4f6?

Comment 12 Danny Auble 2015-11-10 02:27:15 MST

Yes, I believe so, if you do that I would restart all Daemons just to be safe. 

On November 10, 2015 8:18:39 AM PST, bugs@schedmd.com wrote:
>http://bugs.schedmd.com/show_bug.cgi?id=2097
>
>--- Comment #11 from Ryan Cox <ryan_cox@byu.edu> ---
>Should it be safe to run everything at commit 129b5c332d4f6?
>
>-- 
>You are receiving this mail because:
>You are the assignee for the bug.

Comment 13 Ryan Cox 2015-11-10 03:17:25 MST

Unless I did something wrong, it does not appear to actually collect the accounting information.  The extern step shows up in sacct but it doesn't account for the CPU and memory usage.  There is some memory usage there but I think that's just the slurmstepd or something.  The CPU columns all have zeroes.

I upgraded the slurmctld and slurmd instances to 129b5c332d4f6 and restarted them in the proper order.  I haven't upgraded all slurmd instances yet but I did upgrade and restart slurmd on the nodes I was testing this on.

Comment 14 Danny Auble 2015-11-10 03:36:38 MST

So, I was noticing a similar thing.  I have noticed the pam module claims to succeed adopting the pid it doesn't make it in.  I just thought it was my VM, as you seem to not have this issue.  If I manually put it in I see it still isn't being grabbed by the acct_gather plugin.  What this patch does is just make sure the data is actually sent to the slurmctld.  It wasn't ever happening before which it needs to whether there is data or not.

I'll see if I can get the accounting plugin to do the correct thing and report back.  Sorry to get your hopes up.

Comment 15 Danny Auble 2015-11-10 09:24:36 MST

So the good news is I am able to find out why this new pid isn't being queried is because the jobacct_gather plugin also needed to be told about the pid (which isn't possible outside of talking to the slurmstepd).  This might require a new
RPC to the slurmd to make this happen which would make the adpot module much simpler.  If you have any ideas let me know else I see what I can do to get this working correctly.

Comment 16 Ryan Cox 2015-11-10 09:29:55 MST

Thanks for continuing to look into this.  I think the best approach is a single RPC to adopt a new process into a specific job and step.  It should adopt it into any relevant plugins that track processes, including jobacct_gather, proctrack, task, etc.  I can't see a reason for a process to be included in one but not another.  This would also allow it to work with plugins other than the cgroups plugins, not that any Linux users should be using anything other than cgroups (AFAIK).

I had thought about looking into that myself a while back but never got around to it.

Comment 17 Danny Auble 2015-11-10 09:32:18 MST

OK, good, this is actually looking to be fairly straight forward.  I'll let you know when I have a patch/commit for you.

Comment 18 Ryan Cox 2015-11-10 09:33:31 MST

(In reply to Danny Auble from comment #17)
> OK, good, this is actually looking to be fairly straight forward.  I'll let
> you know when I have a patch/commit for you.

Awesome! Thanks.

Comment 19 Danny Auble 2015-11-10 10:53:14 MST

Created attachment 2405 [details]
uncleaned patch for having the stepd adopt the process instead of the pam module

Ryan, test this out and see if it works for you.  What doesn't work, is the situation with _rpc_network_callerid() is used, perhaps we just drop that support?  What also doesn't work is (opts.action_unknown == CALLERID_ACTION_USER) I'm not sure about that one either though as it would be fairly difficult to clean that one up.

What are your thoughts?

If you have any improvements to this let me know.  I am having issues with making the pid "stick" in the cgroup though.  It is there one second, and then gone the next.  I am guessing has something to do with my VM which is running Ubuntu 15.10.  Let me know if you have the same issues.

Comment 20 Ryan Cox 2015-11-11 01:33:55 MST

(In reply to Danny Auble from comment #19)
> Created attachment 2405 [details]
> uncleaned patch for having the stepd adopt the process instead of the pam
> module
> 
> Ryan, test this out and see if it works for you.

Hopefully I'll have time to test it today.  I haven't looked at the patch yet.

> What doesn't work, is the
> situation with _rpc_network_callerid() is used, perhaps we just drop that
> support?

Can you rephrase that? I'm not sure what you mean.

> What also doesn't work is (opts.action_unknown ==
> CALLERID_ACTION_USER) I'm not sure about that one either though as it would
> be fairly difficult to clean that one up.
> 
> What are your thoughts?

I can remove that one.  Someone thought it would be a really good idea.  I really disagree but added it anyway since it only took a few minutes.

> If you have any improvements to this let me know.  I am having issues with
> making the pid "stick" in the cgroup though.  It is there one second, and
> then gone the next.  I am guessing has something to do with my VM which is
> running Ubuntu 15.10.  Let me know if you have the same issues.

systemd is likely fighting with you since it's supposed to replace all userland stuff besides the kernel and it wants exclusive control of cgroups.  I'm still waiting for it to include replacements for Firefox and Thunderbird before I adopt it.  I have so far avoided systemd so I don't know for sure, but I would check for any pam modules after pam_slurm_adopt that are related to systemd or cgroups in any way.

Comment 21 David Bigagli 2015-11-11 03:05:22 MST

We did test out cgroup support on CentOS7 which has systemd. We use our
own cgroup hierarchy /cgroup which is different from the one systemd
uses /sys/fs/cgroup.

David

Comment 22 Ryan Cox 2015-11-11 03:20:19 MST

(In reply to David Bigagli from comment #21)
> We did test out cgroup support on CentOS7 which has systemd. We use our
> own cgroup hierarchy /cgroup which is different from the one systemd
> uses /sys/fs/cgroup.
> 
> David

Danny's using Ubuntu so the pam setup could be different.  Something to be aware of if you aren't already is that you can mount cgroup controllers multiple times but they are identical:
# mkdir /tmp/cgrouptest{1..3}
# for a in /tmp/cgrouptest{1..3}; do mount -t cgroup -o memory cgroup $a; done
# mount |grep cgroup
cgroup on /dev/cgroup/cpu type cgroup (rw,relatime,cpu,release_agent=/usr/local/sbin/cgroup_clean)
cgroup on /tmp/cgrouptest1 type cgroup (rw,relatime,memory)
cgroup on /tmp/cgrouptest2 type cgroup (rw,relatime,memory)
cgroup on /tmp/cgrouptest3 type cgroup (rw,relatime,memory)
# mkdir /tmp/cgrouptest1/ryantest
# ls /tmp/cgrouptest{1..3}/ryantest
/tmp/cgrouptest1/ryantest:
cgroup.clone_children  cgroup.procs    memory.force_empty     memory.max_usage_in_bytes  memory.memsw.limit_in_bytes	  memory.memsw.usage_in_bytes	   memory.numa_stat    memory.pressure_level	   memory.stat	      memory.usage_in_bytes  notify_on_release
cgroup.event_control   memory.failcnt  memory.limit_in_bytes  memory.memsw.failcnt	 memory.memsw.max_usage_in_bytes  memory.move_charge_at_immigrate  memory.oom_control  memory.soft_limit_in_bytes  memory.swappiness  memory.use_hierarchy   tasks

/tmp/cgrouptest2/ryantest:
cgroup.clone_children  cgroup.procs    memory.force_empty     memory.max_usage_in_bytes  memory.memsw.limit_in_bytes	  memory.memsw.usage_in_bytes	   memory.numa_stat    memory.pressure_level	   memory.stat	      memory.usage_in_bytes  notify_on_release
cgroup.event_control   memory.failcnt  memory.limit_in_bytes  memory.memsw.failcnt	 memory.memsw.max_usage_in_bytes  memory.move_charge_at_immigrate  memory.oom_control  memory.soft_limit_in_bytes  memory.swappiness  memory.use_hierarchy   tasks

/tmp/cgrouptest3/ryantest:
cgroup.clone_children  cgroup.procs    memory.force_empty     memory.max_usage_in_bytes  memory.memsw.limit_in_bytes	  memory.memsw.usage_in_bytes	   memory.numa_stat    memory.pressure_level	   memory.stat	      memory.usage_in_bytes  notify_on_release
cgroup.event_control   memory.failcnt  memory.limit_in_bytes  memory.memsw.failcnt	 memory.memsw.max_usage_in_bytes  memory.move_charge_at_immigrate  memory.oom_control  memory.soft_limit_in_bytes  memory.swappiness  memory.use_hierarchy   tasks
# echo 16943 > /tmp/cgrouptest1/ryantest/tasks 
# cat /tmp/cgrouptest3/ryantest/tasks
16943

If you're fighting with systemd (not an issue on CentOS7 apparently but maybe an issue for Ubuntu 15.10), then you're fighting with systemd.  It could be something else entirely though.

Comment 23 Ryan Cox 2015-11-12 02:08:18 MST

Created attachment 2421 [details]
Patch to make _try_rpc adopt again

Danny,

I tested everything and it looks fine except for a few things.  I can't figure out why you removed the adoption for after the RPC call but this patch fixes it. It also fixes a bug that was introduced in _indeterminate_multiple when the nodename is null.

The stepd_add_extern_pid RPC seems to work well except that the cpuset cgroup isn't used.

This patch is to be applied after your other patch.  Additionally, I am fairly certain that pam_systemd.so is what is causing the processes to be snatched away.  Try commenting it out anywhere in /etc/pam.d/* or move it to before pam_slurm_adopt.

Comment 24 Ryan Cox 2015-11-12 02:10:22 MST

Created attachment 2422 [details]
Patch to cleanup some of the pam_slurm_adopt code

This patch is mostly to make better comments and rework some of the way returns from functions are done.  Some of the PAM_* and SLURM_* return codes were confused and this should straighten it out.  I separated this out so that it is easier to see functionality changes in the previous patch.

Comment 25 Danny Auble 2015-11-12 07:18:21 MST

(In reply to Ryan Cox from comment #23)
> Created attachment 2421 [details]
> Patch to make _try_rpc adopt again
> 
> Danny,
> 
> I tested everything and it looks fine except for a few things.  I can't
> figure out why you removed the adoption for after the RPC call but this
> patch fixes it. It also fixes a bug that was introduced in
> _indeterminate_multiple when the nodename is null.

The only reason I removed it was because I didn't have time to figure out the code you added in the patch :).  This was what I meant in comment 19 what doesn't work - "situation with _rpc_network_callerid() is used".

I think I might keep the suffix changes, no reason to do an xstrdup if you don't have to.  I'm guessing the reason you changed it was NULL was being printed, I'll just change the initializer to '' instead sorry for breaking it.  I just wanted to get you a patch as soon as I had something working, it wasn't "cleaned" yet as you have done ;).

> 
> The stepd_add_extern_pid RPC seems to work well except that the cpuset
> cgroup isn't used.

I'll see what can be done with this, I am just doing what the normal Slurm code does.  Perhaps it has to with how this step is initialized.  If you figure it out before hand let me know.

> 
> This patch is to be applied after your other patch.  Additionally, I am
> fairly certain that pam_systemd.so is what is causing the processes to be
> snatched away.  Try commenting it out anywhere in /etc/pam.d/* or move it to
> before pam_slurm_adopt.

That was it!  It was the one in common-session, if I comment that out all is well :).  Not sure if that causes anything bad to happen though.  I'm guessing since we keep track of it here it should work out fine right?

Comment 26 Ryan Cox 2015-11-12 07:54:11 MST

(In reply to Danny Auble from comment #25)
> The only reason I removed it was because I didn't have time to figure out
> the code you added in the patch :).  This was what I meant in comment 19
> what doesn't work - "situation with _rpc_network_callerid() is used".

That makes a lot more sense. In my defense, "perhaps we just drop that support" was a little confusing...

> I think I might keep the suffix changes, no reason to do an xstrdup if you
> don't have to.  I'm guessing the reason you changed it was NULL was being
> printed, I'll just change the initializer to '' instead sorry for breaking
> it.  I just wanted to get you a patch as soon as I had something working, it
> wasn't "cleaned" yet as you have done ;).

No worries.  The rough patch was good to have.  I was indeed getting "(null)" in the output.

> > 
> > The stepd_add_extern_pid RPC seems to work well except that the cpuset
> > cgroup isn't used.
> 
> I'll see what can be done with this, I am just doing what the normal Slurm
> code does.  Perhaps it has to with how this step is initialized.  If you
> figure it out before hand let me know.

I doubt I'll get around to looking at it before SC15 or even Thanksgiving, so I'm guessing you'll beat me to it.

> That was it!  It was the one in common-session, if I comment that out all is
> well :).  Not sure if that causes anything bad to happen though.  I'm
> guessing since we keep track of it here it should work out fine right?

As far as I know, all it does is place processes in newly created cgroups (per session?) and does nothing else with it (unless you configure it to do something extra?).  The second patch makes a note about it in the README.  I'm guessing that most HPC sites won't have any use for the pam_systemd module but that is just a guess at this point.

I have to say, I am pleased to have an actual RPC to take care of adoption rather than writing directly to the cgroups.  Writing to cgroups was easy but it's nice to have something more integrated with Slurm handle it.

Comment 27 Ryan Cox 2015-11-12 08:41:10 MST

Created attachment 2423 [details]
Better patch to clean up some of the pam_slurm_adopt code

I should add that accounting with sacct seems to be working, seeing as this was the point of the bug report. :)

I already found a bug in something else... currently _user_job_count returns a stepd.  The function is supposed to return a count of jobs on the node but also returns a stepd so that it can be used if there is only one job on the node.  That way it doesn't have to iterate through the list again if there's only one job from that user.

_user_job_count didn't check for stepid==SLURM_EXTERN_CONT because it's simply trying to return the job count, so it can return a different step, such as step 0.  The new patch fixes that.  This pam module pretty much requires the use of the extern step anyway, so this shouldn't harm anything except when the admin enables this module but doesn't enable the proper prologflags and restart slurmd.  In that case, all ssh connections would be denied by default since it will appear as if the user has no jobs because there are no extern steps.  Maybe I should add something to check for PrologFlags=contain and exit with PAM_IGNORE if it's not there.  I may get around to it eventually but feel free to add it yourself if you get there first.

I attached an updated patch that replaces the second one.  The only difference is that _user_job_count now checks for stepid==SLURM_EXTERN_CONT.

Comment 28 Danny Auble 2015-11-12 10:14:39 MST

Thanks Ryan, these changes are all checked in up to commit 53a7c340513a13f.  Let me know if you see anything.  I'll try to see why cpuset's aren't set up as I would expect them to.

It looks like the issue is happening in the task/cgroup plugin in the task_cgroup_cpuset_create() function.  What it appears to do is add the pids to the cpuset cgroup.  I am thinking we will need to change the code to just add pids if this is called again.  I am guessing this will affect devices and memory as well since this is where those are also set up.  So this will probably fix Kilian's issue with devices as well.

Comment 29 Danny Auble 2015-12-01 04:30:12 MST

Ryan, I just fixed an issue in commit c16f865ec0e6 that will fix the tracking of processes that run after the ssh session closes but a child process remains around.

I'll see about the other cgroups now.

Comment 30 Danny Auble 2015-12-03 07:55:14 MST

I am going to close this one Ryan, I'll handle the adding the pid to the other cgroups in Kilian's bug 2130.

If this needs more attention please reopen.