Ticket 4718 - defunct slurmd process leaves a sleep in the step_extern cgroup
Summary: defunct slurmd process leaves a sleep in the step_extern cgroup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.2
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Duplicates: 4622
Depends on:
Blocks:
 
Reported: 2018-02-01 04:36 MST by Cineca HPC Systems
Modified: 2018-02-07 03:44 MST

See Also:
Site: Cineca
Version Fixed: 17.11.3


Attachments
contents of /etc/slurm and slurmd logs (6.36 KB, application/x-compressed-tar)
2018-02-01 04:36 MST, Cineca HPC Systems

Description Cineca HPC Systems 2018-02-01 04:36:41 MST
Created attachment 6048
contents of /etc/slurm and slurmd logs

Hi support,

We observe a lot of jobs that remain in the completing state until we kill the sleep process inside the step_extern cgroup.

In these cases, what we see on the involved nodes is a defunct slurmd:

[root@r131c17s02 ~]# ps --forest -lfe | egrep '[s]leep|[s]lurm'
1 S root     15957     1  0  80   0 - 923070 inet_c Jan23 ?       00:00:45 /usr/sbin/slurmd
1 Z root     15481 15957  0  80   0 -     0 exit   11:49 ?        00:00:00  \_ [slurmd] <defunct>
0 S root     15487     1  0  80   0 - 26973 hrtime 11:49 ?        00:00:00 sleep 1000000

[root@r131c17s02 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_29035/job_82290/step_extern/tasks 
15487
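
The check above can be repeated across every job on a node. The following is a minimal sketch, assuming the cgroup v1 cpuset layout shown in the path above (`slurm/uid_*/job_*/step_extern/tasks`). The function name `list_extern_leftovers` is made up for illustration and is not part of Slurm. PIDs can legitimately appear in step_extern while a job is still running, so a hit only matters for jobs that are stuck in the completing state.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): print "<tasks file> <pid>" for every
# PID still listed in a step_extern cgroup below the given root.
# Assumption: cgroup v1 cpuset hierarchy laid out as in the path shown above.
list_extern_leftovers() {
    local root="${1:-/sys/fs/cgroup/cpuset/slurm}"
    local tasks pid
    for tasks in "$root"/uid_*/job_*/step_extern/tasks; do
        # Skip when the glob did not match or the cgroup vanished meanwhile.
        [ -r "$tasks" ] || continue
        while read -r pid; do
            if [ -n "$pid" ]; then
                printf '%s %s\n' "$tasks" "$pid"
            fi
        done < "$tasks"
    done
}
```

Calling `list_extern_leftovers` with no argument scans the default root used in this ticket; a different root can be passed as the first argument.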

We see from the UNIX accounting logs that the step_extern slurmstepd died immediately:

[root@r131c17s02 ~]# lastcomm --command slurmstepd | grep D
slurmstepd          DX root     __         0.10 secs Thu Feb  1 11:49

[root@r131c17s02 ~]# dump-acct /var/account/pacct | grep 'Feb  1 11:49' | grep slurm
slurmd          |v3|     0.00|     0.00|     0.00|     0|     0|3558912.00|     0.00|   15481    15957|Thu Feb  1 11:49:49 2018
slurmstepd      |v3|     4.00|     6.00|    12.00|     0|     0|195904.00|     0.00|   15482        1|Thu Feb  1 11:49:49 2018
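
The PID and parent PID can be pulled out of `dump-acct` lines like the two above mechanically. This sketch assumes the pipe-separated layout shown here, with the tenth field holding the PID followed by the parent PID; `orphaned_to_init` is an illustrative name, not an existing tool.

```shell
# Hypothetical helper: read dump-acct output on stdin and print
# "<command> <pid>" for every record whose parent PID is 1.
# Assumption: field 10 (pipe-separated) is "<pid> <ppid>", as in the lines above.
orphaned_to_init() {
    awk -F'|' '
        {
            split($10, ids, " ")      # ids[1] = pid, ids[2] = ppid
            gsub(/ +$/, "", $1)       # trim padding after the command name
            if (ids[2] == 1) print $1, ids[1]
        }'
}
```

For example, `dump-acct /var/account/pacct | grep slurm | orphaned_to_init` would list only the records that were reparented to pid 1.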

So both the sleep and the slurmstepd processes turn out to be children of systemd (pid 1).

We tried to set up an UnkillableStepProgram to kill the sleep process, but the script is not invoked; we guess this is because the slurmd is defunct.
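
The manual workaround (killing whatever is left in the step_extern cgroup) could be scripted roughly as follows. This is a sketch, not a Slurm feature: `signal_cgroup_tasks` is a made-up name, it defaults to a dry run, and it should only be pointed at the `tasks` file of a job that is already stuck in the completing state.

```shell
# Hypothetical helper: send a signal to every PID listed in a cgroup tasks file.
# Usage: signal_cgroup_tasks <tasks-file> [signal] [--force]
# Without --force it only prints what it would do.
signal_cgroup_tasks() {
    local tasks="$1" sig="${2:-TERM}" force="${3:-}"
    local pid
    [ -r "$tasks" ] || { echo "cannot read $tasks" >&2; return 1; }
    while read -r pid; do
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # skip blank or non-numeric lines
        esac
        if [ "$force" = "--force" ]; then
            kill -s "$sig" "$pid" 2>/dev/null || echo "could not signal $pid" >&2
        else
            echo "would send SIG$sig to $pid"
        fi
    done < "$tasks"
}
```

Running it first without `--force` shows which PIDs would be signalled, which is a cheap safeguard against hitting a job that is still running.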

We attach /etc/slurm dir (slurm.tgz) and slurmd logs.

Thanks
Ale
Comment 1 Tim Wickberg 2018-02-01 13:00:06 MST
I believe this is the same underlying issue as in bug 4634, and should be resolved in 17.11.3 (due to be released this afternoon) and up.

If it's alright with you, I'd propose we close this as a duplicate of that bug; if after upgrading (or applying those referenced patches directly) you're still seeing issues we can re-open this, or you can file a new separate ticket.

- Tim
Comment 3 Cineca HPC Systems 2018-02-02 03:58:04 MST
Hi Tim,

thanks for the info. It's ok for us to close the bug. We can schedule an upgrade to 17.11.3 next week and we'll let you know if it solves this bug.

I have just 2 questions:

* Can I take a look at bug 4634, please? At the moment your site doesn't give me the access ;)

* We didn't receive the email of your comment. I checked the email preferences and they seem ok. Could you check what's wrong, please?

Thank you very much
Ale
Comment 4 Tim Wickberg 2018-02-02 10:17:58 MST
(In reply to hpc-sysmgt-info from comment #3)
> Hi Tim,
> 
> thanks for the info. It's ok for us to close the bug. We can schedule an
> upgrade to 17.11.3 next week and we'll let you know if it solves this bug.
> 
> I have just 2 questions:
> 
> * Can I take a look at bug 4634, please? At the moment your site doesn't
> give me the access ;)

Ah, sorry about that. That one is tagged private unfortunately.

The relevant patches are in commit d2c838070.

However, we have found a few related issues, and are working on an additional patch that closes a more likely source of these issues. That should be in the 17.11.3 release which we expect to have out early next week.

> * We didn't receive the email of your comment. I checked the email
> preferences and they seem ok. Could you check what's wrong, please?

There does seem to have been a small hiccup getting that email out. I do see that email appears to be flowing (I'd switched into your account briefly to double-check some of your preferences, and can see that alert email made it over), and I'm verifying this response gets sent over to your mail server.

- Tim
Comment 5 Tim Wickberg 2018-02-02 10:21:32 MST
Trying this again after one small tweak to your account. Comment #4 was not sent either, so I'm replying here again: please see the full text in comment #4 above.
Comment 6 Tim Wickberg 2018-02-02 10:25:03 MST
(In reply to Tim Wickberg from comment #5)
> Trying this again after one small tweak to your account. Comment #4 was not
> sent either, so I'm replying here again:

Please take a look at Comment #4 when you get a chance.

This is one more test message; this should hopefully get through to you.

I've removed one checkbox in your email preferences setting that was stopping you from getting email. 

Having "The bug is in the UNCONFIRMED state" checked on the Reporter column is what has been skipping messages sent to you. I'm not sure if you intentionally enabled that or not?
Comment 7 Cineca HPC Systems 2018-02-02 10:38:36 MST
Hi Tim
I confirm that comment #6 arrived by mail. I had checked the box because I misunderstood its meaning; thanks for fixing it.
We will wait for 17.11.3 to be released.

thank you very much
ale
Comment 8 Tim Wickberg 2018-02-06 15:32:19 MST
This is fixed with commit 108502e9504, and will be in 17.11.3 when released.

Please re-open if you have any further questions, or still have problems after upgrading.

cheers,
- Tim
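
After upgrading, a quick check that a node runs at least the fixed release might look like the sketch below. It assumes that `slurmd -V` prints a line of the form `slurm 17.11.3` and that GNU `sort -V` is available; `version_at_least` is an illustrative helper, not a Slurm command.

```shell
# Hypothetical helper: succeed if version $2 is >= minimum version $1.
# Relies on the -V (version sort) option of GNU sort.
version_at_least() {
    local min="$1" have="$2"
    [ "$(printf '%s\n%s\n' "$min" "$have" | sort -V | head -n 1)" = "$min" ]
}

# Example use on a node (assumes output like "slurm 17.11.3"):
#   have=$(slurmd -V | awk '{print $2}')
#   version_at_least 17.11.3 "$have" || echo "still on $have"
```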
Comment 9 Tim Wickberg 2018-02-06 15:33:20 MST
*** Ticket 4622 has been marked as a duplicate of this ticket. ***
Comment 10 Alejandro Sanchez 2018-02-07 03:44:05 MST
*** Ticket 4733 has been marked as a duplicate of this ticket. ***