Ticket 5549 - Jobs fail to complete, leave behind sleep zombie horde.
Summary: Jobs fail to complete, leave behind sleep zombie horde.
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 18.08.7
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-08-10 10:16 MDT by John Hanks
Modified: 2019-07-25 13:36 MDT
CC List: 1 user

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
SCG SLURM config (7.12 KB, application/gzip), 2018-08-13 16:19 MDT, John Hanks
logs for uv300 for August (861.58 KB, application/gzip), 2018-08-16 09:44 MDT, John Hanks
slurmd log from node with zombies (13.12 MB, application/gzip), 2018-08-23 15:32 MDT, John Hanks
pstree of processes on node (98.84 KB, text/plain), 2018-08-23 15:33 MDT, John Hanks
ps -aux of processes (621.93 KB, text/plain), 2018-08-23 15:33 MDT, John Hanks
pstree -psal (81.38 KB, text/plain), 2018-11-14 15:26 MST, John Hanks
pstree -psal (38.15 KB, text/plain), 2018-11-15 09:17 MST, John Hanks

Description John Hanks 2018-08-10 10:16:58 MDT
Hi,

I have a special node, an SGI UV300, which recently (a couple of months ago) started having an issue with jobs completing. After a reboot, things work great for some period of time, days or in some cases a few weeks, until jobs begin to fail to complete and the slurmstepd processes leave behind zombie sleep processes. For example:

Using pstree, here is an example of a job that has exited, but its slurmstepd is left hanging:

[root@sgiuv300-srcf-d10-01 ~]# pstree -a -c -p -S 421660
slurmstepd,421660                      
  ├─sleep,421678 100000000
  ├─{slurmstepd},421665
  ├─{slurmstepd},421666
  ├─{slurmstepd},421667
  └─{slurmstepd},421677

Here's a backtrace from that slurmstepd:

(gdb) bt
#0  0x00002b6e1f95722a in wait4 () from /usr/lib64/libc.so.6
#1  0x000000000040fbff in _spawn_job_container (job=0x1690ec0) at mgr.c:1102
#2  job_manager (job=job@entry=0x1690ec0) at mgr.c:1211
#3  0x000000000040ca57 in main (argc=1, argv=0x7ffe93055fe8) at slurmstepd.c:172

The zombies seem to result after this message is logged:

Aug  9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***

and for the wait4() slurmstepd, the slurm logs look completely normal. 


Since none of our other nodes are having this behavior I'm inclined to blame the UV300 (a node which is not my friend). But the problem doesn't affect anything else on the machine, just stuff started via SLURM. Since the end result is slurmstepd in wait4() or zombie sleep processes, I'm not quite sure where to prod for more information. Suggestions for the next troubleshooting step are appreciated.
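
For reference, a quick way to enumerate the zombies and their parent pids is a small scan of /proc along these lines (a rough sketch assuming the standard Linux /proc/<pid>/stat layout; not part of Slurm):

/* zscan.c - list zombie (state 'Z') processes and their parent pids.
 * Rough diagnostic sketch; compile with: cc -o zscan zscan.c */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    char path[64], buf[4096];

    if (!proc) {
        perror("opendir /proc");
        return 1;
    }
    while ((de = readdir(proc))) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                    /* only numeric pid directories */
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        FILE *fp = fopen(path, "r");
        if (!fp)
            continue;                    /* process may have exited already */
        if (fgets(buf, sizeof(buf), fp)) {
            /* fields after the last ')' are: state ppid ...
             * (the comm field itself may contain parentheses) */
            char *p = strrchr(buf, ')');
            char state = 0;
            int ppid = 0;
            if (p && sscanf(p + 1, " %c %d", &state, &ppid) == 2 && state == 'Z')
                printf("zombie pid %s, parent pid %d\n", de->d_name, ppid);
        }
        fclose(fp);
    }
    closedir(proc);
    return 0;
}

The parent pids from that output can then be matched against pstree output like the one above.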

Other details of possible interest. The first instance of this problem I can recall was with 17.11.7 (it may have happened before and gone unnoticed, as the system does get rebooted more often than I'd like). The node was upgraded to CentOS 7.5 from 7.4, but worked fine for a while after that upgrade.

jbh
Comment 2 Felip Moll 2018-08-13 06:47:02 MDT
Hi John,

This issue is very similar to your other bug 5443. Isn't it a duplicate?

In case you think it isn't:

It would be useful to get a "thread apply all bt full" to see the full dump of GDB in all the threads. In any case I guess the process failed to send the proper signal to slurmstepd for some reason, so the bt may be correct.

I would also need to see an "scontrol show job" for the CG job and the full slurmd+slurmctld log.

If you can attach your up-to-date slurm.conf + cgroup.conf it would be great.

Finally if you have prolog/epilog scripts, attach them also here please.


To do a quick test, can you submit a:

srun /bin/true

and see if it reproduces.

Are you really running 17.11.9? Kilian commented that you downgraded to 17.11.8, but I don't know on which system.
Comment 3 John Hanks 2018-08-13 16:17:48 MDT
(In reply to Felip Moll from comment #2)
> Hi John,
> 
> This issue is very similar to your other bug 5443. Isn't it a duplicate?

Sorry, I should have mentioned this is a different cluster. Kilian's bug is with Sherlock, this system is the SCG cluster. You can think of us as a poor little system that lives in subsidised housing on Sherlock's street. 

> In case you think it isn't:

It's possible they share underlying causes, but the behavior I'm seeing is different. My jobs either complete but leave behind the zombie sleeps, or they stay in Running state forever, even after the slurmstepd has been killed (again leaving behind an optional zombie sleep).

> It would be useful to get a "thread apply all bt full" to see the full dump
> of GDB in all the threads. In any case I guess the process failed to send
> the proper signal to slurmstepd for some reason, so the bt may be correct.

Here's a sample; I'll need to increase log levels and send more later. Job 3092804 ended, but its slurmstepd is still hanging around and it's not in completing state:

Logs: 
/srv/logs/messages-20180810:Aug 10 10:07:24 sgiuv300-srcf-d10-01 slurmd[36881]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job3092804/slurm_script
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: Batch JobId=3092804 missing from node 0 (not found BatchStartTime after startup), Requeuing job
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x1 NodeCnt=1 WTERMSIG 126
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x1 NodeCnt=1 cancelled by node failure
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node failure
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x8000 NodeCnt=1 done
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: Requeuing JobID=3092804 State=0x0 NodeCnt=0
/srv/logs/messages-20180810:Aug 10 12:20:36 frankie slurmctld[121446]: sched: Allocate JobID=3092804 NodeList=sgiuv300-srcf-d10-01 #CPUs=16 Partition=nih_s10
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task_p_slurmd_batch_request: 3092804
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task/affinity: job 3092804 CPU input mask for node: 0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000FFFFFFFF000000000000000000000000000000000000000000000000000000000000000000000000000000000
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task/affinity: job 3092804 CPU final HW mask for node: 0x0000000000000000000000000000000000000000000003FFFC00000000000000000000000000000000000000000000000000000000000000000000000000000000000003FFFC0000000000000000000000000000000000000000
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: _run_prolog: prolog with lock for job 3092804 ran for 0 seconds
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmstepd[377334]: task/cgroup: /slurm/uid_22292/job_3092804: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmstepd[377334]: task/cgroup: /slurm/uid_22292/job_3092804/step_extern: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmd[36884]: Launching batch job 3092804 for UID 22292
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmstepd[377399]: task/cgroup: /slurm/uid_22292/job_3092804: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmstepd[377399]: task/cgroup: /slurm/uid_22292/job_3092804/step_batch: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180811:Aug 11 12:39:04 frankie slurmctld[121446]: _slurm_rpc_kill_job: REQUEST_KILL_JOB job 3092804 uid 22292
/srv/logs/messages-20180811:Aug 11 12:39:04 sgiuv300-srcf-d10-01 slurmstepd[377399]: error: *** JOB 3092804 ON sgiuv300-srcf-d10-01 CANCELLED AT 2018-08-11T12:39:04 ***
/srv/logs/messages-20180811:Aug 11 12:42:07 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092804 Nodelist=sgiuv300-srcf-d10-01
/srv/logs/messages-20180811:Aug 11 12:43:35 sgiuv300-srcf-d10-01 slurmstepd[377334]: error: *** EXTERN STEP FOR 3092804 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-11T12:43:34 DUE TO JOB NOT ENDING WITH SIGNALS ***
/srv/logs/messages-20180811:Aug 11 12:43:35 sgiuv300-srcf-d10-01 slurmstepd[377399]: error: *** JOB 3092804 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-11T12:43:34 DUE TO JOB NOT ENDING WITH SIGNALS ***
/srv/logs/messages-20180811:Aug 11 12:43:35 frankie slurmctld[121446]: error: slurmd error running JobId=3092804 on node(s)=sgiuv300-srcf-d10-01: Kill task failed
/srv/logs/messages-20180811:Aug 11 12:43:35 frankie slurmctld[121446]: cleanup_completing: job 3092804 completion process took 271 seconds


When I try to attach with gdb, gdb hangs forever (or at least up to the 15 minutes I was willing to wait). Tried this on several of these with the same result.


> I would also need to see an "scontrol show job" for the CG job and the full
> slurmd+slurmctld log.

Job has finished, so nothing to show. I'll need to increase the log level and turn on logging to a file, which means I'll have this for you next time.

> If you can attach your up-to-date slurm.conf + cgroup.conf it would be great.
> 
> Finally if you have prolog/epilog scripts, attach them also here please.

Will attach these shortly.

> Are you really running 17.11.9? Kilian commented that you downgraded to
> 17.11.8, but I don't know in which system.

SCG doesn't downgrade. No retreat, no surrender.

Thanks,

jbh
Comment 4 John Hanks 2018-08-13 16:19:25 MDT
Created attachment 7591 [details]
SCG SLURM config

Current SLURM config on SCG.
Comment 5 Felip Moll 2018-08-16 06:30:09 MDT
At first glance your configuration is fine.

I'd also like to see your slurmd log from node sgiuv300-srcf-d10-01.

The job was requeued for some reason:

/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node failure

Thanks
Comment 6 Felip Moll 2018-08-16 06:33:15 MDT
(In reply to Felip Moll from comment #5)
> At a first glance your configuration is fine.
> 
> I'd like to see also your slurmd log in the node sgiuv300-srcf-d10-01.
> 
> The job was requeued by some reason:
> 
> /srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]:
> _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node
> failure
> 
> Thanks

Ah sorry, I see your logs are centralized.

In any case I need more history; I need to see what happened to the node before that time and why it showed a node failure. Also, can you analyze the system logs for any problem around 10th Aug 10:00 and before?
Comment 7 John Hanks 2018-08-16 08:10:34 MDT
(In reply to Felip Moll from comment #6)
> (In reply to Felip Moll from comment #5)
> > The job was requeued by some reason:
> > 
> > /srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]:
> > _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node
> > failure

When the zombie processes start to accumulate, the node eventually reaches a condition where no new jobs will start. At that point it is rebooted and any jobs still in Running state (although there are no processes associated with them) get requeued due to node failure. 

> 
> Ah sorry, I see your logs are centralized.
> 
> In any case I need more history, I need to see what happened to the node
> before that time and see why it shown a node failure. Also, can you analyze
> the system logs for any problem around 10th Aug 10:00 and before?

Our workload shifted this week from lots of small jobs turning over relatively quickly to a small number of very large jobs and not much turnover. Currently the node is working fine. When I get some time, hopefully next week, I'll crank up the logging verbosity and try to reproduce with a flood of small test jobs.
Comment 8 Felip Moll 2018-08-16 09:14:52 MDT
> Our workload shifted this week from lots of small jobs turning over
> relatively quickly to a small number of very large jobs and not much
> turnover. Currently the node is working fine. When I get some time,
> hopefully next week, I'll crank up the logging verbosity and try to
> reproduce with a flood of small test jobs.

Ok, don't you have the older logs in the meantime?
Comment 9 John Hanks 2018-08-16 09:42:19 MDT
(In reply to Felip Moll from comment #8)
> > Our workload shifted this week from lots of small jobs turning over
> > relatively quickly to a small number of very large jobs and not much
> > turnover. Currently the node is working fine. When I get some time,
> > hopefully next week, I'll crank up the logging verbosity and try to
> > reproduce with a flood of small test jobs.
> 
> Ok, don't you have the older logs in the meantime?

Only what was logged at the info level via syslog. Which means I have lots of logs like the ones for the job I posted here earlier, so I can increase that sample size. Will upload logs for August. During this period the node was rebooted to clear the problem on:

1st @ 8:00 AM
5th @ 6:30 AM
9th @ 6:00 AM
9th @ 5:30 PM
10th @ 10:20 AM
13th @ 3:15 PM
Comment 10 John Hanks 2018-08-16 09:44:01 MDT
Created attachment 7620 [details]
logs for uv300 for August
Comment 11 Felip Moll 2018-08-17 02:23:12 MDT
From what I see, at some point there is a slurmstepd in uninterruptible sleep (D state) before the problem starts.

Aug  9 20:12:36 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092791 Nodelist=sgiuv300-srcf-d10-01
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd      D ffff96cac7dd0000     0 377005      1 0x00000080
Aug  9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***

Aug  9 20:14:12 frankie slurmctld[121446]: update_node: node sgiuv300-srcf-d10-01 reason set to: Kill task failed
Aug  9 20:14:12 frankie slurmctld[121446]: update_node: node sgiuv300-srcf-d10-01 state set to DRAINING

This could point to some I/O issue. What filesystems do you mount on this server? Can you check that everything is OK with filesystem I/O?

I also see that the node stops responding until reboot:
Aug  9 21:56:41 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092808 Nodelist=sgiuv300-srcf-d10-01
Aug 10 09:51:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding
Aug 10 09:56:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding
Aug 10 10:01:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding

Also, I'm curious about the "Brain damage" reason: how do you detect that? I guess that's being done in the NHC script.

Aug  5 06:27:11 frankie slurmctld[23546]: update_node: node sgiuv300-srcf-d10-01 reason set to: REBOOTME: Brain damage.
Aug  5 06:27:11 frankie slurmctld[23546]: update_node: node sgiuv300-srcf-d10-01 state set to DRAINING


Finally, can you set this in slurm.conf?:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

Currently there are two mechanisms for enforcing memory: the task/cgroup and jobacctgather plugins. The first is enabled by defining ConstrainRAMSpace=yes or ConstrainSwapSpace=yes in cgroup.conf; it kills jobs when the cgroup memory stat exceeds the limit. The second is enabled by default and kills jobs if MemLimitEnforce=yes, or individual steps if JobAcctGatherParams is not set to NoOverMemoryKill, by reading the /proc pid stats of all the pids of the job.

Having both mechanisms enabled can cause problems; we are going to address that in 18.08 but not in older versions. In your case you have both mechanisms enabled by default, so I would encourage you to set the values mentioned above and keep only the cgroup mechanism.
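
As a sketch of what the two config files would look like with only the cgroup mechanism left enabled (illustrative lines only; keep the rest of your existing settings):

# slurm.conf: disable the jobacctgather-based memory enforcement
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

# cgroup.conf: keep only the cgroup-based enforcement
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes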

I see some OOM kills, so you may eventually hit such a situation.
Comment 12 John Hanks 2018-08-17 11:42:58 MDT
(In reply to Felip Moll from comment #11)

> This could make me think about some I/O issues. What filesystems do you
> mount in this server? Can you check everything is ok about filesystem I/O?

There haven't been any I/O issues lately. The filesystems are all either local disk or NFS 3 and 4 mounted. As far as I can tell there were no mount issues, but I'll add that to the list of things for close inspection next time. 

> 
> I also see that the node stops responding until reboot:
> Aug  9 21:56:41 frankie slurmctld[121446]: Resending TERMINATE_JOB request
> JobId=3092808 Nodelist=sgiuv300-srcf-d10-01
> Aug 10 09:51:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
> Aug 10 09:56:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
> Aug 10 10:01:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
> 
> Also curious about the "Brain damage", how do you detect that? I guess
> that's being done in NHC script.
>
> Aug  5 06:27:11 frankie slurmctld[23546]: update_node: node
> sgiuv300-srcf-d10-01 reason set to: REBOOTME: Brain damage.
> Aug  5 06:27:11 frankie slurmctld[23546]: update_node: node
> sgiuv300-srcf-d10-01 state set to DRAINING
> 

As the number of zombies goes up, the node eventually reaches a state where it no longer starts jobs. But before then if/when I notice a lot of zombies I'll drain the node with the Reason "REBOOTME: Brain damage" in the hopes that some of the current jobs might finish ok. I have a cron job that looks for drained nodes matching "REBOOTME" and power cycles them. "Brain damage" is a purely subjective diagnosis, and fortunately for this batch of logs I had chosen a term that was safe for work.

> 
> Finally, can you set this in slurm.conf?:
> 
> MemLimitEnforce=no
> JobAcctGatherParams=NoOverMemoryKill

Done. OOM kills are the most frequent cause of death for jobs; it's primarily a genomics workload and it's difficult to predict memory usage for most of the codes we use.
Comment 13 Felip Moll 2018-08-20 09:09:51 MDT
> Done. OOM kills are the most frequent cause of death for jobs, it's
> primarily a genomics workload and it's difficult to predict memory usage for
> most codes we use.

John,

Has the situation improved in any way after changing the parameters mentioned above?

Setting the ctld and slurmd debug levels to "debug", and setting debugflags too, could help; can you set them and check again?

Tell me if you see any I/O, network (dropped packets?), or system errors at the same time, and run "sdiag" at the time it is happening.

thanks
Comment 14 John Hanks 2018-08-20 10:08:42 MDT
Hi Felip,

It's been pretty quiet here and the workload has shifted to an assembler I don't think we were running before last week. Since the last reboot the node is completely stable. If/when this recurs, I'll try to correlate it to a specific user/job. Maybe the problem here is something some specific app and/or user is doing that kicks slurmstepd off the rails. 

jbh
Comment 15 Felip Moll 2018-08-20 10:14:58 MDT
(In reply to John Hanks from comment #14)
> Hi Felip,
> 
> It's been pretty quiet here and the workload has shifted to an assembler I
> don't think we were running before last week. Since the last reboot the node
> is completely stable. If/when this recurs, I'll try to correlate it to a
> specific user/job. Maybe the problem here is something some specific app
> and/or user is doing that kicks slurmstepd off the rails. 
> 
> jbh

Ok John,

I'll keep this open for at least 2 weeks, let's see if it happens again.
Comment 16 John Hanks 2018-08-23 15:31:47 MDT
I caught a zombie attack in progress. Currently have a job which is losing track of some forked R processes. Uploading the slurmd log (debug5) and ps and pstree output from the system. 

jbh
Comment 17 John Hanks 2018-08-23 15:32:28 MDT
Created attachment 7691 [details]
slurmd log from node with zombies
Comment 18 John Hanks 2018-08-23 15:33:24 MDT
Created attachment 7692 [details]
pstree of processes on node
Comment 19 John Hanks 2018-08-23 15:33:50 MDT
Created attachment 7693 [details]
ps -aux of processes.
Comment 20 John Hanks 2018-08-23 15:42:36 MDT
Here is a 'thread apply all bt full' from a slurmstepd that has lost a child to the zombie process.


(gdb) thread apply all bt full

Thread 7 (Thread 0x2ad2d158b700 (LWP 260138)):
#0  0x00002ad2d2698995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ad2d1719cce in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:227
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x2ad2d7a72700 (LWP 260139)):
#0  0x00002ad2d2698d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ad2d1713c91 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:204
        err = <optimized out>
        i = 4
        now = 1535060418
        diff = <optimized out>
        tvnow = {tv_sec = 1535055855, tv_usec = 835349}
        abs = {tv_sec = 1535060419, tv_nsec = 835349000}
        __func__ = "_timer_thread"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x2ad2d7b73700 (LWP 260140)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2ad2e00009f0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0xd325a0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2e00009f0
        map = 0x2ad2e0000950
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041cbd8 in _msg_thr_internal (job_arg=0xd47d30) at req.c:245
        job = 0xd47d30
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2ad2e4d40700 (LWP 260141)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=3, pfds=0x2ad2e8000ac0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x2ad2e80008d0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2e8000ac0
        map = 0x2ad2e8000af0
        maxnfds = 2
        nfds = 3
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x00002ad2e4a2e2cc in _agent (unused=<optimized out>) at agent.c:326
        tree_listen_obj = <optimized out>
        task_obj = <optimized out>
        i = <optimized out>
        __func__ = "_agent"
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2ad2e4e41700 (LWP 260144)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2e4211962 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0xd8a3f0
        ret = <optimized out>
        res = 0
        fds = {{fd = 17, events = 3, revents = 0}, {fd = 18, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.
---Type <return> to continue, or q <return> to quit---

Thread 2 (Thread 0x2ad2e4f42700 (LWP 260145)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=4, pfds=0x2ad2f0000910) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0xd4c9a0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2f0000910
        map = 0x2ad2f00008d0
        maxnfds = 4
        nfds = 4
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x0000000000413bcb in _io_thr (arg=0xd47d30) at io.c:1479
        job = 0xd47d30
        set = {__val = {4097, 0 <repeats 15 times>}}
        rc = <optimized out>
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2ad2d1463000 (LWP 260097)):
#0  0x00002ad2d296e22a in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000040e85f in _wait_for_any_task (job=job@entry=0xd47d30, waitflag=waitflag@entry=true) at mgr.c:2042
        t = <optimized out>
        rc = <optimized out>
        status = 0
        pid = <optimized out>
        completed = 0
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, 
            __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 0, 
            __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {
            ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 4294836224, __ru_msgrcv_word = 4294836224}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {
            ru_nvcsw = 214748364800, __ru_nvcsw_word = 214748364800}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}
        tmp_env = <optimized out>
        task_offset = 0
        __func__ = "_wait_for_any_task"
#2  0x00000000004100b4 in _wait_for_all_tasks (job=0xd47d30) at mgr.c:2153
        rc = <optimized out>
        tasks_left = 1
        i = 0
#3  job_manager (job=job@entry=0xd47d30) at mgr.c:1350
        rc = 0
        io_initialized = true
        ckpt_type = 0xd2f170 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#4  0x000000000040ca57 in main (argc=1, argv=0x7ffc98d21018) at slurmstepd.c:172
        cli = 0xd33570
        self = 0xd402d0
        msg = 0xd3e3a0
        rc = 0
        launch_params = 0x0
        __func__ = "main"
(gdb)
Comment 21 Felip Moll 2018-08-24 14:12:11 MDT
Hi John, your logs have been helpful.

I spent some time analyzing all your information and our code. It seems clear that what's going on here, as you already guessed, is that R processes die and remain in Z state because the parent process of R doesn't wait for them. Slurm doesn't seem to monitor these "dead" pids, but the jobacctgather plugin tries to access /proc/<pid>, and a lot of errors like this (674791 matches in your slurm.log) are logged:

slurm.log:[2018-08-23T14:13:15.848] [3360552.0] debug3: jobacct_gather_linux: unable to read requested pattern in /proc/261412/status

It accesses these status files because the pids are in the tree of the job's main pid.

I've taken job 3360552, which has multiple steps, as an example; one of them looks like:

  |-slurmstepd,260097                           
  |   |-qiime,260146 /home/abganz/.conda/envs/qiime2/bin/qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --output-dir dada2-209 --p-n-threads 15 --p-trunc-len-f 283 --p-trunc-len-r 209
  |   |   `-R,261285 --slave --no-restore --file=/home/abganz/.conda/envs/qiime2/bin/run_dada_paired.R --args /tmp/tmpwb3nmuzq/forward /tmp/tmpwb3nmuzq/reverse /tmp/tmpwb3nmuzq/output.tsv.biom /tmp/tmpwb3nmuzq/track.tsv /tmp/tmpwb3nmuzq/f
ilt_f /tmp/tmpwb3nmuzq/filt_r 283 209 0 0 2.0 2 consensus 1.0 15 1000000
  |   |      |-(R,261412)
  |   |      |-{R},301848
  |   |      |-{R},301849
  |   |      |-{R},301850
  |   |      |-{R},301851
  |   |      |-{R},301852
  |   |      |-{R},301853
  |   |      |-{R},301854
  |   |      |-{R},301855
  |   |      |-{R},301856
  |   |      |-{R},301857
  |   |      |-{R},301858
  |   |      |-{R},301859
  |   |      |-{R},301860
  |   |      `-{R},301861
  |   |-{slurmstepd},260138
  |   |-{slurmstepd},260139
  |   |-{slurmstepd},260140
  |   |-{slurmstepd},260141
  |   |-{slurmstepd},260144
  |   `-{slurmstepd},260145


As you can see, the parent of R is "qiime". Slurmstepd seems to wait for qiime and not for the sub-forked R processes.
qiime doesn't seem to wait for R, so if one of these forked R processes dies, it becomes a zombie.

Your ps aux for R process 261412 shows:

abganz   261412 83.8  0.0      0     0 ?        Z    13:27  51:03 [R] <defunct>

Other processes related to this job are:
root     259214  0.0  0.0 385504  8188 ?        Sl   13:24   0:02 slurmstepd: [3360552.extern]
root     259235  0.0  0.0 385904  8628 ?        Sl   13:24   0:01 slurmstepd: [3360552.batch]
abganz   259254  0.0  0.0 113188  1568 ?        S    13:24   0:00 /bin/bash /var/spool/slurm/slurmd/job3360552/slurm_script
root     260097  0.1  0.0 521056  8636 ?        Sl   13:24   0:03 slurmstepd: [3360552.0]


One cause of your system getting stuck could be that qiime doesn't finish, so the job continues and the Z processes keep increasing.
At a certain point the kernel process table may become full, since the Z pids are not removed from the system if no wait call is issued, and since the qiime process keeps working the job doesn't die and the zombies are not inherited by init.

From "man 2 wait":

A child that terminates, but has not been waited for becomes a "zombie".  The kernel maintains a minimal set of information about the zombie process
(PID, termination status, resource usage information) in order to allow  the parent to later perform a wait to obtain information about the child. 
As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be
possible to create further processes.  If a parent process terminates, then its "zombie" children (if any) are adopted by init(1), which automatically 
performs a wait to remove the zombies.
    

This is my first hypothesis; now I will try to reproduce this situation, but in any case I'd strongly suggest that your user abganz fix the qiime software.
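
To illustrate the mechanism with something minimal (a self-contained sketch of the generic fork-without-wait pattern, not qiime's actual code):

/* zombie.c - a parent that forks and never calls wait() leaves the
 * exited child in Z state until the parent itself exits and init
 * reaps it. Build with: cc -o zombie zombie.c */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0)
        _exit(0);                  /* child exits immediately */

    /* Parent never waits: for the next 60 seconds the child shows up
     * as "Z" (defunct) in ps and occupies a process-table slot. */
    printf("check the child with: ps -o pid,ppid,stat,comm %d\n", (int)pid);
    sleep(60);
    return 0;                      /* on exit, init reaps the zombie */
}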


Have you tried 'qiime' outside of slurm? Does it generate Z processes?


This theory doesn't directly explain the following logged errors:

Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd      D ffff96cac7dd0000     0 377005      1 0x00000080
Aug  9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***

But given that the kernel process table seems to be filling up, we cannot rule out a relation.
Comment 22 John Hanks 2018-08-24 14:47:05 MDT
We have cycled back to the ugly part of this. These qiime jobs (at least the ones I have been involved with) run fine on our "normal" nodes; this problem only occurs on the UV300. And qiime is just today's example: previous problems were with different applications, and the abandoned children were the sleep 1000000 processes that slurmstepd starts.

We've had problems before which stemmed from something counting up the CPUs on the UV300 and not being prepared to handle the large number. Is there anything about slurmstepd or slurmd for which 720 threads is a corner case? We ran into this with ZFS at one point and had to force it to use 128 instead of the actual CPU count. Just a thought.

jbh
Comment 23 Felip Moll 2018-08-27 08:52:07 MDT
> Is there
> anything about slurmstepd or slurmd for which 720 threads is a corner case?
> We ran into this with ZFS at one point and had to force it to use 128
> instead of the actual CPU count. Just a thought.
> 

I will look into that, but it seems unlikely.

What OS/Kernel version are you on?
Comment 24 John Hanks 2018-08-27 09:03:46 MDT
(In reply to Felip Moll from comment #23)
> 
> What OS/Kernel version are you on?

Stock CentOS 7.5 3.10.0-862.3.2.el7.x86_64

One other thing to note: those previous R zombies seem to have cleaned themselves up over the weekend. There are no zombie processes on the node now. I'm not sure if the previously observed sleep zombies would have done this; in those cases the node eventually ground to a halt and had to be rebooted.

jbh
Comment 25 Felip Moll 2018-08-27 09:19:03 MDT
(In reply to John Hanks from comment #24)
> (In reply to Felip Moll from comment #23)
> > 
> > What OS/Kernel version are you on?
> 
> Stock CentOS 7.5 3.10.0-862.3.2.el7.x86_64
> 
> One other thing to note, those previous R zombies seem to have cleaned
> themselves up over the weekend. The are no zombie processes on the node now.
> I'm not sure if the previously observed sleep zombies would have done this,
> in those cases the node eventually ground to a halt and had to be rebooted.
> 
> jbh

I am starting to think that the zombies are just a side effect of another issue and may misguide us in the diagnosis.
A zombie process should not be harmful as long as the main process ends correctly; if the main process does a wait() this is fine, and if not, init will do the wait and clean up the zombies. It seems the zombies are being created for some bad reason, but the real problem is that slurmstepd doesn't end.

Still under investigation, but bug 5103 looks very similar to what's happening to you with the hung slurmstepd processes, and there are a lot of related bugs, all of them happening under RHEL 7.[4-5] and glibc 2.17-xxx.

I guess your glibc is 2.17-222.
Comment 26 John Hanks 2018-08-27 09:21:50 MDT
Yes, our glibc is glibc-2.17-222.el7.x86_64.
Comment 28 Felip Moll 2018-08-28 06:12:52 MDT
After looking deeper into your logs, I see that there does not seem to be any real hung slurmstepd.

I correlated the pstree slurmstepd processes (not LWPs) with ps aux, which gave me 145 slurmstepd processes corresponding to extern, batch and .0 steps. After that I grabbed the jobid of each extern step and correlated it with the job_script processes found in ps aux. All of them had one running script, except for job 3360118, which pstree showed to be an interactive bash that was running "less". There were 71 jobs.

This means that all jobs in your output were running. If you think that isn't the case, please show me the exact hung process and job id, because I am not seeing it.
Also, the bt that you show in comment 20 is normal for a .0 step.

In your first comment in this bug you do a "pstree -a -c -p -S 421660", but the logged error "NOT ENDING WITH SIGNAL" refers to an extern step with pid 377480.

Can you please define what you mean by "a slurmstepd that has lost a child to the zombie process"?

Note that a job may have 3 or more slurmstepds. For example, this corresponds to a single job on my system, and to the extern, batch and .0 steps:

           |-slurmstepd(10412)-+-sleep(10416)
           |                   |-{acctg_prof}(10414)
           |                   |-{acctg}(10413)
           |                   `-{slurmstepd}(10415)
           |-slurmstepd(10419)-+-slurm_script(10423)---srun(10424)-+-srun(10427)
           |                   |                                   |-{srun}(10428)
           |                   |                                   |-{srun}(10429)
           |                   |                                   |-{srun}(10430)
           |                   |                                   `-{srun}(10431)
           |                   |-{acctg_prof}(10421)
           |                   |-{acctg}(10420)
           |                   `-{slurmstepd}(10422)
           |-slurmstepd(10435)-+-sleep(10441)
           |                   |-{acctg_prof}(10437)
           |                   |-{acctg}(10436)
           |                   |-{slurmstepd}(10438)
           |                   |-{slurmstepd}(10439)
           |                   `-{slurmstepd}(10440)



Sorry for starting again from the beginning, but I need to have everything clearly identified.


In the meantime, you may do these checks:


a) Please do a 'grep -r "possible file system problem" slurm.log'; you will see lots of:

[2018-08-23T08:46:52.792] Warning: slurmstepd startup took XX sec, possible file system problem or full memory

which reinforces the theory of I/O problems. Maybe it is just a warning, but there's a considerable slowdown at some point, as you will see.

b) Are the system limits tuned properly, especially nr_open_files? systemd can change the slurmd limits; take a look at /proc/$(pidof slurmd)/limits

Check these pages especially, if you haven't done so already:
https://slurm.schedmd.com/high_throughput.html
https://slurm.schedmd.com/big_sys.html

c) In comment 3 you said:

"but leave behind the zombie sleeps or they stay in Running state forever, even after the slurmstepd has been killed (again leaving behind an optional zombie sleep.)"

Do you always see them in squeue? I need a clearer case; when you identify one I need:

- jobid
- squeue
- pstree -p
- ps aux
- thread apply all bt full of the related slurmstepd processes, meaning the pids from ps aux | grep $jobid corresponding to the extern, batch and .0 steps, provided that the user task is still not running.


Thanks John
Comment 29 John Hanks 2018-08-28 07:43:16 MDT
(In reply to Felip Moll from comment #28)
> This means that all jobs in your output were running. If you think it isn't
> this way, please, show me the exact hung process and job id, because I am
> not seeing it.

This is expanding my understanding of zombies. I'm still not sure how slurmstepd losing track of its children is ok, even if they do eventually terminate successfully, but since these jobs did finish, that makes them unlike the previous jobs which left behind the sleep processes and slurmstepds in wait() or in a state where I couldn't attach with gdb.

> Can you please define what do you mean by "a slurmstepd that has lost a
> child to the zombie process"?

In the originally observed cases with sleep, the processes did not eventually clear up and the associated jobs remained in Running state with a sleep zombie and sometimes a hung slurmstepd. Eventually the node would be reporting it was fully utilized but there wouldn't be any user processes running on it, at which point we'd reboot it, jobs would requeue and run/complete normally until the next time it happened.
 
> Sorry for starting again from the beginning, but I need to have everything
> clearly identified.

No worries, I may end up having to look back in time to correlate this with a specific user's jobs if it doesn't appear again.

> In the meantime, you may do this checks:
> 
> a) Please, do a:  'grep -r "possible file system problem" slurm.log', you
> will see lots of:
> 
> [2018-08-23T08:46:52.792] Warning: slurmstepd startup took XX sec, possible
> file system problem or full memory
> 
> which reinforces the theory of I/O problems, maybe is just a warning but
> there's a considerable slowdown at some point as you will see.

I see that message logged a lot on this node, but not necessarily correlated with any of its reboots or kernel panics. I don't see it logged much around previous reboots associated with this problem. It's extremely rare on any other node, so I'm not sure I can actually blame the storage. I'll do some testing to see if there really is an I/O problem here, but since I recently used this node specifically to do a bunch of I/O-heavy tasks without any problems, again, I'm not sure I can blame storage for the slowdown in slurmstepd startup. I don't mind blaming SGI, err, I mean, HP.


> b) Are the system limits tuned properly, specially nr_open_files? With
> systemd it can change the slurmd limits, take a look at /proc/$(pidof
> slurmd)/limits

[root@sgiuv300-srcf-d10-01 ~]# cat /proc/sys/fs/file-max 
1013864338

[root@sgiuv300-srcf-d10-01 ~]# cat /etc/systemd/system/slurmd.service | grep Limit
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             39993724             39993724             processes 
Max open files            51200                51200                files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       39993724             39993724             signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

Just a few days ago we ran some jobs that produced > 14000 processes/threads and load > 3000 without hitting any limit problems (or having those jobs produce any zombies.)

> Check specially these pages if you haven't done so:
> https://slurm.schedmd.com/high_throughput.html
> https://slurm.schedmd.com/big_sys.html

It's been a while since I went through these, but will revisit them again in case something has changed or I have missed a critical bit.

> c) In comment 3 you said:
> 
> "but leave behind the zombie sleeps or they stay in Running state forever,
> even after the slurmstepd has been killed (again leaving behind an optional
> zombie sleep.)"
>
> Do you always see them in squeue? I need a more clear case, when you
> identify one I need:
> 
> - jobid
> - squeue
> - pstree -p
> - ps aux
> - thread apply all bt full of the related slurmstepd cases, this means the
> processes of checking ps aux|grep $jobid corresponding to extern, batch, .0
> steps, provided that the user task is still not running.
> 

It looks like I need to either look back in time, find a specific set of jobs this is correlated with and build a reproducer, or wait until the problem recurs and let it reach the point of stopping the node before collecting more info. If you happen to have an epiphany, please send it our way; otherwise, consider the ball in my court and leave this on the back burner until I have found a way to reproduce.

Thanks and sorry for the wild goose chase with the R processes although I have enjoyed learning more about how the kernel handles orphaned processes.

jbh
Comment 30 Felip Moll 2018-08-28 08:21:45 MDT
(In reply to John Hanks from comment #29)
> (In reply to Felip Moll from comment #28)
> > This means that all jobs in your output were running. If you think it isn't
> > this way, please, show me the exact hung process and job id, because I am
> > not seeing it.
> 
> This is expanding my understanding of zombies. I'm still not sure how the
> slurmstep losing track of it's children is ok, even if they do eventually
> terminate successfully, but since these jobs did finish that makes them
> unlike the previous jobs which left behind the sleep processes and
> slurmstepd's in wait() or in a state where I couldn't attach with gdb.

Keep in mind that slurmstepd will only wait4() for the processes that are immediately created by it.
This means that if a process does fork(), the subprocesses will not be waited for by slurmstepd. This was the case in comment 21, where the stepd would wait4() for qiime but not for R and its threads. So zombies can certainly occur.

  |-slurmstepd,260097                           
  |   |-qiime,260146 /home/abganz/.conda/envs/qiime2/bin/qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --output-dir dada2-209 --p-n-threads 15 ...
  |   |   `-R,261285 --slave --no-restore --file=/home/abganz/.conda/envs/qiime2/bin/run_dada_paired.R --args /tmp/tmpwb3nmuzq/forward /tmp/tmpwb3nmuzq/reverse /tmp/tmpwb...
  |   |      |-(R,261412)
  |   |      |-{R},301848
  |   |      |-{R},301849
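
To make the wait4() scoping concrete, here is a minimal self-contained sketch (hypothetical demo code, not Slurm's implementation): the top process reaps only its direct child, so a grandchild that exits while its own parent keeps running stays in Z state until that parent also exits and init adopts what is left.

/* grandchild_zombie.c - sketch of the comment 21 situation. The top
 * process stands in for slurmstepd, the child for qiime, and the
 * grandchild for a forked R worker. Build: cc -o demo grandchild_zombie.c */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();

    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {                  /* "qiime": forks a worker, never waits */
        pid_t worker = fork();
        if (worker == 0)
            _exit(0);                  /* "R" worker exits -> becomes a zombie */
        fprintf(stderr, "worker %d exited; check its state with ps\n",
                (int)worker);
        sleep(60);                     /* parent keeps running, zombie persists */
        _exit(0);                      /* only now does init adopt and reap it */
    }

    /* "slurmstepd": waits only for its direct child, never the grandchild */
    waitpid(child, NULL, 0);
    return 0;
}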


Of course, the sleep process must be waited for, so if you tell me you saw zombie sleeps around, that could be because slurmstepd died at some point, i.e. related to something like this, and this should be investigated:

Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd      D ffff96cac7dd0000     0 377005      1 0x00000080

Maybe it was a particular situation?

> Eventually the node would be
> reporting it was fully utilized but there wouldn't be any user processes
> running on it, at which point we'd reboot it, jobs would requeue and
> run/complete normally until the next time it happened.

If a lot of processes created a lot of zombies, the kernel process table could fill up and the OS would be unable to create new processes.
Having a lot of jobs that create a lot of threads that generate zombies could have this effect.

The question is: if the user application is waiting correctly, why are the zombies created?
Maybe it would be worth taking a look at the user's code. Don't rule out a kernel bug either.
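
For reference, "waiting correctly" usually looks something like the following sketch (a hypothetical example, not the user's application): install a SIGCHLD handler that reaps every exited child with waitpid(-1, NULL, WNOHANG) in a loop, so zombies never accumulate.

/* reap.c - reap children from a SIGCHLD handler so none linger as
 * zombies. Hypothetical demo code. Build: cc -o reap reap.c */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void reap_children(int sig)
{
    int saved_errno = errno;
    (void)sig;
    /* One SIGCHLD may stand for several exited children, so loop. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    errno = saved_errno;
}

int main(void)
{
    struct sigaction sa;
    int i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) < 0) {
        perror("sigaction");
        return 1;
    }

    for (i = 0; i < 10; i++)           /* fork workers that exit at once */
        if (fork() == 0)
            _exit(0);

    sleep(2);                          /* give them a moment; may return early */
    printf("ps --ppid %d should show no defunct children\n", (int)getpid());
    return 0;
}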


> It looks like I need to either look back in time and find a specific set of
> jobs this is correlated with and build a reproducer or wait until the
> problem re-occurs and let it reach the point of stopping the node before
> collecting more info. If you happen to have an epiphany, please send it our
> way otherwise consider the ball in my court and leave this on the back
> burner until I have found a way to reproduce. 

Cool, I agree this is the correct way for now. I'll be waiting for fresh news.
Btw, from Thursday I am on vacation until 9th Sept, but my colleagues will be looking at the bug in the meantime.

> Thanks and sorry for the wild goose chase with the R processes although I
> have enjoyed learning more about how the kernel handles orphaned processes.

No worries! I am also enjoying this case, and it forced me to refresh my knowledge on the matter :)
Comment 31 Felip Moll 2018-09-13 17:21:57 MDT
Hi John,

I assume you still haven't found a reproducer. I think that at this point, after 2+ weeks, the best option is to mark the bug as INFOGIVEN and let you reopen it and mark it as UNCONFIRMED again if at some point you are able to get fresh info.

What do you think?
Comment 32 John Hanks 2018-09-13 18:19:11 MDT
Hi Felip,

We've moved on to 18.08.0 now and upgraded to the latest CentOS kernel and so far haven't seen this reappear. It's safe to close this and if I ever find a reproducer for it, I know where to find you.

Thanks,

jbh
Comment 33 Felip Moll 2018-09-14 05:37:50 MDT
Very good,

I am closing this issue now.

Regards!
Comment 34 John Hanks 2018-11-14 09:59:25 MST
And, we're back. Another recurrence of the slurm(step)d losing track of its children; this time a gzip process went astray. The view from top:

top - 08:51:48 up 16:45,  2 users,  load average: 84.28, 84.62, 86.74
Tasks: 3581 total,   6 running, 3574 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.9 us,  0.1 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 10238839+total, 82755717+free, 38465574+used, 15786117+buff/cache
KiB Swap: 13674601+total, 13674505+free,    95744 used. 98321561+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
173479 kameronr  20   0  113180   1560   1312 S   0.0  0.0   0:00.03 slurm_script                                                                              
173733 kameronr  20   0 3017240   2.7g   2396 D   0.0  0.0  11:00.28 perl                                                                                      
174988 kameronr  20   0       0      0      0 Z   0.0  0.0   0:00.62 gzip                                                                                      
175000 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:38.84 perl                                                                                      
175001 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:20.54 perl                                                                                      
175003 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:04.20 perl                                                                                      
175006 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:12.28 perl                                                                                      
175007 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  11:13.32 perl                                                                                      
175010 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:08.07 perl                                                                                      
175013 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:21.57 perl                                                                                      
175015 kameronr  20   0 3017240   2.7g    548 D   0.0  0.0  10:41.05 perl                                                                                      
175016 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:26.48 perl                                                                                      
175019 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:36.88 perl                                                                                      
175024 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:27.69 perl                                                                                      

This job thinks it is in Running state, but I think it will never finish because gzip is still hanging around:

[root@sgiuv300-srcf-d10-01 foundation-2.19]# pstree -u kameronr
slurm_script───perl─┬─gzip
                    └─11*[perl]

In dmesg I see this, but I'm pretty sure this happened before the job above was started. Just putting it here for the sake of completeness:

[Tue Nov 13 16:18:05 2018] ------------[ cut here ]------------
[Tue Nov 13 16:18:05 2018] WARNING: CPU: 350 PID: 42860 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10
[Tue Nov 13 16:18:05 2018] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
ys_fops ipmi_devintf vfio_iommu_type1 vfio nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_c
onntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_serv
ice_time dm_multipath ipmi_si ipmi_msghandler sunrpc dm_mirror dm_region_hash dm_log dm_mod sr_mod cdrom intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kv
m iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcs
pkr joydev
[Tue Nov 13 16:18:05 2018]  mxm_wmi video ttm uas usb_storage nvme drm sg nvme_core i2c_i801 lpc_ich ioatdma shpchp wmi binfmt_misc ip_tables mpt3sas raid_class sc
si_transport_sas i40e igb i2c_algo_bit dca ptp pps_core i2c_core megaraid_sas ahci libahci libata sd_mod crc_t10dif crct10dif_common [last unloaded: ipmi_devintf]
[Tue Nov 13 16:18:05 2018] CPU: 350 PID: 42860 Comm: slurmstepd Tainted: P           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
[Tue Nov 13 16:18:05 2018] Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 06/15/2018
[Tue Nov 13 16:18:05 2018] Call Trace:
[Tue Nov 13 16:18:05 2018]  [<ffffffffa3d13754>] dump_stack+0x19/0x1b
[Tue Nov 13 16:18:05 2018]  [<ffffffffa36945d8>] __warn+0xd8/0x100
[Tue Nov 13 16:18:05 2018]  [<ffffffffa369471d>] warn_slowpath_null+0x1d/0x20
[Tue Nov 13 16:18:05 2018]  [<ffffffffa3725f0a>] cpuset_write_resmask+0x80a/0xa10
[Tue Nov 13 16:18:05 2018]  [<ffffffffa371b020>] ? cgroup_rightmost_descendant+0x80/0x80
[Tue Nov 13 16:18:05 2018]  [<ffffffffa3725700>] ? update_cpumasks_hier+0x410/0x410
[Tue Nov 13 16:18:05 2018]  [<ffffffffa371d411>] cgroup_file_write+0x1d1/0x2d0
[Tue Nov 13 16:18:05 2018]  [<ffffffffa38221c8>] ? __sb_start_write+0x58/0x110
[Tue Nov 13 16:18:05 2018]  [<ffffffffa38d5437>] ? security_file_permission+0x27/0xa0
[Tue Nov 13 16:18:05 2018]  [<ffffffffa381f240>] vfs_write+0xc0/0x1f0
[Tue Nov 13 16:18:05 2018]  [<ffffffffa382006f>] SyS_write+0x7f/0xf0
[Tue Nov 13 16:18:05 2018]  [<ffffffffa3d2579b>] system_call_fastpath+0x22/0x27
[Tue Nov 13 16:18:05 2018] ---[ end trace 1a3d8cbb1a4bc6dd ]---

And a bt from the guilty slurmstepd:

(gdb) thread apply all bt full

Thread 5 (Thread 0x2b02a998b700 (LWP 173420)):
#0  0x00002b02aaacb995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b02a9b46433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2b02b08b4700 (LWP 173421)):
#0  0x00002b02aaacbd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b02a9b407e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
        err = <optimized out>
        i = 4
        now = 1542214614
        diff = <optimized out>
        tvnow = {tv_sec = 1542162036, tv_usec = 11681}
        abs = {tv_sec = 1542214615, tv_nsec = 11681000}
        __func__ = "_timer_thread"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2b02b09b5700 (LWP 173422)):
#0  0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b02a9bd99d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b02b40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x1fc9350) at eio.c:328
        retval = 0
        pollfds = 0x2b02b40008d0
        map = 0x2b02b4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1fe2e20) at req.c:273
        job = 0x1fe2e20
#4  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b02b18e8700 (LWP 173424)):
#0  0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b02b0fd0942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x20f0b70
        ret = <optimized out>
        res = 0
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.

Hopefully something in all that has a useful clue. Checking with the user to see if it's OK to leave this job hanging around for troubleshooting.

Best,

jbh
Comment 35 John Hanks 2018-11-14 10:00:47 MST
Sorry, the bt got truncated in the last comment; the full one is here:

(gdb) thread apply all bt full

Thread 5 (Thread 0x2b02a998b700 (LWP 173420)):
#0  0x00002b02aaacb995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b02a9b46433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2b02b08b4700 (LWP 173421)):
#0  0x00002b02aaacbd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b02a9b407e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
        err = <optimized out>
        i = 4
        now = 1542214614
        diff = <optimized out>
        tvnow = {tv_sec = 1542162036, tv_usec = 11681}
        abs = {tv_sec = 1542214615, tv_nsec = 11681000}
        __func__ = "_timer_thread"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2b02b09b5700 (LWP 173422)):
#0  0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b02a9bd99d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b02b40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x1fc9350) at eio.c:328
        retval = 0
        pollfds = 0x2b02b40008d0
        map = 0x2b02b4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1fe2e20) at req.c:273
        job = 0x1fe2e20
#4  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b02b18e8700 (LWP 173424)):
#0  0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b02b0fd0942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x20f0b70
        ret = <optimized out>
        res = 0
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2b02a9889000 (LWP 173411)):
#0  0x00002b02aada122a in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000040f246 in _wait_for_any_task (job=job@entry=0x1fe2e20, waitflag=waitflag@entry=true) at mgr.c:2116
        t = <optimized out>
        rc = <optimized out>
        status = 0
        pid = <optimized out>
        completed = 0
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, 
            __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {
            ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, 
            __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {
            ru_nvcsw = 6151706, __ru_nvcsw_word = 6151706}, {ru_nivcsw = 47290437680538, __ru_nivcsw_word = 47290437680538}}
        tmp_env = <optimized out>
        task_offset = 0
        __func__ = "_wait_for_any_task"
#2  0x0000000000410cb3 in _wait_for_all_tasks (job=0x1fe2e20) at mgr.c:2227
        rc = <optimized out>
        tasks_left = 1
        i = 0
#3  job_manager (job=job@entry=0x1fe2e20) at mgr.c:1424
        rc = 0
        io_initialized = true
        ckpt_type = 0x1fc93c0 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#4  0x000000000040d3ea in main (argc=1, argv=0x7ffc18229038) at slurmstepd.c:174
        cli = 0x1fc9ff0
        self = 0x0
        msg = 0x1fc6bc0
        rc = 0
        launch_params = 0x0
        __func__ = "main"
(gdb)
Comment 36 Felip Moll 2018-11-14 12:33:47 MST
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 
> 173479 kameronr  20   0  113180   1560   1312 S   0.0  0.0   0:00.03
> slurm_script                                                                
> 
> 173733 kameronr  20   0 3017240   2.7g   2396 D   0.0  0.0  11:00.28 perl   
> 
> 174988 kameronr  20   0       0      0      0 Z   0.0  0.0   0:00.62 gzip   
> 
> 175000 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:38.84 perl   
> 
> 175001 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:20.54 perl   
> 
> 175003 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:04.20 perl   
> 
> 175006 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:12.28 perl   
> 
> 175007 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  11:13.32 perl   
> 
> 175010 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:08.07 perl   
> 
> 175013 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:21.57 perl   
> 
> 175015 kameronr  20   0 3017240   2.7g    548 D   0.0  0.0  10:41.05 perl   
> 
> 175016 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:26.48 perl   
> 
> 175019 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:36.88 perl   
> 
> 175024 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:27.69 perl   

wow, that's absolutely interesting.

The slurm_script is run from slurmstepd (or it should be; a complete pstree -psal as root would help even more), and it hasn't finished. This means perl hasn't finished correctly, and we see it in D state. gzip is in Z state, which indicates that perl hasn't waited for it.
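
As a reminder of what Z means here: a zombie is just a child that has exited but whose parent has not called wait()/waitpid() on it yet. A minimal shell sketch of that situation, nothing Slurm-specific (PIDs and durations are arbitrary):

# the shell forks a short-lived child, then exec()s into sleep, which never wait()s
sh -c 'sleep 1 & exec sleep 300' &
sleep 3
ps -o pid,ppid,stat,comm --ppid "$!"   # the exited child shows STAT Z (defunct)

That is the same parent/child relationship we see between perl and gzip above.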
 
> [root@sgiuv300-srcf-d10-01 foundation-2.19]# pstree -u kameronr
> slurm_script───perl─┬─gzip
>                     └─11*[perl]

It seems that the perl script had some problem in its execution... it is curious that this time it is perl, and before it was an R program.

I am pretty sure that the provided kernel stack trace is somehow related. There may be a bug in the kernel cgroup code or similar: cpuset_write_resmask() is the kernel function used to modify a cpu or mem cgroup file, and it is used indirectly by Slurm when adding/removing tasks to the cgroup or setting limits.
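
For reference, the kind of write that ends up in cpuset_write_resmask() is what slurmstepd does when it sets cpuset.cpus for the job/uid cgroups; for example (the path and CPU list here are only illustrative):

# every write to cpuset.cpus (or cpuset.mems) on a v1 cpuset hierarchy goes
# through cpuset_write_resmask() in the kernel, which is where the WARN fires
echo '0-23,25-31,33-42' > /sys/fs/cgroup/cpuset/slurm/uid_327881/cpuset.cpus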

Does the kernel dump coincide with the start time of the job, or similar?

I think we should investigate this path.
Comment 37 John Hanks 2018-11-14 15:26:12 MST
Created attachment 8302 [details]
pstree -psal
Comment 38 John Hanks 2018-11-14 15:29:13 MST
pstree -psal has been attached.

That kernel stack trace happened two hours before the job in question started. The user cancelled the job and it was stuck in CG state for a while; eventually SLURM cleared it and marked the node "draining Kill task failed". The processes are still hanging around, though, if you want anything else collected from them.

jbh
Comment 39 John Hanks 2018-11-14 17:19:24 MST
Rebooted the node, and it looks like the kernel backtrace is consistently happening (once?) at every boot when a job starts:

[ 4157.609389] ------------[ cut here ]------------
[ 4157.609428] WARNING: CPU: 310 PID: 232813 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10
[ 4157.609431] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ipmi_devintf vfio_iommu_type1 vfio nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_service_time dm_multipath ipmi_si ipmi_msghandler sunrpc dm_mirror dm_region_hash dm_log dm_mod sr_mod cdrom intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr joydev
[ 4157.609517]  mxm_wmi video ttm uas usb_storage drm nvme sg nvme_core i2c_i801 lpc_ich ioatdma shpchp wmi binfmt_misc ip_tables mpt3sas raid_class scsi_transport_sas i40e igb i2c_algo_bit dca ptp pps_core i2c_core megaraid_sas ahci libahci libata sd_mod crc_t10dif crct10dif_common [last unloaded: ipmi_devintf]
[ 4157.609564] CPU: 310 PID: 232813 Comm: slurmstepd Tainted: P           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
[ 4157.609567] Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 06/15/2018
[ 4157.609570] Call Trace:
[ 4157.609590]  [<ffffffff99313754>] dump_stack+0x19/0x1b
[ 4157.609609]  [<ffffffff98c945d8>] __warn+0xd8/0x100
[ 4157.609613]  [<ffffffff98c9471d>] warn_slowpath_null+0x1d/0x20
[ 4157.609616]  [<ffffffff98d25f0a>] cpuset_write_resmask+0x80a/0xa10
[ 4157.609634]  [<ffffffff98d1b020>] ? cgroup_rightmost_descendant+0x80/0x80
[ 4157.609668]  [<ffffffff98d25700>] ? update_cpumasks_hier+0x410/0x410
[ 4157.609672]  [<ffffffff98d1d411>] cgroup_file_write+0x1d1/0x2d0
[ 4157.609689]  [<ffffffff98e221c8>] ? __sb_start_write+0x58/0x110
[ 4157.609729]  [<ffffffff98ed5437>] ? security_file_permission+0x27/0xa0
[ 4157.609758]  [<ffffffff98e1f240>] vfs_write+0xc0/0x1f0
[ 4157.609779]  [<ffffffff98e2006f>] SyS_write+0x7f/0xf0
[ 4157.609811]  [<ffffffff9932579b>] system_call_fastpath+0x22/0x27
[ 4157.609830] ---[ end trace f3345453c14bf075 ]---
Comment 40 John Hanks 2018-11-15 09:15:29 MST
Another job with some zombies, bt:

(gdb) thread apply all bt full

Thread 5 (Thread 0x2afcac39c700 (LWP 233196)):
#0  0x00002afcad4dc995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002afcac557433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2afcb32c5700 (LWP 233197)):
#0  0x00002afcad4dcd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002afcac5517e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
        err = <optimized out>
        i = 4
        now = 1542298269
        diff = <optimized out>
        tvnow = {tv_sec = 1542240723, tv_usec = 671182}
        abs = {tv_sec = 1542298270, tv_nsec = 671182000}
        __func__ = "_timer_thread"
#2  0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2afcb33c6700 (LWP 233198)):
#0  0x00002afcad7e0f0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002afcac5ea9d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2afcb40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x1277a30) at eio.c:328
        retval = 0
        pollfds = 0x2afcb40008d0
        map = 0x2afcb4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1179e20) at req.c:273
        job = 0x1179e20
#4  0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2afcb8302700 (LWP 233213)):
#0  0x00002afcad7e0f0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002afcb39e1942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x11ba950
        ret = <optimized out>
        res = 0
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2afcac29a000 (LWP 233191)):
#0  0x00002afcad7b222a in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000040f246 in _wait_for_any_task (job=job@entry=0x1179e20, waitflag=waitflag@entry=true) at mgr.c:2116
        t = <optimized out>
        rc = <optimized out>
        status = 0
        pid = <optimized out>
        completed = 0
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, 
            __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {
            ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 18325024, __ru_msgrcv_word = 18325024}, {
            ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 6152281, __ru_nvcsw_word = 6152281}, {ru_nivcsw = 47264711986586, __ru_nivcsw_word = 47264711986586}}
        tmp_env = <optimized out>
        task_offset = 0
        __func__ = "_wait_for_any_task"
#2  0x0000000000410cb3 in _wait_for_all_tasks (job=0x1179e20) at mgr.c:2227
        rc = <optimized out>
        tasks_left = 1
        i = 0
#3  job_manager (job=job@entry=0x1179e20) at mgr.c:1424
        rc = 0
        io_initialized = true
        ckpt_type = 0x1154320 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#4  0x000000000040d3ea in main (argc=1, argv=0x7ffc763ac398) at slurmstepd.c:174
        cli = 0x1160ff0
        self = 0x0
        msg = 0x115dbc0
        rc = 0
        launch_params = 0x0
        __func__ = "main"
(gdb)
Comment 41 John Hanks 2018-11-15 09:17:10 MST
Created attachment 8309 [details]
pstree -psal
Comment 42 Felip Moll 2018-11-16 05:39:19 MST
John,

After looking at the kernel code I see that the kernel stack trace is being dumped from a call to:

WARN_ON(!is_in_v2_mode() && !nodes_equal(cp->mems_allowed, cp->effective_mems)); 

This may indicate one of two failures: either the nodes_equal comparison is failing, which could mean a corrupted bitmap, or your cpuset cgroup is not operating in v2 mode.

Operating in v1 mode shouldn't be a concern on normal systems, but since this is an SGI UV300 special machine, I don't know how it manages CPUs.

Quoting RHEL Release Notes:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.5_release_notes/index

Cpuset is now able to restore the effective CPU mask after a pair of offline and online events
Prior to this update, the cpuset filesystem, which confines processes to processor and memory
node subsets, had one bitmap set enabled for CPUs used in a cpuset. As a consequence, a CPU
offline event followed by a CPU online event caused the affected CPU to be removed from all
non-root cpusets. With this update, cpuset has two bitmap sets enabled. As a result, cpuset
is now able to properly track CPU online or offline events to restore the effective CPU mask
as long as the -o cpuset_v2_mode mount flag is used when mounting cpuset cgroup. (BZ#947004) 

I am not saying this is the exact cause, but it could be, given your high number of CPUs (720), how SGI deals with them, and Slurm's use of cpusets:

[2018-08-23T08:47:43.950] [3358358.extern] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '32,392,0-23,25-31,33-42,44,46-51,360-383,385-391,393-402,404,406-411' for '/sys/fs/cgroup/cpuset/slurm/uid_327881'

Would you be able to try mounting the cpuset cgroup with this option, to see whether after a reboot you still get the kernel dump and/or the issue?

Thanks
Felip M
Comment 43 Felip Moll 2018-11-16 06:19:20 MST
In addition to my last comment, it would also be good to know
whether you have any particular configuration for SGI cpusets.

I don't know these machines in depth, but I remember that you can
configure different cpusets for different purposes. Can you
check the cpuset configuration on this machine, if any?
Comment 44 John Hanks 2018-11-16 10:16:43 MST
Our cgroups config is whatever CentOS 7.5 does out-of-the-box, which appears to be v1 configured via systemd. As far as I can tell v2 is not an option:

[root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
[root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
mount: unknown filesystem type 'cgroup2'

The related mounts are:

[root@dper7425-srcf-d10-37 ~]# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)

We are not tied to this config, so if it needs to be modified to test anything, happy to do it.

We've reproduced this outside of SLURM on a different node now (Dell PE 7425 with AMD EPYC CPUs), at least for this particular application and it doesn't look like the app itself cares that the gzip process is a zombie, but we'll need to let it run another day or two to be sure it finishes correctly. Since previous occurrences were the sleep processes started by SLURM as well as application processes from jobs, maybe we have a generic problem here that SLURM just highlights because things go off the rails at job completion when SLURM attempts to clean up the job?
Comment 45 John Hanks 2018-11-16 10:44:26 MST
For the sake of completeness/comparison, a `pstree -u kameronr -psal` for the non-SLURM zombie gzip incident:

screen,16735 -S analysis
  └─bash,16737
      └─perl,17085 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          ├─(gzip,17168)
          ├─perl,17127 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq
          │   └─bowtie2-align-s,18529 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq
          │       └─{bowtie2-align-s},21283
          ├─perl,21347 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq
          │   └─bowtie2-align-s,21500 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq
          │       └─{bowtie2-align-s},22684
          ├─perl,79980 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,21026 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq
          │   │   └─bowtie2-align-s,21272 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq
          │   │       └─{bowtie2-align-s},22222
          │   ├─perl,22264 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq
          │   │   └─bowtie2-align-s,22579 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq
          │   │       └─{bowtie2-align-s},24946
          │   └─sh,25007 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_bismark_bt2_pe.bam
          │       └─samtools,25022 view -bSh -
          ├─perl,79981 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,21830 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq
          │   │   └─bowtie2-align-s,21902 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq
          │   │       └─{bowtie2-align-s},23746
          │   ├─perl,23807 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq
          │   │   └─bowtie2-align-s,24063 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq
          │   │       └─{bowtie2-align-s},26016
          │   └─sh,26064 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_bismark_bt2_pe.bam
          │       └─samtools,26067 view -bSh -
          ├─perl,79983 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,21925 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq
          │   │   └─bowtie2-align-s,22000 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq
          │   │       └─{bowtie2-align-s},23813
          │   ├─perl,23883 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq
          │   │   └─bowtie2-align-s,24114 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq
          │   │       └─{bowtie2-align-s},25810
          │   └─sh,25865 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_bismark_bt2_pe.bam
          │       └─samtools,25873 view -bSh -
          ├─perl,79985 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,18898 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq
          │   │   └─bowtie2-align-s,19757 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq
          │   │       └─{bowtie2-align-s},21566
          │   ├─perl,21643 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq
          │   │   └─bowtie2-align-s,21750 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq
          │   │       └─{bowtie2-align-s},23263
          │   └─sh,23404 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_bismark_bt2_pe.bam
          │       └─samtools,23444 view -bSh -
          ├─perl,79988 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,15326 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq
          │   │   └─bowtie2-align-s,16580 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq
          │   │       └─{bowtie2-align-s},20983
          │   ├─perl,21018 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq
          │   │   └─bowtie2-align-s,21298 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq
          │   │       └─{bowtie2-align-s},22785
          │   └─sh,22910 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_bismark_bt2_pe.bam
          │       └─samtools,22939 view -bSh -
          ├─perl,79989 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,22508 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq
          │   │   └─bowtie2-align-s,22664 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq
          │   │       └─{bowtie2-align-s},24877
          │   ├─perl,24928 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq
          │   │   └─bowtie2-align-s,25110 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq
          │   │       └─{bowtie2-align-s},27069
          │   └─sh,27169 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_bismark_bt2_pe.bam
          │       └─samtools,27196 view -bSh -
          ├─perl,79991 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20
          │   ├─perl,21307 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq
          │   │   └─bowtie2-align-s,21473 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq
          │   │       └─{bowtie2-align-s},22774
          │   ├─perl,22895 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq
          │   │   └─bowtie2-align-s,23183 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq
          │   │       └─{bowtie2-align-s},24641
          │   └─sh,24734 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_bismark_bt2_pe.bam
          │       └─samtools,24775 view -bSh -
          └─sh,22753 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_bismark_bt2_pe.bam
              └─samtools,22756 view -bSh -
Comment 46 Felip Moll 2018-11-16 12:34:01 MST
(In reply to John Hanks from comment #44)
> Our cgroups config is whatever CentOS 7.5 does out-of-the-box, which appears
> to be v1 configured via systemd. As far as I can tell v2 is not an option:
> 
> [root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
> [root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
> mount: unknown filesystem type 'cgroup2'

It is a mount option that makes the cgroup v1 cpuset controller behave like cgroup v2.

The option is '-o cpuset_v2_mode', as quoted from the RHEL release notes.


> We are not tied to this config, so if it needs to be modified to test
> anything, happy to do it.

I think it is worth trying, at least to see if it removes this kernel warning.
 
> We've reproduced this outside of SLURM on a different node now (Dell PE 7425
> with AMD EPYC CPUs), at least for this particular application and it doesn't
> look like the app itself cares that the gzip process is a zombie, but we'll
> need to let it run another day or two to be sure it finishes correctly.
> Since previous occurrences were the sleep processes started by SLURM as well
> as application processes from jobs, maybe we have a generic problem here
> that SLURM just highlights because things go off the rails at job completion
> when SLURM attempts to clean up the job?

This is not good at all and breaks our previous assumptions:
1. it is not due to the SGI UV300 architecture
2. it is not due to the application

And it suggests possible new ones:
3. it is not Slurm's fault
4. it may be a kernel bug?

Does this other machine have the same kernel? Can we rule out a kernel bug, or is that theory still valid? Does dmesg also show the kernel warning on this other node?

The application could be ignoring its child processes and masking signals, which are two very bad things. We should verify that in the perl and qiime cases. Was qiime a perl process?
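
A quick, non-intrusive way to check that on the live processes, using the stuck perl (PID 173733) from the top output above as an example:

# signal dispositions: SigBlk = blocked, SigIgn = ignored, SigCgt = caught
grep -E 'State|SigBlk|SigIgn|SigCgt' /proc/173733/status
# and any unreaped (zombie) children of that perl
ps -o pid,stat,comm --ppid 173733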


Are we in sync on these ideas?
Comment 47 John Hanks 2018-11-16 13:33:57 MST
(In reply to Felip Moll from comment #46)
> (In reply to John Hanks from comment #44)
> > Our cgroups config is whatever CentOS 7.5 does out-of-the-box, which appears
> > to be v1 configured via systemd. As far as I can tell v2 is not an option:
> > 
> > [root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
> > [root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
> > mount: unknown filesystem type 'cgroup2'
> 
> It is a mount option that makes the cgroup v1 cpuset controller behave like cgroup v2.
> 
> The option is '-o cpuset_v2_mode', as quoted from the RHEL release notes.
> 
> 
> > We are not tied to this config, so if it needs to be modified to test
> > anything, happy to do it.
> 
> I think it is worth trying, at least to see if it removes this kernel warning.
>  
> > We've reproduced this outside of SLURM on a different node now (Dell PE 7425
> > with AMD EPYC CPUs), at least for this particular application and it doesn't
> > look like the app itself cares that the gzip process is a zombie, but we'll
> > need to let it run another day or two to be sure it finishes correctly.
> > Since previous occurrences were the sleep processes started by SLURM as well
> > as application processes from jobs, maybe we have a generic problem here
> > that SLURM just highlights because things go off the rails at job completion
> > when SLURM attempts to clean up the job?
> 
> This is not good at all and breaks our previous assumptions:
> 1. it is not due to the SGI UV300 architecture
> 2. it is not due to the application
> 
> And it suggests possible new ones:
> 3. it is not Slurm's fault
> 4. it may be a kernel bug?

This is what I am concluding as well.

> Does this other machine have the same kernel? Can we rule out a kernel bug,
> or is that theory still valid? Does dmesg also show the kernel warning on
> this other node?

Both nodes boot disklessly from the same image, same kernel. From an OS perspective, they are identical.

> The application could be ignoring its child processes and masking
> signals, which are two very bad things. We should verify that in
> the perl and qiime cases. Was qiime a perl process?

qiime is a python wrapper that calls a lot of other stuff. bismark is a perl wrapper that calls a bunch of other tools. The R job was just R, but in its case it was the sleep commands started by slurmstepd that became zombies.

The bismark run outside of SLURM finished successfully (as far as the app's output is concerned), leaving behind a zombie gzip and two zombie perl processes.

> Are we in sync on these ideas?

I think we are.

systemd rants in next comment :)
Comment 48 John Hanks 2018-11-16 13:37:32 MST
I've tried this:

[root@sgiuv300-srcf-d10-01 ~]# mount  | grep cpuset
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
[root@sgiuv300-srcf-d10-01 ~]# mount -o remount,cpuset_v2_mode /sys/fs/cgroup/cpuset
mount: /sys/fs/cgroup/cpuset not mounted or bad option

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Of course nothing in dmesg with any clues. Between google, looking in my /etc and probing numerous man pages, I have not yet found where I would actually configure systemd to use this mount option for mounting cpuset. Still looking, only stopping to occasionally stick pins in my Poettering voodoo doll or delicately roast it over an open candle flame.
Comment 49 John Hanks 2018-11-16 14:07:49 MST
> Does dmesg also show the kernel warning on this other node?

Checked all nodes; at least 42 (out of 77) have the kernel stack trace at boot time. Sample from a Dell R730xd:

[24259.423477] ------------[ cut here ]------------
[24259.423493] WARNING: CPU: 11 PID: 38788 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10
[24259.423496] Modules linked in: nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sunrpc dm_mirror dm_region_hash dm_log dm_mod dcdbas iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit ttm drm_kms_helper pcspkr syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core mei_me lpc_ich mei sg shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter binfmt_misc ip_tables ixgbe
[24259.423568]  dca mdio megaraid_sas tg3 ptp pps_core ahci libahci libata sd_mod crc_t10dif crct10dif_common
[24259.423584] CPU: 11 PID: 38788 Comm: slurmstepd Tainted: P           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
[24259.423587] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.6.0 10/26/2017
[24259.423589] Call Trace:
[24259.423598]  [<ffffffffaa713754>] dump_stack+0x19/0x1b
[24259.423606]  [<ffffffffaa0945d8>] __warn+0xd8/0x100
[24259.423611]  [<ffffffffaa09471d>] warn_slowpath_null+0x1d/0x20
[24259.423615]  [<ffffffffaa125f0a>] cpuset_write_resmask+0x80a/0xa10
[24259.423623]  [<ffffffffaa11b020>] ? cgroup_rightmost_descendant+0x80/0x80
[24259.423628]  [<ffffffffaa125700>] ? update_cpumasks_hier+0x410/0x410
[24259.423632]  [<ffffffffaa11d411>] cgroup_file_write+0x1d1/0x2d0
[24259.423639]  [<ffffffffaa2221c8>] ? __sb_start_write+0x58/0x110
[24259.423646]  [<ffffffffaa2d5437>] ? security_file_permission+0x27/0xa0
[24259.423651]  [<ffffffffaa21f240>] vfs_write+0xc0/0x1f0
[24259.423656]  [<ffffffffaa22006f>] SyS_write+0x7f/0xf0
[24259.423663]  [<ffffffffaa72579b>] system_call_fastpath+0x22/0x27
[24259.423667] ---[ end trace 2f4806cb5ac5b506 ]---
Comment 50 John Hanks 2018-11-16 14:24:13 MST
Well, the config file of interest seems to be:

https://github.com/systemd/systemd/blob/master/src/core/mount-setup.c

Any suggestions welcome. My Poettering voodoo doll can't take much more abuse, though, and it doesn't seem to help anyway.

Can I revert to having SLURM mount its own view of these and control them as we did back in the CentOS 6 days of old?
Comment 51 John Hanks 2018-11-16 15:19:51 MST
It seems that it is possible to mount cgroups with the v2 option at another location,

[root@sgiuv300-srcf-d10-01 ~]# mount -t cgroup -o cpuset,cpuset_v2_mode none /mnt/cgroup2/
[root@sgiuv300-srcf-d10-01 ~]# ls /mnt/cgroup2/
cgroup.clone_children  cpuset.cpu_exclusive   cpuset.mem_exclusive    cpuset.memory_pressure_enabled  cpuset.sched_load_balance        slurm
cgroup.event_control   cpuset.cpus            cpuset.mem_hardwall     cpuset.memory_spread_page       cpuset.sched_relax_domain_level  tasks
cgroup.procs           cpuset.effective_cpus  cpuset.memory_migrate   cpuset.memory_spread_slab       notify_on_release
cgroup.sane_behavior   cpuset.effective_mems  cpuset.memory_pressure  cpuset.mems                     release_agent

Based on that I'm going to test building a separate CgroupMountpoint= location for SLURM to use and if it works, will split the cluster between using v1 and v2 and see if there is a difference in outcomes.
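
Roughly the sketch I have in mind (paths are placeholders, and I'm assuming Slurm looks for per-controller directories under CgroupMountpoint, the same layout it uses under the default /sys/fs/cgroup):

mkdir -p /mnt/slurmcg/cpuset
mount -t cgroup -o cpuset,cpuset_v2_mode none /mnt/slurmcg/cpuset
# then in cgroup.conf on the test nodes:
#   CgroupMountpoint=/mnt/slurmcg
# (with CgroupAutomount=yes Slurm should mount any other controllers it needs under the same root)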
Comment 52 Felip Moll 2018-11-27 16:19:51 MST
(In reply to John Hanks from comment #51)
> It seems that it is possible to mount cgroups with the v2 option at another
> location,
> 
> [root@sgiuv300-srcf-d10-01 ~]# mount -t cgroup -o cpuset,cpuset_v2_mode none
> /mnt/cgroup2/
> [root@sgiuv300-srcf-d10-01 ~]# ls /mnt/cgroup2/
> cgroup.clone_children  cpuset.cpu_exclusive   cpuset.mem_exclusive   
> cpuset.memory_pressure_enabled  cpuset.sched_load_balance        slurm
> cgroup.event_control   cpuset.cpus            cpuset.mem_hardwall    
> cpuset.memory_spread_page       cpuset.sched_relax_domain_level  tasks
> cgroup.procs           cpuset.effective_cpus  cpuset.memory_migrate  
> cpuset.memory_spread_slab       notify_on_release
> cgroup.sane_behavior   cpuset.effective_mems  cpuset.memory_pressure 
> cpuset.mems                     release_agent
> 
> Based on that I'm going to test building a separate CgroupMountpoint=
> location for SLURM to use and if it works, will split the cluster between
> using v1 and v2 and see if there is a difference in outcomes.

Hi John,

Were you able to test the black magic here?
Do you have any news?
Comment 53 John Hanks 2018-11-28 09:37:49 MST
I haven't had a chance to dig deeper lately and it'll probably be a few weeks before I have a chance to make any real progress, unless of course this begins increasing in frequency and becomes the squeakiest wheel.
Comment 54 Felip Moll 2018-12-20 06:53:30 MST
(In reply to John Hanks from comment #53)
> I haven't had a chance to dig deeper lately and it'll probably be a few
> weeks before I have a chance to make any real progress, unless of course
> this begins increasing in frequency and becomes the squeakiest wheel.

Hi,

It's been a bit since your last comment. Do you have any news? Were you able
to test the cgroup mount option?

Thanks!
Felip
Comment 55 John Hanks 2018-12-20 08:07:14 MST
Hi Felip,

In the interim we've upgraded to 18.08.4 and CentOS 7.6, and the problem hasn't shown up since those updates. Things are also winding down here for the winter break, so it's unlikely this will see any more activity this year unless it suddenly becomes a chronic problem.

jbh
Comment 56 Felip Moll 2018-12-20 10:48:59 MST
(In reply to John Hanks from comment #55)
> Hi Felip,
> 
> In the interim we've upgraded to 18.08.4 and CentOS 7.6 and since then, the
> problem hasn't been showing up since those updates. Things are also winding
> down here for the winter break, so it's unlikely this will see any more
> activity this year unless it suddenly becomes a chronic problem.
> 
> jbh

Ok,

I will keep this one open until, let's say, 17th January, and if it
doesn't show up again we will close it.

I am not sure the upgrade has fixed the issue, but since you also upgraded
the OS, it could have.

Do you agree?
Comment 57 John Hanks 2018-12-20 10:53:05 MST
(In reply to Felip Moll from comment #56)
> I will keep this one open until, let's say, 17th January, and if it
> doesn't show up again we will close it.
> 
> I am not sure the upgrade have fixed the issue, but if you also upgraded
> the OS it could.
> 
> Do you agree?

Yes, it seems like the problem is some corner case with the kernel and the way Slurm uses cgroups, so it's entirely possible the kernel update fixed it. It's also possible the user(s) with the jobs that triggered it are already on break :) It's all speculation until I track down a repeatable reproducer test case.
Comment 58 Felip Moll 2019-01-11 04:28:07 MST
(In reply to John Hanks from comment #57)
> (In reply to Felip Moll from comment #56)
> > I will keep this one open until, let's say, 17th January, and if it
> > doesn't show up again we will close it.
> > 
> > I am not sure the upgrade have fixed the issue, but if you also upgraded
> > the OS it could.
> > 
> > Do you agree?
> 
> Yes, it seems like the problem is some corner case with the kernel and the
> way Slurm uses cgroups, so it's entirely possible the kernel update fixed
> it. It's also possible the user(s) with the jobs that triggered it are
> already on break :) It's all speculation until I track down a repeatable
> reproducer test case.

Hey John,

First of all, happy new year!

I revisited this bug because it turns out we have another internal one
where slurmstepd processes get stuck, are then killed by Slurm after
a timeout, and may leave sleep processes behind. It's similar in some ways
to your issue, but we haven't managed to diagnose it yet. It seems related to
the PrologFlags=contain parameter.

Have you seen any other occurrence of this issue?
Comment 59 John Hanks 2019-01-11 08:06:29 MST
> Hey John,
> 
> First of all, happy new year!
> 
> I revisited this bug because it turns out we have another internal one
> where slurmstepd processes are stuck, which then are killed by Slurm after
> a timeout, and may leave sleep processes behind. It's similar in some way
> to your issue but we haven't managed to diagnose yet. It seems related to
> the PrologFlags=contain parameter.
> 
> Have you seen any other occurrence of this issue?

Hi Felip,

I must shamefully admit that this did recur over the holiday break and I just rebooted the affected node and quickly went back to gluttonous overeating, pretending like there was no problem that couldn't be solved by more mashed potatoes. 

Our PrologFlags are:

PrologFlags=Alloc,Contain,X11

so these may be the same issue.

jbh
Comment 60 Felip Moll 2019-02-14 10:30:08 MST
(In reply to John Hanks from comment #59)
> > Hey John,
> > 
> > First of all, happy new year!
> > 
> > I revisited this bug because it turns out we have another internal one
> > where slurmstepd processes are stuck, which then are killed by Slurm after
> > a timeout, and may leave sleep processes behind. It's similar in some way
> > to your issue but we haven't managed to diagnose yet. It seems related to
> > the PrologFlags=contain parameter.
> > 
> > Have you seen any other occurrence of this issue?
> 
> Hi Felip,
> 
> I must shamefully admit that this did recur over the holiday break and I
> just rebooted the affected node and quickly went back to gluttonous
> overeating, pretending like there was no problem that couldn't be solved by
> more mashed potatoes. 
> 
> Our PrologFlags are:
> 
> PrologFlags=Alloc,Contain,X11
> 
> so these may be the same issue.
> 
> jbh

Hi John,

I escalated this issue to my team and we took a look at it. We all think this must
be a kernel or a ZFS bug, because in your situation we see processes hung in D state.
If there are D-state processes, as I commented before, there is no way for Slurm to
kill them.

I see you have the zfs module loaded; are you using ZFS filesystems? Are the affected
processes touching files on those filesystems?

We have some concerns about the scalability of ZFS with a large number of cores.
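
If it helps narrow this down, next time you catch one you could dump the kernel-side stacks of the stuck tasks. A rough sketch (needs root; <PID> is whatever D-state pid you find, and the sysrq variant assumes sysrq is enabled on the node):

# list D-state processes and the kernel function they are sleeping in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# kernel stack of one stuck task
cat /proc/<PID>/stack

# or dump all blocked tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

That should tell us whether they are sitting in zfs, nfs or cgroup code.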
Comment 61 John Hanks 2019-02-14 10:37:35 MST
This is a valid theory: we do use ZFS on all compute nodes, and the /tmp space for each user is on a ZFS filesystem. That doesn't guarantee that the affected processes are using ZFS, but it does make it possible/likely. And regardless of how it's triggered, it does seem like a kernel bug at this point.

I'm still waiting for a recurrence or a repeatable reproducer to appear, and will add this to the list of things to check if/when the problem appears again.

jbh
Comment 62 Felip Moll 2019-02-14 10:53:54 MST
(In reply to John Hanks from comment #61)
> This is a valid theory, we do use ZFS on all compute nodes and the /tmp
> space for each user is on a zfs filesystem. That doesn't guarantee that
> affected processes are using ZFS, but it does make it possible/likely. And
> regardless of how it's triggered it does seem like a kernel bug at this
> point. 
> 
> I'm still waiting for a re-occurrence or a repeatable reproducer to appear
> and will add this to the list of things to check if/when the problem appears
> again. 
> 
> jbh

One option would be to get a kernel dump and send it to SGI, though you also reproduced it on other (Dell) machines.

Unfortunately I think that analyzing a kernel dump is out of SchedMD's
support scope.

At this point, do you want to keep this bug open until it reproduces again,
or do you feel comfortable closing it now?

Thanks and sorry for the late response.
Comment 63 John Hanks 2019-02-14 11:00:21 MST
(In reply to Felip Moll from comment #62)
> 
> One option would be to get a kernel dump and sent it to SGI, though you
> reproduced it in other (Dell) machines.
> 
> Unfortunately I think that analyzing a kernel dump is out of SchedMD's
> support scope.
> 
> At this point, do you want to keep this bug open until it reproduces again,
> or otherwise you feel comfortable closing it now?
> 
> Thanks and sorry for the late response.

Sadly it's HPE now, not SGI, and they already beat you to the "unsupported configuration" punch line on both CentOS and ZFS. I think it's fine to close this bug; if/when it pops up again I'll need to recheck everything anyway, since we've updated both SLURM and kernels since it happened last.
Comment 64 Felip Moll 2019-02-14 11:29:29 MST
> Sadly it's HPE now, not SGI and they already beat you to the "unsupported
> configuration" punch line, on both CentOS and ZFS.

Yeah, I know :( , this is unfortunate.

> I think it's fine to
> close this bug, if/when it pops up again I'll need to recheck everything
> anyway since we've updated both SLURM and kernels since it happened last.

That's ok. Our other recommendation is to try moving /tmp from ZFS to another
filesystem, at least just to test and rule things out. I know this
sounds like trial and error, but given the D states I think it is worth trying.
An lsof of the stuck processes could also help.

As you know, just reopen this ticket if it reproduces again.
Comment 65 John Hanks 2019-03-05 07:55:54 MST
Like any good zombie story, there's always another season.

This morning I have several nodes on which slurmd has died and fails to restart:

[root@dper7425-srcf-d15-19 ~]# slurmd -D -v -v -v -v -v -v -v
slurmd: debug:  Log file re-opened
slurmd: debug4: found jobid = 7273685, stepid = 4294967294
slurmd: debug4: found jobid = 7271593, stepid = 4294967294
slurmd: debug4: found jobid = 7271594, stepid = 4294967294
slurmd: debug4: found jobid = 7271543, stepid = 4294967294
slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.5/lib/slurm/auth_munge.so
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug3: Success.

it never gets any further than that. gdb and a bt:


[root@dper7425-srcf-d15-19 ~]# gdb `which slurmd`
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /scg/slurm/installed/slurm-18.08.5/sbin/slurmd...done.
(gdb) set args -D -v -v -v -v -v -v -v
(gdb) run
Starting program: /scg/slurm/current/sbin/slurmd -D -v -v -v -v -v -v -v
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
slurmd: debug:  Log file re-opened
slurmd: debug4: found jobid = 7273685, stepid = 4294967294
slurmd: debug4: found jobid = 7271593, stepid = 4294967294
slurmd: debug4: found jobid = 7271594, stepid = 4294967294
slurmd: debug4: found jobid = 7271543, stepid = 4294967294
slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.5/lib/slurm/auth_munge.so
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug3: Success.
^C
Program received signal SIGINT, Interrupt.
0x00002aaaac3569c0 in __connect_nocancel () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 glibc-2.17-260.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libssh2-1.4.3-12.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.7.5-2.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pam-1.1.8-22.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.2-13.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) thread apply all bt full

Thread 1 (Thread 0x2aaaaaadb700 (LWP 126544)):
#0  0x00002aaaac3569c0 in __connect_nocancel () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002aaaab24437c in _step_connect (stepid=4294967294, jobid=7273685, nodename=0x692fa0 "dper7425-srcf-d15-19", 
    directory=0x692f70 "/var/spool/slurm/slurmd") at stepd_api.c:173
        fd = 5
        len = <optimized out>
        addr = {sun_family = 1, 
          sun_path = "/var/spool/slurm/slurmd/dper7425-srcf-d15-19_7273685.4294967294", '\000' <repeats 44 times>}
        name = 0x693a60 "/var/spool/slurm/slurmd/dper7425-srcf-d15-19_7273685.4294967294"
#2  stepd_connect (directory=0x692f70 "/var/spool/slurm/slurmd", nodename=0x692fa0 "dper7425-srcf-d15-19", jobid=7273685, 
    stepid=4294967294, protocol_version=protocol_version@entry=0x68ff58) at stepd_api.c:273
        req = 0
        fd = -1
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x693c80
        len = 0
        __func__ = "stepd_connect"
#3  0x000000000040b955 in _update_logging () at slurmd.c:2027
        fd = <optimized out>
        steps = 0x638ec0
        i = 0x669830
        stepd = 0x68ff40
        o = <optimized out>
        cf = <optimized out>
#4  0x000000000040d0d0 in _read_config () at slurmd.c:921
        path_pubkey = 0x0
        cf = 0x633f40 <slurmctld_conf>
        cc = <optimized out>
        cr_flag = true
        gang_flag = false
        tok = <optimized out>
        save_ptr = 0x0
        over_memory_kill = false
        __func__ = "_read_config"
#5  0x000000000040eb06 in _slurmd_init () at slurmd.c:1545
        rlim = {rlim_cur = 0, rlim_max = 0}
        stat_buf = {st_dev = 3833183645592809572, st_ino = 3559019523307893549, st_nlink = 18446742975157447989, 
---Type <return> to continue, or q <return> to quit---
          st_mode = 4294967295, st_uid = 4294967295, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, 
          st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, 
            tv_nsec = 0}, __unused = {0, 0, 0}}
        cpu_cnt = <optimized out>
#6  main (argc=9, argv=0x7fffffffdd98) at slurmd.c:261
        i = <optimized out>
        pidfd = <optimized out>
        blocked_signals = {13, 0}
        cc = <optimized out>
        oom_value = <optimized out>
        slurmd_uid = 0
        curr_uid = 0
        time_stamp = '\000' <repeats 32 times>, "\377\377\377\377\377\377\377\377\020\211~\\\000\000\000\000\345\332R\031\000\000\000\000\374\366\377\377\377\037\000\000\337/l\253\252*\000\000\000\000\000\000\000\000\000\000\300\334\377\377\377\177\000\000Ѕc\000\000\000\000\000\360\333\377\377\377\177\000\000\065&l\253\252*\000\000\377\000\000\000\000\000\000\000\360\001\000\000\000\000\000\000\002\000\000\000\000\000\000\000\006\000\000\000\000\000\000\200", '\000' <repeats 56 times>, "\200"...
        __func__ = "main"


A pstree shows the slurmstepds have lost their children:

           ├─slurmstepd(6678)─┬─{slurmstepd}(6679)
           │                  ├─{slurmstepd}(6680)
           │                  ├─{slurmstepd}(6681)
           │                  ├─{slurmstepd}(6684)
           │                  └─{slurmstepd}(3829)
           ├─slurmstepd(6683)─┬─{slurmstepd}(6687)
           │                  └─{slurmstepd}(3842)
           ├─slurmstepd(107340)─┬─{slurmstepd}(107343)
           │                    └─{slurmstepd}(3836)
           ├─slurmstepd(531)─┬─{slurmstepd}(540)
           │                 ├─{slurmstepd}(541)
           │                 ├─{slurmstepd}(543)
           │                 ├─{slurmstepd}(546)
           │                 ├─{slurmstepd}(126253)
           │                 └─{slurmstepd}(71084)
           ├─slurmstepd(2977)─┬─{slurmstepd}(2986)
           │                  ├─{slurmstepd}(2987)
           │                  ├─{slurmstepd}(2989)
           │                  ├─{slurmstepd}(2992)
           │                  ├─{slurmstepd}(6467)
           │                  └─{slurmstepd}(81942)
           ├─smartd(14458)
           ├─sshd(20298)───sshd(6038)───bash(6087)───pstree(100405)
           ├─sssd(20290)─┬─sssd_autofs(20294)
           │             ├─sssd_be(20291)
           │             ├─sssd_nss(20292)
           │             └─sssd_pam(20293)
           ├─starter-suid(6952)─┬─call_vars_deepv(7102)───call_variants(51944)───python(52011)───python(52203)─┬─{python}(76410)
           │                    │                                                                              ├─{python}(76411)
           │                    │                                                                              ├─{python}(76412)
           │                    │                                                                              ├─{python}(76413)
           │                    │                                                                              ├─{python}(76414)
           │                    │                                                                              ├─{python}(76415)
           │                    │                                                                              ├─{python}(76416)
           │                    │                                                                              ├─{python}(76417)
           │                    │                                                                              ├─{python}(76418)
           │                    │                                                                              ├─{python}(76419)
           │                    │                                                                              └─{python}(76420)
           │                    ├─{starter-suid}(7105)
           │                    ├─{starter-suid}(7106)
           │                    ├─{starter-suid}(7107)
           │                    ├─{starter-suid}(7108)
           │                    ├─{starter-suid}(7110)
           │                    └─{starter-suid}(7115)
           ├─starter-suid(6953)─┬─call_vars_deepv(7081)───call_variants(51946)───python(52096)───python(52204)
           │                    └─{starter-suid}(7087)
           ├─starter-suid(107349)─┬─call_vars_deepv(107369)───postprocess_var(54867)───python(56259)───python(57586)───{python}(76641)
           │                      └─{starter-suid}(107373)


strace of a slurmstepd shows:


[root@dper7425-srcf-d15-19 ~]# strace -ff -p 6678
strace: Process 6678 attached with 6 threads
[pid  6684] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  6681] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6680] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6679] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6678] futex(0x2b547634d520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6684] <... restart_syscall resumed> ) = -1 ENOSYS (Function not implemented)
[pid  6681] <... futex resumed> )       = -1 ENOSYS (Function not implemented)
[pid  6680] <... futex resumed> )       = -1 ENOSYS (Function not implemented)
[pid  6679] <... futex resumed> )       = -1 ENOSYS (Function not implemented)
[pid  6678] <... futex resumed> )       = -1 ENOSYS (Function not implemented)



Attaching to those slurmstepd processes results in a hung gdb (killable with kill -9).

Looking into the /proc/PID of these and trying to ls /proc/PID/fd results in a hung ls process (not killable with kill -9).

There are no recent dmesg entries, at least none since these jobs started.

The last dmesg entry is from 02/28 at 4 am. The first of these jobs started later that day, around 9 am.

Every hung node I've looked at so far is running a similar job from the same user, so I'll follow up on what the job is doing and maybe we'll get a reproducer out of this. Since I can't attach successfully with gdb, and lsof, /proc, etc. are non-responsive, I'm rebooting these nodes to focus on reproducing it.

jbh
Comment 66 John Hanks 2019-03-05 09:57:58 MST
The jobs associated with these zombies all run a container using singularity, which binds several NFS filesystems. Looking across several of these jobs, I can see that some of the earlier-starting ones overlap with messages like:

[Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying
[Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying
[Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying
[Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
[Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
[Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK

and gsfs0 is one of the locations singularity is binding to:

singularity exec --bind /srv/gsfs0 --bind /ifs/lab/...

There is a decidedly p > 0 probability that previous jobs were also using singularity; at least one prior occurrence of this was with a user who runs a pipeline that depends on a couple of container images.

Other data: 

 - a mix of Intel and AMD EPYC nodes were affected.
 
 - these jobs are part of a batch of 600; about 8 or so triggered zombies, so ~1.3% failed. I'm not hopeful about getting a reproducer out of this.

 - the nfs chatter in dmesg was common in all cases.

 - around the time of at least some of these messages we did have a controller failure in the DDN/GPFS behind /srv/gsfs0, but it seemed to recover ok at the time and works fine on the other nodes.

Sorry for the unfiltered dump; I'm just trying not to overlook something that might be relevant. Is it possible the trigger for this could be

if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and NFS chokes) then hang/zombies? 

The number of things that would have to come together at the same time makes the low incidence rate among jobs that potentially meet these conditions seem reasonable.
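
For what it's worth, the reproducer I'd try to build would look roughly like this (the image path and scratch directory are made up, and the NFS "choke" would have to be simulated, e.g. by blocking the server with iptables or by waiting for a real hiccup):

#!/bin/bash
#SBATCH --time=30:00
# hypothetical reproducer sketch: keep heavy I/O going to an NFS bind mount
# from inside a singularity container, then stall the NFS server mid-job
singularity exec --bind /srv/gsfs0 /path/to/pipeline.simg \
    bash -c 'while true; do dd if=/dev/zero of=/srv/gsfs0/scratch/$SLURM_JOB_ID.tmp bs=1M count=100; sync; done'

The loop runs until the job hits its time limit or gets scancel'd, which is exactly the point where the stepd has to clean up, so if the NFS stall lines up with that window I'd expect it to wedge.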
Comment 67 John Hanks 2019-03-05 13:08:02 MST
Another possible clue: this message seems to be suspiciously present around the times we've seen this problem, and it shows up in old logs with a similar occurrence rate, which is to say weeks go by without it and then a few jobs get hit:

Mar  1 15:06:32 frankie slurmctld[176615]: error: A non superuser 324759 tried to complete batch JobId=7271361

It's happening right now to two jobs that look like this on the node:

[root@sgisummit-frcf-111-20 ~]# pstree -u dwyman
slurmstepd───6*[{slurmstepd}]

slurmstepd───6*[{slurmstepd}]

If I attach with gdb, it hangs and I can't interrupt it to get a bt. strace shows (in a repeating loop):

[root@sgisummit-frcf-111-20 proc]# strace -ff -p 221000
strace: Process 221000 attached with 6 threads
[pid 42795] futex(0x6372c0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 129792] restart_syscall(<... resuming interrupted nanosleep ...> <unfinished ...>
[pid 221027] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 221018] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 221015] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 221014] futex(0x2b005bab53fc, FUTEX_WAIT_PRIVATE, 503, NULL <unfinished ...>
[pid 221015] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15019, {1551814838, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15021, {1551814839, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15023, {1551814840, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15025, {1551814841, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15027, {1551814842, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15029, {1551814843, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15031, {1551814844, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15033, {1551814845, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15035, {1551814846, 782476000}, ffffffff <unfinished ...>
[pid 129792] <... restart_syscall resumed> ) = 0
[pid 129792] stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=74, ...}) = 0
[pid 129792] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 17
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] connect(17, {sa_family=AF_INET, sin_port=htons(6923), sin_addr=inet_addr("10.210.15.254")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 129792] poll([{fd=17, events=POLLIN|POLLOUT}], 1, 2000) = 1 ([{fd=17, revents=POLLOUT}])
[pid 129792] getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] fcntl(17, F_SETFD, FD_CLOEXEC) = 0
[pid 129792] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] stat("/var/run/munge/munge.socket.2", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
[pid 129792] socket(AF_LOCAL, SOCK_STREAM, 0) = 18
[pid 129792] fcntl(18, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] connect(18, {sa_family=AF_LOCAL, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
[pid 129792] writev(18, [{"\0`mK\4\2\0\0\0\0\24", 11}, {"\1\1\1\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 20}], 2) = 31
[pid 129792] read(18, "\0`mK\4\3\0\0\0\0\206", 11) = 11
[pid 129792] read(18, "\0\0\0\0\0\200MUNGE:AwQDAACx4qJScnazKTt7"..., 134) = 134
[pid 129792] close(18)                  = 0
[pid 129792] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLOUT}], 1, 60000) = 1 ([{fd=17, revents=POLLOUT}])
[pid 129792] recvfrom(17, 0x2b0068e20950, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 129792] sendto(17, "\0\0\5H", 4, 0, NULL, 0) = 4
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLOUT}], 1, 60000) = 1 ([{fd=17, revents=POLLOUT}])
[pid 129792] recvfrom(17, 0x2b0068e20950, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 129792] sendto(17, "!\0\0\0\0\0\23\232\0\0\4\233\0\0\0\0\0\0\0\0\0\0\0\0\0\vauth/m"..., 1352, 0, NULL, 0) = 1352
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLIN}], 1, 60000) = 1 ([{fd=17, revents=POLLIN}])
[pid 129792] recvfrom(17, "", 4, 0, NULL, NULL) = 0
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] close(17)                  = 0
[pid 129792] socket(AF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 17
[pid 129792] connect(17, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = 0
[pid 129792] sendto(17, "<14>Mar  5 11:40:46 slurmstepd[2"..., 88, MSG_NOSIGNAL, NULL, 0) = 88
[pid 129792] close(17)                  = 0
[pid 129792] rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT USR1 USR2 PIPE ALRM TERM TSTP], 8) = 0
[pid 129792] rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] rt_sigprocmask(SIG_SETMASK, [HUP INT QUIT USR1 USR2 PIPE ALRM TERM TSTP], NULL, 8) = 0
[pid 129792] nanosleep({15, 0},  <unfinished ...>
[pid 221015] <... futex resumed> )      = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15037, {1551814847, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15039, {1551814848, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15041, {1551814849, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0

lsof on the node doesn't work, but /proc/PID/fd for the slurmstepd and its children looks (in general) like:

[root@sgisummit-frcf-111-20 fd]# ls -al
total 0
dr-x------ 2 root   root        0 Mar  5 11:37 .
dr-xr-xr-x 9 dwyman upg_dwyman  0 Mar  5 09:35 ..
lrwx------ 1 root   root       64 Mar  5 11:37 0 -> /dev/null
lrwx------ 1 root   root       64 Mar  5 11:37 1 -> /dev/null
lr-x------ 1 root   root       64 Mar  5 11:37 10 -> /sys/fs/cgroup/memory/slurm/uid_34470/job_7358408/step_batch/memory.oom_control
l-wx------ 1 root   root       64 Mar  5 11:37 11 -> /sys/fs/cgroup/memory/slurm/uid_34470/job_7358408/step_batch/cgroup.event_control
lrwx------ 1 root   root       64 Mar  5 11:37 12 -> anon_inode:[eventfd]
lr-x------ 1 root   root       64 Mar  5 11:37 13 -> pipe:[31120943]
l-wx------ 1 root   root       64 Mar  5 11:37 14 -> pipe:[31120943]
lr-x------ 1 root   root       64 Mar  5 11:37 15 -> /dev/null
lrwx------ 1 root   root       64 Mar  5 11:37 2 -> /dev/null
lr-x------ 1 root   root       64 Mar  5 11:37 3 -> /var/lib/sss/mc/passwd
lr-x------ 1 root   root       64 Mar  5 11:37 4 -> pipe:[31111662]
l-wx------ 1 root   root       64 Mar  5 11:37 5 -> pipe:[31111662]
lrwx------ 1 root   root       64 Mar  5 11:37 6 -> socket:[31111663]
lr-x------ 1 root   root       64 Mar  5 11:37 7 -> pipe:[31111664]
l-wx------ 1 root   root       64 Mar  5 11:37 8 -> pipe:[31111664]
lrwx------ 1 root   root       64 Mar  5 11:37 9 -> socket:[31121150]


This node has several jobs that are running normally on it, so I'm going to drain it and watch this to see what happens.
Comment 68 John Hanks 2019-03-05 13:33:47 MST
The two jobs from the previous comment eventually finished; the slurm logs that mention the job ids in question look like:

Mar  5 11:44:32 frankie slurmctld[69336]: Resending TERMINATE_JOB request JobId=7358407 Nodelist=sgisummit-frcf-111-20
Mar  5 11:44:32 frankie slurmctld[69336]: Resending TERMINATE_JOB request JobId=7358408 Nodelist=sgisummit-frcf-111-20
Mar  5 11:44:46 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:44:46 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:45:01 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:45:01 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:45:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:45:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:45:16 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:45:16 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:45:16 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:45:16 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:45:31 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:45:31 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:45:31 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:45:31 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:45:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:45:46 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:45:46 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:45:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:46:01 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar  5 11:46:01 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar  5 11:46:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar  5 11:46:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar  5 11:46:16 sgisummit-frcf-111-20 slurmstepd[221000]: error: rmdir(/var/spool/slurm/slurmd/job7358408): Permission denied
Mar  5 11:46:16 sgisummit-frcf-111-20 slurmstepd[221023]: error: rmdir(/var/spool/slurm/slurmd/job7358407): Permission denied
Mar  5 11:46:20 sgisummit-frcf-111-20 slurmd[94054]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job7358407/slurm_script
Mar  5 11:46:20 sgisummit-frcf-111-20 slurmd[94054]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job7358408/slurm_script
Mar  5 11:46:20 frankie slurmctld[69336]: cleanup_completing: JobId=7358407 completion process took 4246 seconds
Mar  5 11:46:20 frankie slurmctld[69336]: cleanup_completing: JobId=7358408 completion process took 4246 seconds

jbh
Comment 69 Felip Moll 2019-03-05 14:04:46 MST
> [Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK

That's not good indeed.

> 
> if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and
> NFS chokes) then hang/zombies? 
> 
> The number of things that would have to come together  there at the same
> time make the low incident rate for any set of jobs that potentially meet
> these conditions seem reasonable.

To me, this makes perfect sense. All indications point to some I/O problem: D processes,
gdb not able to attach, slurmd not proceeding, and even a hung 'ls'.

At this point I think this is not a Slurm issue, and it seems to go beyond what
we (SchedMD) can do to diagnose. The suggestion is to monitor your filesystems, be it ZFS
or NFS. Maybe you can try to force it to happen by stressing NFS and see what happens.

What do the singularity logs say? Is there any way to easily play with these containers and try
to isolate this from slurm?


> Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
> to complete batch JobId=7358407
> Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried

Have you added nodes to the system recently? If so, you need to restart all your slurmd daemons.
If not, can you 'getent passwd' UID 34470? Are you using sssd?
Comment 70 John Hanks 2019-03-05 14:19:27 MST
(In reply to Felip Moll from comment #69)
> > [Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> 
> That's not good indeed.
> 
> > 
> > if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and
> > NFS chokes) then hang/zombies? 
> > 
> > The number of things that would have to come together  there at the same
> > time make the low incident rate for any set of jobs that potentially meet
> > these conditions seem reasonable.
> 
> To me, this makes perfect sense. All indications points to some I/O problem:
> D processes,
> gdb not able to attach, slurmd not proceeding and even a hung 'ls'.. 
> 
> At this point I think that it is not a Slurm issue and it seems it goes
> beyond of what
> we (SchedMD) can do to diagnose it. The suggestion is to monitor your
> filesystem, be it ZFS
> or NFS. Maybe you can try to force it to happen just stressing NFS and see
> what happens.

I'm not sure it's that straightforward. When I check these nodes, the filesystems in question all work fine, and the output from the jobs shows no indication that there were any I/O issues. The timing of these NFS messages in dmesg is circumstantial and suspicious, but no processes show I/O problems except (potentially) the hung ones. The filesystems used by the jobs associated with the hung slurmstepd processes all work fine while those processes remain hung.

None of these recent jobs had any associated zfs/kernel dumps in dmesg. And I hate containers and NFS, so I'd love to blame them or even systemd, but the common thread here seems to be slurmstepd. All these other things work fine across all our systems except in these rare flurries that occur in batch jobs. 
 
> What do singularity logs says? Is there any way to easily play with these
> containers and try to isolate
> it from slurm?

At least for this recent workflow, the user tests it extensively on login nodes and via interactive srun jobs before submitting it en masse, and has not had this problem while working on the pipeline.

 
> > Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
> > to complete batch JobId=7358407
> > Mar  5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
> 
> Have you added nodes to the system recently? If so you need to restart all
> your slurmd daemons.
> If not, 34470, can you 'getent passwd' it? Are you using sssd?

I did add a node recently, but these messages appear in the logs from before I added that node, and also from after my last node change several months ago.

sssd is in use, but it is working fine on the node (there were several other jobs running normally, and logins, etc. work fine).
Comment 71 John Hanks 2019-03-05 14:22:56 MST
Also, if there is a recipe I can follow for backtraces, core dumps, etc. that would be more helpful, I'm happy to collect more info that way.
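
To be concrete about what I can grab today: when a stepd still responds to ptrace, something along these lines should be collectable (gcore ships with gdb; the sysctl/ulimit lines are just to confirm cores would land somewhere useful). For the ones stuck in D state none of this works, which is why a recipe from your side would help:

# full backtrace of all threads, non-interactively
gdb -p <PID> -batch -ex 'thread apply all bt full' > stepd_bt.<PID>.txt

# snapshot a core of the still-running process
gcore -o stepd_core <PID>

# check where kernel-generated cores would go and whether they're enabled
sysctl kernel.core_pattern
ulimit -c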
Comment 72 John Hanks 2019-03-13 09:53:37 MDT
We have another data point to report on this; these messages caught my attention:

Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137198 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137199 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137200 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137201 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137202 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137203 Nodelist=dper7425-srcf-d15-27


Upon closer inspection these are more hung processes. It looks like at some point slurm does manage to mark things in this state as complete and move on, since on this node I found some older ghosts:

[root@dper7425-srcf-d15-27 ~]# ps aux | grep USERNAME
USERNAME   14113  0.0  0.0 420256  4316 ?        D    Feb27   0:00 slurmstepd: [7269234.batch]
USERNAME   14163  0.0  0.0 420256  5000 ?        D    Feb27   0:00 slurmstepd: [7269235.batch]
USERNAME   14230  0.0  0.0 420256  4852 ?        D    Feb27   0:00 slurmstepd: [7269236.batch]
USERNAME   14269  0.0  0.0 420256  6164 ?        D    Feb27   0:00 slurmstepd: [7269237.batch]
USERNAME   14319  0.0  0.0 420256  4144 ?        D    Feb27   0:00 slurmstepd: [7269238.batch]
USERNAME   33348  0.0  0.0 420260  4884 ?        D    Mar05   0:00 slurmstepd: [7549600.batch]
USERNAME   33385  0.0  0.0 420260  5848 ?        D    Mar05   0:00 slurmstepd: [7549601.batch]
USERNAME   33412  0.0  0.0 420260  6784 ?        D    Mar05   0:00 slurmstepd: [7549602.batch]
USERNAME   33443  0.0  0.0 420260  6780 ?        D    Mar05   0:00 slurmstepd: [7549603.batch]
USERNAME   61561  0.0  0.0 353700  7348 ?        D    Mar11   0:00 slurmstepd: [8137199.batch]
USERNAME   61582  0.0  0.0 353700  7352 ?        D    Mar11   0:00 slurmstepd: [8137198.batch]
USERNAME   61608  0.0  0.0 353700  7344 ?        D    Mar11   0:00 slurmstepd: [8137201.batch]
USERNAME   61633  0.0  0.0 353700  7352 ?        D    Mar11   0:00 slurmstepd: [8137202.batch]
USERNAME   61652  0.0  0.0 353700  7352 ?        D    Mar11   0:00 slurmstepd: [8137200.batch]
USERNAME   61678  0.0  0.0 353700  7348 ?        D    Mar11   0:00 slurmstepd: [8137203.batch]
USERNAME  103421  0.0  0.0 353696  7252 ?        D    Mar08   0:00 slurmstepd: [8086937.batch]
USERNAME  103434  0.0  0.0 353696  7308 ?        D    Mar08   0:00 slurmstepd: [8086936.batch]

It turns out these jobs have no accounting history.

[I am root@frankie ~]# sacct -j 7269237
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[I am root@frankie ~]# sacct -j 7269238
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[I am root@frankie ~]# sacct -j 7269239
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[I am root@frankie ~]# sacct -j 8086936
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[I am root@frankie ~]# sacct -j 8086937
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 


In previous cases I hadn't thought to look at the accounting history for the zombies, but looking back now at one of the oldest reported examples:

[I am root@frankie ~]# sacct -j 3092804
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[I am root@frankie ~]# 


Looking the node and processes over, there are no NFS errors in dmesg around the times associated with these jobs, and the children of the slurmstepd processes in question are all gone. As per the usual behavior, I can't interrupt them to get a backtrace:

[root@dper7425-srcf-d15-27 ~]# gdb -p 103421 `which slurmstepd`
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /scg/slurm/installed/slurm-18.08.6/sbin/slurmstepd...done.
Attaching to program: /scg/slurm/current/sbin/slurmstepd, process 103421

^C^C^C^C^C^C^C^C


And strace is a dead end:

[root@dper7425-srcf-d15-27 ~]# strace -ff -p 103421
strace: Process 103421 attached

kill -11 PID for these does not produce a core dump.

If we can't figure out who/what to blame for why slurmd/slurmstepd is abandoning ship in these cases, can we at least agree that they need to be accounted for?
Comment 73 Felip Moll 2019-03-14 04:49:35 MDT
John, I am working on your bug again today/tomorrow and will also have the rest of the team look at it.

Will come back when I have relevant info.
Comment 74 Felip Moll 2019-03-14 07:03:12 MDT
I'm starting to think that this is a race with: 

ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup

and not really your problem. 

I have other similar bugs with the same symptoms, but until today I hadn't seen
zombies at other customers' sites. Now one of them also reports zombies.

In any case it seems to happen only on a specific set of machines; I cannot
reproduce it, while one person is reproducing it consistently.

Will come back to you.
Comment 75 John Hanks 2019-03-14 07:33:40 MDT
Hi Felip,

Thanks for looking into it. I went back and counted these (based on the log messages and lack of accounting), and we've had this occur ~4000 times out of our 8.25 million jobs during that period. Since I've only noticed maybe 100 of those over that time, I'm assuming most of these were either cleaned up by an automatic process I use to drain/reboot nodes with job-complete errors, or just wound up as unnoticed processes in D state.

If there's something I can test or collect on my end, happy to oblige, but without a solid reproducer it'll just be more statistical sampling. 

jbh
Comment 76 Felip Moll 2019-03-15 09:45:18 MDT
John,

Can you show me the output from a node of:

]$ mount |grep cgroup
Comment 77 John Hanks 2019-03-15 10:56:03 MDT
(In reply to Felip Moll from comment #76)
> John,
> 
> Can you show me the output from a node of:
> 
> ]$ mount |grep cgroup

From random node:

[root@dper7425-srcf-d15-09 ~]# mount |grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Comment 78 John Hanks 2019-03-20 13:40:36 MDT
Another round of possibly affected jobs, at least to the point of a collection of slurmstepd processes that have been abandoned. This time, however, there is a backtrace.

Caveat: I did add a new node and partition to Slurm this morning. I don't know yet whether these were already off the rails before then or I just noticed them afterward.

jbh


(gdb) thread apply all bt full

Thread 3 (Thread 0x2b7be2beb700 (LWP 119262)):
#0  0x00002b7bdd01b20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7bdbe25d77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b7be40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x14158c0) at eio.c:328
        retval = 0
        pollfds = 0x2b7be40008d0
        map = 0x2b7be4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d98a in _msg_thr_internal (job_arg=0x13dda30) at req.c:273
        job = 0x13dda30
#4  0x00002b7bdcd13dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002b7bdd025ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b7be3b1e700 (LWP 119296)):
#0  0x00002b7bdd01b20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7be3206942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x13caa90
        ret = <optimized out>
        res = 1
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002b7bdcd13dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b7bdd025ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2b7bdbad4080 (LWP 119202)):
#0  0x00002b7bdcfece2d in nanosleep () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7bdcfeccc4 in sleep () from /usr/lib64/libc.so.6
No symbol table info available.
#2  0x000000000040da68 in _one_step_complete_msg (job=job@entry=0x13dda30, first=0, last=0) at mgr.c:886
        req = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, 
          auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = -1, data = 0x7ffc02ba4ac0, data_size = 0, 
          flags = 0, msg_index = 0, msg_type = 5016, protocol_version = 65534, forward = {cnt = 0, init = 65534, nodelist = 0x0, 
            timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {
              s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        msg = {job_id = 8370873, job_step_id = 4294967295, range_first = 0, range_last = 0, step_rc = 0, jobacct = 0x140f9f0}
        rc = -1
        retcode = <optimized out>
        i = 184
        port = 0
        ip_buf = "0.0.0.0\000\000\000\000\000\000\000\000"
        acct_sent = true
#3  0x000000000040faeb in stepd_send_step_complete_msgs (job=job@entry=0x13dda30) at mgr.c:953
        start = 0
        size = 0
        first = -1
        last = -1
        sent_own_comp_msg = false
        __func__ = "stepd_send_step_complete_msgs"
#4  0x00000000004109ab in _spawn_job_container (job=0x13dda30) at mgr.c:1223
        jobacct_id = {taskid = 0, nodeid = 0, job = 0x13dda30}
        status = 15
        pid = 119298
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 1306}, {ru_maxrss = 1800, 
            __ru_maxrss_word = 1800}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, 
            __ru_isrss_word = 0}, {ru_minflt = 267, __ru_minflt_word = 267}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, 
            __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {
            ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, 
            __ru_nsignals_word = 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}
        rc = 0
        x11_pipe = {0, 0}
#5  job_manager (job=job@entry=0x13dda30) at mgr.c:1292
        rc = 0
        io_initialized = false
        ckpt_type = 0x13c59d0 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#6  0x000000000040d452 in main (argc=1, argv=0x7ffc02ba4e78) at slurmstepd.c:174
        cli = 0x13d4b70
        self = 0x13d4b40
        msg = 0x13dd340
        rc = 0
        launch_params = 0x0
        __func__ = "main"


(gdb) where
#0  0x00002b7bdcfece2d in nanosleep () from /usr/lib64/libc.so.6
#1  0x00002b7bdcfeccc4 in sleep () from /usr/lib64/libc.so.6
#2  0x000000000040da68 in _one_step_complete_msg (job=job@entry=0x13dda30, first=0, last=0) at mgr.c:886
#3  0x000000000040faeb in stepd_send_step_complete_msgs (job=job@entry=0x13dda30) at mgr.c:953
#4  0x00000000004109ab in _spawn_job_container (job=0x13dda30) at mgr.c:1223
#5  job_manager (job=job@entry=0x13dda30) at mgr.c:1292
#6  0x000000000040d452 in main (argc=1, argv=0x7ffc02ba4e78) at slurmstepd.c:174


From dmesg I see:

[Wed Mar 20 09:22:24 2019] slurmstepd[119463]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce1328040 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[119334]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce132a1b0 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[120270]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d47390 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[120200]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d49500 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
[Wed Mar 20 09:22:25 2019] slurmstepd[122299]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf6819d20 error 4 in libslurmfull.so[2acf01078000+1b3000]
[Wed Mar 20 09:22:25 2019] slurmstepd[121978]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf681be90 error 4 in libslurmfull.so[2acf01078000+1b3000]
Comment 79 Felip Moll 2019-03-21 04:02:45 MDT
Thanks, I am currently investigating the race condition.

One thing you could temporarily test is to change proctrack to linuxproc, in order to see if the error goes away. This has implications, but it would give us a good hint toward concluding whether it really is a race condition between these plugins.
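
Concretely, the test would be something like this in slurm.conf (the slurmd daemons need a restart for a ProctrackType change to take effect; keep in mind that linuxproc only tracks the parent/child process tree, so daemonizing processes could escape the job while the test is running):

# temporary test, instead of proctrack/cgroup
ProctrackType=proctrack/linuxproc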


> 
> [Wed Mar 20 09:22:24 2019] slurmstepd[119463]: segfault at 0 ip
> 00002ba7bc926b2b sp 00007ffce1328040 error 4 in
> libslurmfull.so[2ba7bc8b6000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[119334]: segfault at 0 ip
> 00002ba7bc926b2b sp 00007ffce132a1b0 error 4 in
> libslurmfull.so[2ba7bc8b6000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[120270]: segfault at 0 ip
> 00002b0d19130b2b sp 00007ffc48d47390 error 4 in
> libslurmfull.so[2b0d190c0000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[120200]: segfault at 0 ip
> 00002b0d19130b2b sp 00007ffc48d49500 error 4 in
> libslurmfull.so[2b0d190c0000+1b3000]
> [Wed Mar 20 09:22:25 2019] slurmstepd[122299]: segfault at 0 ip
> 00002acf010e8b2b sp 00007ffcf6819d20 error 4 in
> libslurmfull.so[2acf01078000+1b3000]
> [Wed Mar 20 09:22:25 2019] slurmstepd[121978]: segfault at 0 ip
> 00002acf010e8b2b sp 00007ffcf681be90 error 4 in
> libslurmfull.so[2acf01078000+1b3000]

This seems like a different issue.

We should try to look for the code where it segfaulted:

In the libslurmfull.so[2acf01078000+1b3000] part, the 2acf01078000 is where the library was loaded. Subtract this from the instruction pointer (ip, 00002acf010e8b2b) and you'll get the offset (it gives me 70b2b) into the .so of the offending instruction. Then you can use objdump -DCgl libslurmfull.so and search for the instruction at that offset. We should easily be able to figure out which function it is from the asm labels.

We can also try using addr2line -j.text -e libslurmfull.so 70b2b, but because of compiler optimization it may not be correct.

https://stackoverflow.com/questions/2179403/how-do-you-read-a-segfault-kernel-log-message
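
For reference, a minimal sketch of the whole procedure as shell commands (the library path and the addresses are just examples taken from the dmesg line above):

# offset = instruction pointer - library load address
printf '%x\n' $((0x2acf010e8b2b - 0x2acf01078000))    # gives 70b2b
# disassemble with demangling and source/line info, then look for that offset
objdump -DCgl /scg/slurm/current/lib/slurm/libslurmfull.so | grep -A 2 ' 70b2b:'
# cross-check with addr2line (may be off because of optimization)
addr2line -j.text -e /scg/slurm/current/lib/slurm/libslurmfull.so 70b2b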

I wouldn't worry too much about this segfault for now, though.
Comment 80 John Hanks 2019-03-22 10:25:55 MDT
For the sake of completeness (and because I have multiple jobs on multiple nodes that had this segfault in the same way/location):

From the objdump:

/scg/slurm/sources/slurm-18.08.6/src/common/env.c:824
   70b11:       48 8b 05 30 32 34 00    mov    0x343230(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
   70b18:       4c 8b 80 60 04 00 00    mov    0x460(%rax),%r8
   70b1f:       4d 85 c0                test   %r8,%r8
   70b22:       75 0a                   jne    70b2e <setup_env+0x127e>
/scg/slurm/sources/slurm-18.08.6/src/common/env.c:827
   70b24:       48 8b 80 e0 00 00 00    mov    0xe0(%rax),%rax
   70b2b:       4c 8b 00                mov    (%rax),%r8
/scg/slurm/sources/slurm-18.08.6/src/common/env.c:828
   70b2e:       48 8b 05 13 32 34 00    mov    0x343213(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
   70b35:       44 8b 88 90 04 00 00    mov    0x490(%rax),%r9d
   70b3c:       48 8b 88 c8 00 00 00    mov    0xc8(%rax),%rcx
   70b43:       48 8d 7b 70             lea    0x70(%rbx),%rdi
   70b47:       c7 04 24 00 21 00 00    movl   $0x2100,(%rsp)
   70b4e:       48 8d 15 2d 62 0f 00    lea    0xf622d(%rip),%rdx        # 166d82 <__func__.7868+0x4c2>
   70b55:       48 8d 35 32 62 0f 00    lea    0xf6232(%rip),%rsi        # 166d8e <__func__.7868+0x4ce>
   70b5c:       b8 00 00 00 00          mov    $0x0,%eax
   70b61:       e8 1a 38 fc ff          callq  34380 <setenvf@plt>

And,


[root@dper730xd-srcf-d16-35 ~]# addr2line -j.text -e /scg/slurm/current/lib/slurm/libslurmfull.so 70b2b
/scg/slurm/sources/slurm-18.08.6/src/common/plugin.c:162
Comment 81 John Hanks 2019-03-22 10:33:38 MDT
(In reply to Felip Moll from comment #79)
> Thanks, I am investigating the race condition currently.
> 
> One thing you could temporary test is to change proctrack to linuxproc in
> order to see if the error is gone. This has implications though, but would
> give us a good hint to really conclude it is a race condition with these
> plugins.
> 

If I do this live with running jobs, will they be affected?

jbh
Comment 82 Felip Moll 2019-03-25 05:59:30 MDT
(In reply to John Hanks from comment #80)
> For the sake of completeness (and because I have multiple jobs on multiple
> nodes that had this segfault in the same way/location):
> 
> From the objdump:
> 
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:824
>    70b11:       48 8b 05 30 32 34 00    mov    0x343230(%rip),%rax        #
> 3b3d48 <slurmctld_conf@@Base-0xc078>
>    70b18:       4c 8b 80 60 04 00 00    mov    0x460(%rax),%r8
>    70b1f:       4d 85 c0                test   %r8,%r8
>    70b22:       75 0a                   jne    70b2e <setup_env+0x127e>
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:827
>    70b24:       48 8b 80 e0 00 00 00    mov    0xe0(%rax),%rax
>    70b2b:       4c 8b 00                mov    (%rax),%r8
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:828
>    70b2e:       48 8b 05 13 32 34 00    mov    0x343213(%rip),%rax        #
> 3b3d48 <slurmctld_conf@@Base-0xc078>
>    70b35:       44 8b 88 90 04 00 00    mov    0x490(%rax),%r9d
>    70b3c:       48 8b 88 c8 00 00 00    mov    0xc8(%rax),%rcx
>    70b43:       48 8d 7b 70             lea    0x70(%rbx),%rdi
>    70b47:       c7 04 24 00 21 00 00    movl   $0x2100,(%rsp)
>    70b4e:       48 8d 15 2d 62 0f 00    lea    0xf622d(%rip),%rdx        #
> 166d82 <__func__.7868+0x4c2>
>    70b55:       48 8d 35 32 62 0f 00    lea    0xf6232(%rip),%rsi        #
> 166d8e <__func__.7868+0x4ce>
>    70b5c:       b8 00 00 00 00          mov    $0x0,%eax
>    70b61:       e8 1a 38 fc ff          callq  34380 <setenvf@plt>
> 
> And,
> 
> 
> [root@dper730xd-srcf-d16-35 ~]# addr2line -j.text -e
> /scg/slurm/current/lib/slurm/libslurmfull.so 70b2b
> /scg/slurm/sources/slurm-18.08.6/src/common/plugin.c:162

Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?
The plugin.c:162 is trying to load a .so plugin, while the objdump result shows it setting up the environment variable SLURM_WORKING_CLUSTER from slurmctld_conf.control_addr[0].

In any case I would suggest opening a new bug for this segfault.


Regarding the change of ProcTrack from cgroup to linuxproc: I did some tests and your jobs shouldn't be affected. In any case, I am suggesting this as a measure to diagnose the race condition between the cgroup plugins, and to add your ticket to the list of bugs with the same symptoms. You need to restart all slurmds, but the stepds will continue working.

Let me know if you're willing to make this change; otherwise I will continue investigating the race and provide you with a solution when I have one.
Comment 83 John Hanks 2019-03-26 18:26:42 MDT
> Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?

We are now at 18.08.6. No changes to slurm.conf recently except adding nodes and partitions. 
 
> In any case I would suggest opening a new bug for this segfault.

I'll fork a new bug the next time I am cleaning these up on a node. 
 

> 
> Let me know if you're willing to do this change, otherwise I will continue
> investigating the race and when I have some solution provide you with it.

Change made; as of this afternoon we are running with:

ProctrackType=proctrack/linuxproc

If anything good or bad happens, you'll be the first to know :)

jbh
Comment 84 Felip Moll 2019-03-27 10:18:36 MDT
(In reply to John Hanks from comment #83)
> > Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?
> 
> We are now at 18.08.6. No changes to slurm.conf recently except adding nodes
> and partitions. 
>  
> > In any case I would suggest opening a new bug for this segfault.
> 
> I'll fork a new bug the next time I am cleaning these up on a node. 
>  
> 
> > 
> > Let me know if you're willing to do this change, otherwise I will continue
> > investigating the race and when I have some solution provide you with it.
> 
> Change made, as of this afternoon we are running with
> 
> ProctrackType=proctrack/linuxproc
> 
> If anything good or bad happens, you'll be the first to know :)
> 
> jbh

John,

I identified a problem with mixing jobacct_gather/cgroup and proctrack/cgroup.

It turns out that jobacct_gather/cgroup moves the pids of tasks in the extern step into something like /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_27/step_extern/task_0/cgroup.procs, but proctrack/cgroup expects those to be in ...job_27/step_extern/cgroup.procs.

As a result, when the job finishes, the processes inside step_extern (i.e. the sleep processes) are not killed, because the call to xcgroup_get_pids provided by proctrack/cgroup returns no pids. Since there are no pids to kill, trying to send a signal is useless; the processes do not die, the cgroup is not removed, and the job doesn't finish (it stays in CG state). After a timeout, the job is flushed and the node is set to drain.
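
A quick way to see this on an affected node (a minimal sketch; the cgroup paths below are the example ones from above, adjust to your uid/job):

cd /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_27/step_extern
cat cgroup.procs          # what proctrack/cgroup reads; empty in the bad case
cat task_0/cgroup.procs   # where jobacct_gather/cgroup actually put the pids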

The only piece left in the puzzle is that this is happening to me only when the cgroup mountpoints are shared, like:
  cgroup on /sys/fs/cgroup/freezer,memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,freezer)

and that does not seem to be your case.

In any case, switching from jobacct_gather/cgroup to linux, OR proctrack/cgroup to linuxproc, OR disabling the Contain (and X11) flag should work around the problem. Of these 3 options I would go for the jobacct_gather/linux one. I know I asked you to change proctrack, but that is a test I want to do on your system since your cgroup is not shared, and I want to rule out a couple of situations.

Let me know when you have new information.
Comment 85 John Hanks 2019-03-27 11:13:45 MDT
(In reply to Felip Moll from comment #84)
> In any case, switching from jobacct_gather/cgroup to linux , OR
> proctrack/cgroup to linuxproc, OR disabling the flag Contain(and X11) should
> workaround the problem. From these 3 options I would go for the
> jobacct_gather/linux one. I know I asked you to change proctrack, but is a
> test I want to do with your system since your cgroup is not shared, and I
> want to discard a couple of situations.
> 
> Let me know when you have new information.

As of this morning we are now running with: 

ProctrackType=proctrack/cgroup
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity,task/cgroup
PrologFlags=Alloc,Contain,X11


Is that the correct collection of settings you want us to test?

Best,

jbh
Comment 86 Felip Moll 2019-03-28 07:08:15 MDT
(In reply to John Hanks from comment #85)
> (In reply to Felip Moll from comment #84)
> > In any case, switching from jobacct_gather/cgroup to linux , OR
> > proctrack/cgroup to linuxproc, OR disabling the flag Contain(and X11) should
> > workaround the problem. From these 3 options I would go for the
> > jobacct_gather/linux one. I know I asked you to change proctrack, but is a
> > test I want to do with your system since your cgroup is not shared, and I
> > want to discard a couple of situations.
> > 
> > Let me know when you have new information.
> 
> As of this morning we are now running with: 
> 
> ProctrackType=proctrack/cgroup
> MemLimitEnforce=no
> JobAcctGatherParams=NoOverMemoryKill
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> TaskPlugin=task/affinity,task/cgroup
> PrologFlags=Alloc,Contain,X11
> 
> 
> Is that the correct collection of settings you want us to test?
> 
> Best,
> 
> jbh

That should be correct and not give you errors anymore. If it still fails I will ask for more information, specifically:

On a failing node, while the job is still in CG, and for each cgroup subsystem in 'freezer,memory,cpu', show the associated tasks and the cgroup tree, i.e.:

root@deb5:/sys/fs/cgroup# find . -type d -name job_27
./cpuset/slurm_gamba1/uid_1000/job_27
./memory/slurm_gamba1/uid_1000/job_27   <---- do for it
./freezer/slurm_gamba1/uid_1000/job_27  <---- do for it
./cpuacct/slurm_gamba1/uid_1000/job_27

> cd /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000
> for i in $(find job_27 -name tasks -exec cat '{}' \;); do ps aux|grep $i|grep -v grep; done
> find job_27 -name cgroup.procs -exec echo '{}' \; -exec cat '{}' \;
> find job_27 -type f

Let me know about your status.
Comment 87 Felip Moll 2019-03-28 07:11:14 MDT
Btw, in comment 83 and comment 85 there's a mismatch.

Comment 83 says you are running with

ProctrackType=proctrack/linuxproc

Comment 85 says you are running with

ProctrackType=proctrack/cgroup


In any case I am interested in knowing whether the comment 83 change had any benefit, or whether the issue has happened again since comment 85. Both situations should have worked around the issue.
Comment 88 John Hanks 2019-04-08 14:17:08 MDT
(In reply to Felip Moll from comment #87)
> Btw, in comment 83 and comment 85 there's a mismatch.
> 
> Comment 83 says you are running with
> 
> ProctrackType=proctrack/linuxproc
> 
> Comment 85 says you are running with
> 
> ProctrackType=proctrack/cgroup
> 
> 
> In any case I am interested in knowing if comment 83 change had any benefit,
> or if since comment 85 the issue happened again. Both situations should've
> workaround the issue.

Comment 85 is the current state; the comment 83 state was short-lived, and I went through the whole config to verify and arrive at the current (85) state based on rereading your suggestions.

I'm still draining some nodes, but I have been rebooting nodes since that change and have not yet seen the problem hit a freshly rebooted node under the current configuration. Since I like a lively parameter space to explore, the ongoing reboots have been for a kernel update :)

jbh
Comment 89 Felip Moll 2019-04-09 03:37:46 MDT
(In reply to John Hanks from comment #88)
> (In reply to Felip Moll from comment #87)
> > Btw, in comment 83 and comment 85 there's a mismatch.
> > 
> > Comment 83 says you are running with
> > 
> > ProctrackType=proctrack/linuxproc
> > 
> > Comment 85 says you are running with
> > 
> > ProctrackType=proctrack/cgroup
> > 
> > 
> > In any case I am interested in knowing if comment 83 change had any benefit,
> > or if since comment 85 the issue happened again. Both situations should've
> > workaround the issue.
> 
> Comment 85 is the current state, the Comment 83 state was short-lived and I
> went through the whole config to verify and get to the current (85) state
> based on rereading your suggestions.
> 
> I'm still draining some nodes, but have been rebooting nodes since that
> change and have not yet seen the problem hit on a freshly rebooted node
> under the current configuration. Since I like a lively parameter space to
> explore the ongoing reboots have been for a kernel update :)
> 
> jbh

Thanks, please keep me posted if something new happens.
Comment 90 John Hanks 2019-04-10 07:52:16 MDT
 
> Thanks, please keep me posted if something new happens.


I woke up this morning to a large batch of nodes draining with reason="Kill task failed". This in itself is not new; I've been struggling with it on and off for years, but I thought I'd describe it in case it's related to our current problem.

A pstree shows the offending jobs stuck with a sleep and several slurmstepd processes.



           ├─slurmd(25725)
           ├─slurmstepd(48947)─┬─sleep(48952)
           │                   ├─{slurmstepd}(48948)
           │                   ├─{slurmstepd}(48949)
           │                   ├─{slurmstepd}(48950)
           │                   └─{slurmstepd}(48951)
           ├─slurmstepd(48954)─┬─slurm_script(48965,krystal2)───python(49615)───java(49616)─┬─{java}(49617)
           │                   │                                                            ├─{java}(49618)
           [ many java processes snipped here ]
           │                   │                                                            ├─{java}(49652)
           │                   │                                                            └─{java}(49653)
           │                   ├─{slurmstepd}(48955)
           │                   ├─{slurmstepd}(48956)
           │                   ├─{slurmstepd}(48957)
           │                   └─{slurmstepd}(48958)
           ├─slurmstepd(65982)─┬─sleep(65987)
           │                   ├─{slurmstepd}(65983)
           │                   ├─{slurmstepd}(65984)
           │                   ├─{slurmstepd}(65985)
           │                   └─{slurmstepd}(65986)
           ├─slurmstepd(65989)─┬─slurm_script(65999,abahman)───sh(66029)
           │                   ├─{slurmstepd}(65990)
           │                   ├─{slurmstepd}(65991)
           │                   ├─{slurmstepd}(65992)
           │                   └─{slurmstepd}(65993)


A bt of one of these (48947) shows:

(gdb) thread apply all bt full

Thread 5 (Thread 0x2b068dafb700 (LWP 48948)):
#0  0x00002b068ec3c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b068dcb7303 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:373
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2b0694a1e700 (LWP 48949)):
#0  0x00002b068ec3cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b068dcb16b0 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
        err = <optimized out>
        i = 4
        now = 1554903404
        diff = <optimized out>
        tvnow = {tv_sec = 1554742482, tv_usec = 891838}
        abs = {tv_sec = 1554903405, tv_nsec = 891838000}
        __func__ = "_timer_thread"
#2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2b0694b1f700 (LWP 48950)):
#0  0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b068dd4ad77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b06980008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x1a6ef80) at eio.c:328
        retval = 0
        pollfds = 0x2b06980008d0
        map = 0x2b0698000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d98a in _msg_thr_internal (job_arg=0x1a6eb10) at req.c:273
        job = 0x1a6eb10
#4  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b0695a52700 (LWP 48951)):
#0  0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b069513a942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x1a6c9e0
        ret = <optimized out>
        res = 0
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2b068d9f9040 (LWP 48947)):
#0  0x00002b068ef11aea in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000041084f in _spawn_job_container (job=0x1a6eb10) at mgr.c:1183
        jobacct_id = {taskid = 0, nodeid = 0, job = 0x1a6eb10}
        status = 0
        pid = 48952
        jobacct = 0x0
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {
            ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, 
            __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, 
            __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, 
            __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, 
            __ru_nivcsw_word = 0}}
        rc = 0
        x11_pipe = {0, 0}
#2  job_manager (job=job@entry=0x1a6eb10) at mgr.c:1292
        rc = 0
        io_initialized = false
        ckpt_type = 0x1a62660 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#3  0x000000000040d452 in main (argc=1, argv=0x7fffbc2431f8) at slurmstepd.c:174
        cli = 0x1a6d370
        self = 0x1a63010
        msg = 0x1a637f0
        rc = 0
        launch_params = 0x0
        __func__ = "main"


An strace of this process shows this in a loop:

[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322193, {1554903579, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322195, {1554903580, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfaf3fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2b068dfaf3f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 48948] <... futex resumed> )       = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322197, {1554903581, 891838000}, ffffffff <unfinished ...>
[pid 48948] futex(0x2b068dfaf428, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48948] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 48948] open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs", O_RDONLY) = 9
[pid 48948] lseek(9, 0, SEEK_CUR)       = 0
[pid 48948] lseek(9, 0, SEEK_SET)       = 0
[pid 48948] read(9, "4", 1)             = 1
[pid 48948] read(9, "8", 1)             = 1
[pid 48948] read(9, "9", 1)             = 1
[pid 48948] read(9, "5", 1)             = 1
[pid 48948] read(9, "2", 1)             = 1
[pid 48948] read(9, "\n", 1)            = 1
[pid 48948] read(9, "", 1)              = 0
[pid 48948] lseek(9, 0, SEEK_SET)       = 0
[pid 48948] read(9, "48952\n", 6)       = 6
[pid 48948] close(9)                    = 0
[pid 48948] open("/proc/48952/stat", O_RDONLY) = 9
[pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 48948] read(9, "48952 (sleep) S 48947 48952 4894"..., 511) = 320
[pid 48948] open("/proc/48952/status", O_RDONLY) = 15
[pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000
[pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
[pid 48948] close(15)                   = 0
[pid 48948] munmap(0x2b068dafc000, 4096) = 0
[pid 48948] close(9)                    = 0
[pid 48948] open("/proc/48952/io", O_RDONLY) = 9
[pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 48948] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93
[pid 48948] open("/proc/48952/status", O_RDONLY) = 15
[pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000
[pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
[pid 48948] close(15)                   = 0
[pid 48948] munmap(0x2b068dafc000, 4096) = 0
[pid 48948] close(9)                    = 0
[pid 48948] open("/sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq", O_RDONLY) = 9
[pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 48948] read(9, "2401000", 7)       = 7
[pid 48948] close(9)                    = 0
[pid 48948] futex(0x2b068dfaf3fc, FUTEX_WAIT_PRIVATE, 10735, NULL <unfinished ...>
[pid 48949] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322199, {1554903582, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322201, {1554903583, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322203, {1554903584, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0

The job in question shows as running:


(scg)[I am root@frankie ~]# sacct -j 8592288
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
8592288            GERM      batch   ccurtis2          1    RUNNING      0:0 
8592288.ext+     extern              ccurtis2          1    RUNNING      0:0 

although clearly all user parts of it have exited.

scancel works to kill this job.

I don't have much time to dig into this right now but will add more detail when I do.

jbh
Comment 91 Felip Moll 2019-04-10 08:21:01 MDT
(In reply to John Hanks from comment #90)
>  
> > Thanks, please keep me posted if something new happens.
> 
> 
> I woke up this morning to a large batch of nodes draining with reason="Kill
> task failed". This in itself is not new, been struggling with this for years
> on and off but I thought I'd describe it in case it's related to our current
> problem.
> 
> A pstree shows the offending jobs stuck with a sleep and several slurmstepd
> processes.
> 
> 
> 
>            ├─slurmd(25725)
>            ├─slurmstepd(48947)─┬─sleep(48952)
>            │                   ├─{slurmstepd}(48948)
>            │                   ├─{slurmstepd}(48949)
>            │                   ├─{slurmstepd}(48950)
>            │                   └─{slurmstepd}(48951)
>           
> ├─slurmstepd(48954)─┬─slurm_script(48965,
> krystal2)───python(49615)───java(49616)─┬─{java}(49617)
>            │                   │                                            
> ├─{java}(49618)
>            [ many java processes snipped here ]
>            │                   │                                            
> ├─{java}(49652)
>            │                   │                                            
> └─{java}(49653)
>            │                   ├─{slurmstepd}(48955)
>            │                   ├─{slurmstepd}(48956)
>            │                   ├─{slurmstepd}(48957)
>            │                   └─{slurmstepd}(48958)
>            ├─slurmstepd(65982)─┬─sleep(65987)
>            │                   ├─{slurmstepd}(65983)
>            │                   ├─{slurmstepd}(65984)
>            │                   ├─{slurmstepd}(65985)
>            │                   └─{slurmstepd}(65986)
>            ├─slurmstepd(65989)─┬─slurm_script(65999,abahman)───sh(66029)
>            │                   ├─{slurmstepd}(65990)
>            │                   ├─{slurmstepd}(65991)
>            │                   ├─{slurmstepd}(65992)
>            │                   └─{slurmstepd}(65993)
> 
> 
> A bt of one of these (48947) shows:
> 
> (gdb) thread apply all bt full
> 
> Thread 5 (Thread 0x2b068dafb700 (LWP 48948)):
> #0  0x00002b068ec3c965 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /usr/lib64/libpthread.so.0
> No symbol table info available.
> #1  0x00002b068dcb7303 in _watch_tasks (arg=<optimized out>) at
> slurm_jobacct_gather.c:373
>         err = <optimized out>
>         __func__ = "_watch_tasks"
> #2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
> No symbol table info available.
> #3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
> No symbol table info available.
> 
> Thread 4 (Thread 0x2b0694a1e700 (LWP 48949)):
> #0  0x00002b068ec3cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /usr/lib64/libpthread.so.0
> No symbol table info available.
> #1  0x00002b068dcb16b0 in _timer_thread (args=<optimized out>) at
> slurm_acct_gather_profile.c:205
>         err = <optimized out>
>         i = 4
>         now = 1554903404
>         diff = <optimized out>
>         tvnow = {tv_sec = 1554742482, tv_usec = 891838}
>         abs = {tv_sec = 1554903405, tv_nsec = 891838000}
>         __func__ = "_timer_thread"
> #2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
> No symbol table info available.
> #3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
> No symbol table info available.
> 
> Thread 3 (Thread 0x2b0694b1f700 (LWP 48950)):
> #0  0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6
> No symbol table info available.
> #1  0x00002b068dd4ad77 in _poll_internal (shutdown_time=<optimized out>,
> nfds=2, pfds=0x2b06980008d0) at eio.c:364
>         n = <optimized out>
>         timeout = -1
> #2  eio_handle_mainloop (eio=0x1a6ef80) at eio.c:328
>         retval = 0
>         pollfds = 0x2b06980008d0
>         map = 0x2b0698000900
>         maxnfds = 1
>         nfds = 2
>         n = <optimized out>
>         shutdown_time = <optimized out>
>         __func__ = "eio_handle_mainloop"
> #3  0x000000000041d98a in _msg_thr_internal (job_arg=0x1a6eb10) at req.c:273
>         job = 0x1a6eb10
> #4  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
> No symbol table info available.
> #5  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
> No symbol table info available.
> 
> Thread 2 (Thread 0x2b0695a52700 (LWP 48951)):
> #0  0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6
> No symbol table info available.
> #1  0x00002b069513a942 in _oom_event_monitor (x=<optimized out>) at
> task_cgroup_memory.c:478
>         args = 0x1a6c9e0
>         ret = <optimized out>
>         res = 0
>         fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193,
> revents = 0}}
>         __func__ = "_oom_event_monitor"
> #2  0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0
> No symbol table info available.
> #3  0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6
> No symbol table info available.
> 
> Thread 1 (Thread 0x2b068d9f9040 (LWP 48947)):
> #0  0x00002b068ef11aea in wait4 () from /usr/lib64/libc.so.6
> No symbol table info available.
> #1  0x000000000041084f in _spawn_job_container (job=0x1a6eb10) at mgr.c:1183
>         jobacct_id = {taskid = 0, nodeid = 0, job = 0x1a6eb10}
>         status = 0
>         pid = 48952
>         jobacct = 0x0
>         rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec =
> 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {
>             ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0,
> __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, 
>             __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0},
> {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, 
>             __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0},
> {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, 
>             __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word =
> 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, 
>             __ru_nivcsw_word = 0}}
>         rc = 0
>         x11_pipe = {0, 0}
> #2  job_manager (job=job@entry=0x1a6eb10) at mgr.c:1292
>         rc = 0
>         io_initialized = false
>         ckpt_type = 0x1a62660 "checkpoint/none"
>         err_msg = 0x0
>         __func__ = "job_manager"
> #3  0x000000000040d452 in main (argc=1, argv=0x7fffbc2431f8) at
> slurmstepd.c:174
>         cli = 0x1a6d370
>         self = 0x1a63010
>         msg = 0x1a637f0
>         rc = 0
>         launch_params = 0x0
>         __func__ = "main"
> 
> 
> An strace of this process shows this in a loop:
> 
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322193, {1554903579,
> 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322195, {1554903580,
> 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfaf3fc, FUTEX_WAKE_OP_PRIVATE, 1, 1,
> 0x2b068dfaf3f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> [pid 48948] <... futex resumed> )       = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322197, {1554903581,
> 891838000}, ffffffff <unfinished ...>
> [pid 48948] futex(0x2b068dfaf428, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48948]
> stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.
> procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 48948]
> open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.
> procs", O_RDONLY) = 9
> [pid 48948] lseek(9, 0, SEEK_CUR)       = 0
> [pid 48948] lseek(9, 0, SEEK_SET)       = 0
> [pid 48948] read(9, "4", 1)             = 1
> [pid 48948] read(9, "8", 1)             = 1
> [pid 48948] read(9, "9", 1)             = 1
> [pid 48948] read(9, "5", 1)             = 1
> [pid 48948] read(9, "2", 1)             = 1
> [pid 48948] read(9, "\n", 1)            = 1
> [pid 48948] read(9, "", 1)              = 0
> [pid 48948] lseek(9, 0, SEEK_SET)       = 0
> [pid 48948] read(9, "48952\n", 6)       = 6
> [pid 48948] close(9)                    = 0
> [pid 48948] open("/proc/48952/stat", O_RDONLY) = 9
> [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
> [pid 48948] read(9, "48952 (sleep) S 48947 48952 4894"..., 511) = 320
> [pid 48948] open("/proc/48952/status", O_RDONLY) = 15
> [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000
> [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
> [pid 48948] close(15)                   = 0
> [pid 48948] munmap(0x2b068dafc000, 4096) = 0
> [pid 48948] close(9)                    = 0
> [pid 48948] open("/proc/48952/io", O_RDONLY) = 9
> [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
> [pid 48948] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93
> [pid 48948] open("/proc/48952/status", O_RDONLY) = 15
> [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000
> [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
> [pid 48948] close(15)                   = 0
> [pid 48948] munmap(0x2b068dafc000, 4096) = 0
> [pid 48948] close(9)                    = 0
> [pid 48948] open("/sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq",
> O_RDONLY) = 9
> [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
> [pid 48948] read(9, "2401000", 7)       = 7
> [pid 48948] close(9)                    = 0
> [pid 48948] futex(0x2b068dfaf3fc, FUTEX_WAIT_PRIVATE, 10735, NULL
> <unfinished ...>
> [pid 48949] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322199, {1554903582,
> 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322201, {1554903583,
> 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 48949] futex(0x2b068dfad124,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322203, {1554903584,
> 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0
> 
> The job in question shows as running:
> 
> 
> (scg)[I am root@frankie ~]# sacct -j 8592288
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
> ------------ ---------- ---------- ---------- ---------- ---------- -------- 
> 8592288            GERM      batch   ccurtis2          1    RUNNING      0:0 
> 8592288.ext+     extern              ccurtis2          1    RUNNING      0:0 
> 
> although clearly all user parts of it have exited.
> 
> scancel works to kill this job.
> 
> I don't have much time to dig into this right now but will add more detail
> when I do.
> 
> jbh

Thanks John,

Can you do:

ls -R /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/

Also:

cat /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs


and run 'find /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/ -name cgroup.procs', then cat each of the results.

Then for each pid, correlate it with the pids on the system (ps aux|grep xxx).
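
For example, a minimal one-liner for that correlation (reusing the job path from above):

for p in $(find /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/ -name cgroup.procs -exec cat '{}' \;); do ps -fp "$p"; done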


Let's see if the race is the same as the one I already identified.
Comment 92 John Hanks 2019-04-10 09:15:22 MDT
That node got harvested by my node cleaner, but there are others. Here's one:

[root@sgisummit-frcf-111-10 ~]# ls -R /sys/fs/cgroup/freezer/slurm/uid_324759
/sys/fs/cgroup/freezer/slurm/uid_324759:
cgroup.clone_children  freezer.parent_freezing  job_8611335
cgroup.event_control   freezer.self_freezing    notify_on_release
cgroup.procs           freezer.state            tasks

/sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335:
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    step_batch
cgroup.procs           freezer.state            tasks

/sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/step_batch:
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    tasks
cgroup.procs           freezer.state

There is no step_extern in the cgroups for this one, but the cgroup.procs files that do exist appear to be empty. FWIW, the other job on this node, which otherwise still appears to be normal, also has empty cgroup.procs files.

[root@sgisummit-frcf-111-10 ~]# cd /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/
[root@sgisummit-frcf-111-10 job_8611335]# ls
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    step_batch
cgroup.procs           freezer.state            tasks
[root@sgisummit-frcf-111-10 job_8611335]# cat cgroup.procs 
[root@sgisummit-frcf-111-10 job_8611335]# cat step_batch/cgroup.procs 
[root@sgisummit-frcf-111-10 job_8611335]# 

Since this is a different node/job, strace and backtrace below for the sake of completeness.

[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310797, {1554909120, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310799, {1554909121, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169307] futex(0x2ba79a9e93fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2ba79a9e93f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 169306] <... futex resumed> )      = 0
[pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310801, {1554909122, 234054000}, ffffffff <unfinished ...>
[pid 169306] futex(0x2ba79a9e9428, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169306] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 169306] open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", O_RDONLY) = 9
[pid 169306] lseek(9, 0, SEEK_CUR)      = 0
[pid 169306] lseek(9, 0, SEEK_SET)      = 0
[pid 169306] read(9, "1", 1)            = 1
[pid 169306] read(9, "6", 1)            = 1
[pid 169306] read(9, "9", 1)            = 1
[pid 169306] read(9, "3", 1)            = 1
[pid 169306] read(9, "1", 1)            = 1
[pid 169306] read(9, "0", 1)            = 1
[pid 169306] read(9, "\n", 1)           = 1
[pid 169306] read(9, "", 1)             = 0
[pid 169306] lseek(9, 0, SEEK_SET)      = 0
[pid 169306] read(9, "169310\n", 7)     = 7
[pid 169306] close(9)                   = 0
[pid 169306] open("/proc/169310/stat", O_RDONLY) = 9
[pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 169306] read(9, "169310 (sleep) S 169305 169310 1"..., 511) = 326
[pid 169306] open("/proc/169310/status", O_RDONLY) = 15
[pid 169306] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 169306] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba79a536000
[pid 169306] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
[pid 169306] close(15)                  = 0
[pid 169306] munmap(0x2ba79a536000, 4096) = 0
[pid 169306] close(9)                   = 0
[pid 169306] open("/proc/169310/io", O_RDONLY) = 9
[pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 169306] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93
[pid 169306] open("/proc/169310/status", O_RDONLY) = 15
[pid 169306] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 169306] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba79a536000
[pid 169306] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024
[pid 169306] close(15)                  = 0
[pid 169306] munmap(0x2ba79a536000, 4096) = 0
[pid 169306] close(9)                   = 0
[pid 169306] open("/sys/devices/system/cpu/cpu12/cpufreq/cpuinfo_cur_freq", O_RDONLY) = 9
[pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 169306] read(9, "3199951", 7)      = 7
[pid 169306] close(9)                   = 0
[pid 169306] futex(0x2ba79a9e93fc, FUTEX_WAIT_PRIVATE, 10363, NULL <unfinished ...>
[pid 169307] <... futex resumed> )      = -1 ETIMEDOUT (Connection timed out)
[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310803, {1554909123, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310805, {1554909124, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)


(gdb) thread apply all bt full

Thread 5 (Thread 0x2ba79a535700 (LWP 169306)):
#0  0x00002ba79b676965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ba79a6f1303 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:373
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2ba7a1458700 (LWP 169307)):
#0  0x00002ba79b676d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ba79a6eb6b0 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
        err = <optimized out>
        i = 4
        now = 1554909165
        diff = <optimized out>
        tvnow = {tv_sec = 1554753721, tv_usec = 234054}
        abs = {tv_sec = 1554909166, tv_nsec = 234054000}
        __func__ = "_timer_thread"
#2  0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2ba7a1559700 (LWP 169308)):
#0  0x00002ba79b97a20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ba79a784d77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2ba7a40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x1760ad0) at eio.c:328
        retval = 0
        pollfds = 0x2ba7a40008d0
        map = 0x2ba7a4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d98a in _msg_thr_internal (job_arg=0x1759730) at req.c:273
        job = 0x1759730
#4  0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2ba7a248c700 (LWP 169309)):
#0  0x00002ba79b97a20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ba7a1b74942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x175e610
        ret = <optimized out>
        res = 0
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2ba79a433040 (LWP 169305)):
#0  0x00002ba79b94baea in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000041084f in _spawn_job_container (job=0x1759730) at mgr.c:1183
        jobacct_id = {taskid = 0, nodeid = 0, job = 0x1759730}
        status = 0
        pid = 169310
        jobacct = 0x0
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, 
            __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {
            ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, 
            __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, 
            __ru_nsignals_word = 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}
        rc = 0
        x11_pipe = {0, 0}
#2  job_manager (job=job@entry=0x1759730) at mgr.c:1292
        rc = 0
        io_initialized = false
        ckpt_type = 0x17556b0 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#3  0x000000000040d452 in main (argc=1, argv=0x7ffce73c4188) at slurmstepd.c:174
        cli = 0x175b3a0
        self = 0x175b3d0
        msg = 0x1759a60
        rc = 0
        launch_params = 0x0
        __func__ = "main"
(gdb)
Comment 93 Felip Moll 2019-04-10 09:28:18 MDT
(In reply to John Hanks from comment #92)
> That node got harvested by my node cleaner, but there are others. Here's one:
> 
> [root@sgisummit-frcf-111-10 ~]# ls -R /sys/fs/cgroup/freezer/slurm/uid_324759
> /sys/fs/cgroup/freezer/slurm/uid_324759:
> cgroup.clone_children  freezer.parent_freezing  job_8611335
> cgroup.event_control   freezer.self_freezing    notify_on_release
> cgroup.procs           freezer.state            tasks
> 
> /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335:
> cgroup.clone_children  freezer.parent_freezing  notify_on_release
> cgroup.event_control   freezer.self_freezing    step_batch
> cgroup.procs           freezer.state            tasks
> 
> /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/step_batch:
> cgroup.clone_children  freezer.parent_freezing  notify_on_release
> cgroup.event_control   freezer.self_freezing    tasks
> cgroup.procs           freezer.state
> 
> There is no step_batch in the cgroups for this one, but other cgroup.procs
> appear to be empty. FWIW, the other job on this node which otherwise still
> appears to be normal also has empty cgroup.procs files.


The issue is that there's no /sys/fs/cgroup/freezer/*/step_extern/cgroup.procs...

Note that you're pasting me pid 169307's trace, which doesn't seem to correspond to the ls -R you did:

[pid 169306] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0


The trace shows the pid trying to read the step_extern cgroup.procs, which doesn't exist, meaning the race is very similar to the one I already found. The step_extern directory should be there if you have PrologFlags=Contain (or X11).
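
As a quick cross-check on an affected node (a minimal sketch; adjust the slurm cgroup root if yours differs), you can list the job cgroups that are missing a step_extern directory:

for j in /sys/fs/cgroup/freezer/slurm/uid_*/job_*; do
    [ -e "$j" ] || continue                # no matching job cgroups
    [ -d "$j/step_extern" ] || echo "no step_extern: $j"
done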
Comment 94 John Hanks 2019-04-10 09:52:43 MDT
Sorry, bad tracking of jobids and pids on my part; too many open terminals. Let me try that again.

The job that caused the problem is gone now (as far as processes go); it was job 8611335.


(scg)[I am root@frankie ~]# scontrol show jobid=8611335
slurm_load_jobs error: Invalid job id specified
(scg)[I am root@frankie ~]# sacct -j 8611335
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
(scg)[I am root@frankie ~]# 

It did leave behind:

[root@sgisummit-frcf-111-10 job_8611335]# pwd
/sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335
[root@sgisummit-frcf-111-10 job_8611335]# tree
.
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── freezer.parent_freezing
├── freezer.self_freezing
├── freezer.state
├── notify_on_release
├── step_batch
│   ├── cgroup.clone_children
│   ├── cgroup.event_control
│   ├── cgroup.procs
│   ├── freezer.parent_freezing
│   ├── freezer.self_freezing
│   ├── freezer.state
│   ├── notify_on_release
│   └── tasks
└── tasks

[root@sgisummit-frcf-111-10 job_8611335]# cat cgroup.procs 
[root@sgisummit-frcf-111-10 job_8611335]# cat step_batch/cgroup.procs 
[root@sgisummit-frcf-111-10 job_8611335]# 


The log messages I have for it are:

(scg)[I am root@frankie ~]# grep 8611335 /srv/logs/messages-20190409 
Apr  9 18:30:12 frankie slurmctld[171811]: _slurm_rpc_submit_batch_job: JobId=8611335 InitPrio=270 usec=3141
Apr  9 19:05:41 frankie slurmctld[171811]: backfill: Started JobId=8611335 in batch on sgisummit-frcf-111-10
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task_p_slurmd_batch_request: 8611335
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task/affinity: job 8611335 CPU input mask for node: 0x002000
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task/affinity: job 8611335 CPU final HW mask for node: 0x002000
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: _run_prolog: prolog with lock for job 8611335 ran for 0 seconds
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: Launching batch job 8611335 for UID 324759
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71331]: task/cgroup: /slurm/uid_324759/job_8611335: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB
Apr  9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71331]: task/cgroup: /slurm/uid_324759/job_8611335/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB
Apr  9 20:05:42 frankie slurmctld[171811]: Time limit exhausted for JobId=8611335
Apr  9 20:05:42 sgisummit-frcf-111-10 slurmstepd[71331]: error: *** JOB 8611335 ON sgisummit-frcf-111-10 CANCELLED AT 2019-04-09T20:05:42 DUE TO TIME LIMIT ***
Apr  9 20:08:50 frankie slurmctld[171811]: Resending TERMINATE_JOB request JobId=8611335 Nodelist=sgisummit-frcf-111-10
Apr  9 20:09:43 sgisummit-frcf-111-10 slurmstepd[71331]: error: *** JOB 8611335 STEPD TERMINATED ON sgisummit-frcf-111-10 AT 2019-04-09T20:09:42 DUE TO JOB NOT ENDING WITH SIGNALS ***
Apr  9 20:09:43 frankie slurmctld[171811]: error: slurmd error running JobId=8611335 on node(s)=sgisummit-frcf-111-10: Kill task failed
Apr  9 20:09:52 frankie slurmctld[171811]: cleanup_completing: JobId=8611335 completion process took 250 seconds


jbh
Comment 95 John Hanks 2019-04-10 10:03:26 MDT
A quick note about the scope of this: on April 9 we had 50 jobs die this way, and none of them have records in the accounting database.
Comment 96 John Hanks 2019-04-10 10:35:33 MDT
Found one still in progress. The job has reached the end of its time limit:

(scg)[I am root@frankie ~]# squeue -j 8610041
           JOBID    PARTITION                 NAME         USER      ACCOUNT ST    TIME_LEFT   TIME_LIMIT NODES CPUS MIN_MEMORY TRES_PER                 NODELIST(REASON)
         8610041        batch         vcf2fasta.sh      kwalter        jandr CG      INVALID        10:00     1    1         4G      N/A            sgisummit-frcf-111-30

On the node we have:

[root@sgisummit-frcf-111-30 job_8610041]# pwd
/sys/fs/cgroup/freezer/slurm/uid_324759/job_8610041
[root@sgisummit-frcf-111-30 job_8610041]# cat cgroup.procs 
17138
[root@sgisummit-frcf-111-30 job_8610041]# cat step_batch/cgroup.procs 
18556


There are remnants of many of this user's jobs in /sys/fs/cgroup:

[root@sgisummit-frcf-111-30 uid_324759]# pwd
/sys/fs/cgroup/freezer/slurm/uid_324759
[root@sgisummit-frcf-111-30 uid_324759]# ls
cgroup.clone_children    freezer.self_freezing  job_8610086  job_8610112        tasks
cgroup.event_control     freezer.state          job_8610109  job_8610177
cgroup.procs             job_8610041            job_8610110  job_8610178
freezer.parent_freezing  job_8610064            job_8610111  notify_on_release

Looking at a sample of those, no job record:

(scg)[I am root@frankie ~]# sacct -j 8610041,8610064,8610086,8610177
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
(scg)[I am root@frankie ~]# 

Process tree looks like:

           ├─slurmd(25689)─┬─{slurmd}(22892)
           │               ├─{slurmd}(234651)
           │               ├─{slurmd}(96458)
           │               ├─{slurmd}(218953)
           │               ├─{slurmd}(89826)
           │               ├─{slurmd}(53393)
           │               ├─{slurmd}(846)
           │               └─{slurmd}(182218)
           ├─slurmstepd(17138)─┬─slurm_script(18556,kwalter)───sh(21597)
           │                   ├─{slurmstepd}(17139)
           │                   ├─{slurmstepd}(17140)
           │                   ├─{slurmstepd}(17141)
           │                   ├─{slurmstepd}(17142)
           │                   └─{slurmstepd}(24210)

and attaching to 25689 results in a non-responsive gdb. strace shows:

[root@sgisummit-frcf-111-30 ~]# strace -ff -p 25689
strace: Process 25689 attached with 9 threads
[pid 182218] read(24,  <unfinished ...>
[pid   846] read(22,  <unfinished ...>
[pid 53393] read(20,  <unfinished ...>
[pid 89826] read(18,  <unfinished ...>
[pid 218953] read(16,  <unfinished ...>
[pid 96458] read(14,  <unfinished ...>
[pid 234651] read(11,  <unfinished ...>
[pid 22892] read(7,  <unfinished ...>
[pid 25689] accept(5, 

Those all seem to be reading from a socket:

[root@sgisummit-frcf-111-30 ~]# ls -l /proc/182218/fd/24
lrwx------ 1 root root 64 Apr 10 09:15 /proc/182218/fd/24 -> socket:[7847063]
[root@sgisummit-frcf-111-30 ~]# ls -l /proc/846/fd/22
lrwx------ 1 root root 64 Apr 10 09:15 /proc/846/fd/22 -> socket:[7781149]
[root@sgisummit-frcf-111-30 ~]# ls -l /proc/53393/fd/20
lrwx------ 1 root root 64 Apr 10 09:15 /proc/53393/fd/20 -> socket:[7819303]
[root@sgisummit-frcf-111-30 ~]# ls -l /proc/89826/fd/18
lrwx------ 1 root root 64 Apr 10 09:16 /proc/89826/fd/18 -> socket:[7800699]

which is connected to the slurmd?

[root@sgisummit-frcf-111-30 ~]# netstat -anp | grep 7847063
unix  3      [ ]         STREAM     CONNECTED     7847063  25689/slurmd         
[root@sgisummit-frcf-111-30 ~]# netstat -anp | grep 7781149
unix  3      [ ]         STREAM     CONNECTED     7781149  25689/slurmd         

Note: if I leave strace on there long enough I can see nhc (the health check) run, but then it goes back to the above state.

I managed to kill slurmd (kill -11 PID) and now it hangs on startup:

[root@sgisummit-frcf-111-30 ~]# slurmd -D -v -v -v -v -v -v -v -v
slurmd: debug:  Log file re-opened
slurmd: debug4: found jobid = 8610041, stepid = 4294967294
slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.6/lib/slurm/auth_munge.so
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug3: Success.

Starting slurmd under gdb and letting it run for 30 seconds produces this bt:

(gdb) thread apply all bt full

Thread 1 (Thread 0x2aaaaaadb640 (LWP 148207)):
#0  0x00002aaaac3566e0 in __read_nocancel () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002aaaab244b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x66fe68) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x7fffffffd908 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x65ccf0
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040b9b5 in _update_logging () at slurmd.c:2025
        fd = <optimized out>
        steps = 0x638ec0
        i = 0x66b940
        stepd = 0x66fe50
        o = <optimized out>
        cf = <optimized out>
#3  0x000000000040d130 in _read_config () at slurmd.c:921
        path_pubkey = 0x0
        cf = 0x633f40 <slurmctld_conf>
        cc = <optimized out>
        cr_flag = true
        gang_flag = false
        tok = <optimized out>
        save_ptr = 0x0
        over_memory_kill = false
        __func__ = "_read_config"
#4  0x000000000040eb57 in _slurmd_init () at slurmd.c:1543
        rlim = {rlim_cur = 18374686479671623680, rlim_max = 71776119061217280}
        stat_buf = {st_dev = 7596848497658783603, st_ino = 3543601059117542772, st_nlink = 18446462805749870897, st_mode = 4294967295, st_uid = 4294967295, 
          st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, 
            tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}}
        cpu_cnt = <optimized out>
#5  main (argc=10, argv=0x7fffffffdcd8) at slurmd.c:261
        i = <optimized out>
        pidfd = <optimized out>
        blocked_signals = {13, 0}
        cc = <optimized out>
        oom_value = <optimized out>
        slurmd_uid = 0
        curr_uid = 0
        time_stamp = '\000' <repeats 128 times>, "\002", '\000' <repeats 111 times>...
        __func__ = "main"


The hung job is still there:

[root@sgisummit-frcf-111-30 ~]# strace -ff -p 17138
strace: Process 17138 attached with 6 threads
[pid 17142] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 17141] futex(0x2b515b002520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 17140] futex(0x2b515b002520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 17139] futex(0x2b515b0073fc, FUTEX_WAIT_PRIVATE, 61, NULL <unfinished ...>
[pid 17138] wait4(-1, 

But I can't interrupt to get a bt or convince it to dump core.

jbh
Comment 97 John Hanks 2019-04-10 10:43:26 MDT
slurmd *did* dump a core file at some point, I think maybe when I originally killed it with -11, but possibly during one of the startup attempts after that (the timing is pretty close). It shows:

(gdb) thread apply all bt full

Thread 9 (Thread 0x2b93a0b3c700 (LWP 182218)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93fc01a638) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a0b3b958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93fc001450
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b93fc000cb0) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fc370
        stepd = 0x2b93fc01a620
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d700
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b93fc000cb0
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93fc0008d0) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x151a8c0
        msg = 0x2b93fc0008d0
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 8 (Thread 0x2b93a2438700 (LWP 846)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b9400004be8) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a2437958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b9400017d20
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b9400004d50) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fcb40
        stepd = 0x2b9400004bd0
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d6e0
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b9400004d50
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b9400000970) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x1564520
        msg = 0x2b9400000970
        __func__ = "_service_connection"
---Type <return> to continue, or q <return> to quit---
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x2b93a2d41700 (LWP 89826)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b939c01cd98) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a2d40958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b939c008400
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b939c01f040) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fcfa0
        stepd = 0x2b939c01cd80
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d620
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b939c01f040
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b939c01df00) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x150d5c0
        msg = 0x2b939c01df00
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x2b93a1f33700 (LWP 234651)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93e0019f08) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a1f32958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93e0003ce0
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b93e0000a80) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fc140
        stepd = 0x2b93e0019ef0
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 1
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d640
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b93e0000a80
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93e0000970) at req.c:388
        rc = 0
---Type <return> to continue, or q <return> to quit---
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x1561720
        msg = 0x2b93e0000970
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x2b93a3549700 (LWP 22892)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=jobid@entry=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93f001bc28) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a3548a58 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93f0001420
        len = 16384
        __func__ = "stepd_connect"
#2  0x0000000000416d12 in _job_still_running (job_id=8610041) at req.c:4661
        fd = <optimized out>
        retval = false
        steps = 0x14fc870
        i = 0x152d660
        s = 0x2b93f001bc10
#3  _pause_for_job_completion (job_id=8610041, nodes=<optimized out>, max_time=max_time@entry=0) at req.c:5752
        sec = 241
        pause = 10
        rc = <optimized out>
        count = 4
#4  0x0000000000418bff in _rpc_terminate_job (msg=msg@entry=0x2b93f0007280) at req.c:5554
        have_spank = <optimized out>
        rc = 0
        req = 0x2b93f0005c20
        uid = 383
        nsteps = <optimized out>
        delay = <optimized out>
        job_env = {jobid = 1, step_id = 0, node_list = 0x0, partition = 0x17f <Address 0x17f out of bounds>, 
          resv_id = 0x416c5d <_kill_all_active_steps+213> "E\205\344u&E\205\377\270u\220B", spank_job_env = 0xfffe0000, spank_job_env_size = 0, uid = 15, 
          user_name = 0xf00000000 <Address 0xf00000000 out of bounds>}
        __func__ = "_rpc_terminate_job"
#5  0x0000000000418f3c in _rpc_timelimit (msg=msg@entry=0x2b93f0007280) at req.c:3914
        uid = 383
        req = 0x2b93f0005c20
        nsteps = <optimized out>
        rc = <optimized out>
#6  0x00000000004196ec in slurmd_req (msg=msg@entry=0x2b93f0007280) at req.c:337
        rc = <optimized out>
        __func__ = "slurmd_req"
#7  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x1547ff0
        msg = 0x2b93f0007280
        __func__ = "_service_connection"
        rc = 0
#8  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#9  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2b93a2e42700 (LWP 53393)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93bc01a288) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a2e41958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93bc001fa0
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b93bc01b0e0) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fc230
        stepd = 0x2b93bc01a270
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
---Type <return> to continue, or q <return> to quit---
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d6c0
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b93bc01b0e0
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93bc01af50) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x154bdf0
        msg = 0x2b93bc01af50
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2b93a2034700 (LWP 218953)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93f8003528) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a2033958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93f8016ef0
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b93f80008d0) at slurmd.c:770
        fd = <optimized out>
        steps = 0x14fc6e0
        stepd = 0x2b93f8003510
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d6a0
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b93f80008d0
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93f8000e90) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x151a5e0
        msg = 0x2b93f8000e90
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b93a3347700 (LWP 96458)):
#0  0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, 
    protocol_version=protocol_version@entry=0x2b93ac0188e8) at stepd_api.c:282
        remaining = <optimized out>
        ptr = 0x2b93a3346958 ""
        rc = <optimized out>
        req = 0
        fd = <optimized out>
        rc = 0
        auth_cred = <optimized out>
        auth_info = 0x0
        local_nodename = 0x0
        buffer = 0x2b93ac015e90
        len = 16384
        __func__ = "stepd_connect"
#2  0x000000000040df81 in _fill_registration_msg (msg=0x2b93ac01bef0) at slurmd.c:770
---Type <return> to continue, or q <return> to quit---
        fd = <optimized out>
        steps = 0x14fc690
        stepd = 0x2b93ac0188d0
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, 
          release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, 
          machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
        n = 0
        os = <optimized out>
        slurmd_start_time = 1554152209
        i = 0x152d680
        arch = <optimized out>
        gres_info = <optimized out>
        first_msg = false
#3  send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629
        ret_val = 0
        msg = 0x2b93ac01bef0
        __func__ = "send_registration_msg"
#4  0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93ac01c020) at req.c:388
        rc = 0
        __func__ = "slurmd_req"
#5  0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549
        con = 0x150a390
        msg = 0x2b93ac01c020
        __func__ = "_service_connection"
        rc = 0
#6  0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#7  0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2b93942b5600 (LWP 25689)):
#0  0x00002b9395b2fd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000040ca6b in _wait_for_all_threads (secs=secs@entry=120) at slurmd.c:501
        ts = {tv_sec = 1554913503, tv_nsec = 0}
        rc = <optimized out>
        __func__ = "_wait_for_all_threads"
#2  0x000000000040fb16 in main (argc=1, argv=0x7ffe9b39f5f8) at slurmd.c:389
        i = <optimized out>
        pidfd = 6
        blocked_signals = {13, 0}
        cc = <optimized out>
        oom_value = <optimized out>
        slurmd_uid = <optimized out>
        curr_uid = <optimized out>
        time_stamp = "Mon, 01 Apr 2019 13:56:47 -0700", '\000' <repeats 129 times>, "\002", '\000' <repeats 79 times>...
        __func__ = "main"
(gdb)
Comment 98 John Hanks 2019-04-10 11:12:32 MDT
In all the fun I forgot to check for our original symptom for this bug; we have zombies:

[root@sgisummit-frcf-111-30 slurmd]# ps aux | grep Z
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
kwalter   21597  0.0  0.0      0     0 ?        Z    Apr09   0:00 [sh] <defunct>
kwalter   25592  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25593  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25595  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25596  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25597  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25599  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25600  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>
kwalter   25601  0.0  0.0      0     0 ?        Zl   Apr09   0:01 [javac] <defunct>

jbh
Comment 99 Felip Moll 2019-04-10 12:38:36 MDT
Again I cannot see evidence that the subdirectory step_extern exists.

In all your examples it should be there, but it isn't:

Apr  9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB

Thanks for the info, I'll dig more.

I'm sorry to be taking so long with this issue; it is not trivial, and there are many different situations (also at other customer sites) that are still confusing.
Comment 100 John Hanks 2019-04-10 13:06:07 MDT
I've checked the remaining nodes with problem jobs and confirmed that there is no step_extern in /sys/fs/cgroup/slurm/.... for any of them. 
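
For the record, the check I'm running is roughly this (a rough sketch, cgroup v1 layout as on these nodes):

ls -d /sys/fs/cgroup/*/slurm/uid_*/job_*/step_extern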

Will keep an eye out for any more changes in behavior of these.
Comment 101 Felip Moll 2019-05-06 07:15:54 MDT
John,

After analyzing your backtraces line by line, and given that when you restarted slurmd it hung with most of its threads in stepd_connect(), I wanted to check the code in slurmd.c and the calls to this function, but again without finding the cause.

Within slurmd.c, stepd_connect() is called from just two places: _update_logging() and _fill_registration_msg().
Your backtraces confirm that the slurmd threads are stuck in these code paths, which would mean the problem is in slurmstepd, since what is happening is that slurmd is connecting to the stepd socket.

The call is such as:

 fd = stepd_connect(stepd->directory, stepd->nodename, stepd->jobid, stepd->stepid,&stepd->protocol_version);

As already stated many times in this bug, you cannot even attach to slurmstepd:

..........
Reading symbols from /scg/slurm/installed/slurm-18.08.6/sbin/slurmstepd...done.
Attaching to program: /scg/slurm/current/sbin/slurmstepd, process 103421

^C^C^C^C^C^C^C^C


And strace is a dead end:

[root@dper7425-srcf-d15-27 ~]# strace -ff -p 103421
strace: Process 103421 attached
...........


Which I guess puts slurmd in the same situation: it is trying to talk to an unresponsive peer on that socket. stepd_connect() in the end just does a connect() syscall and then blocking reads on the resulting socket.
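
To make that failure mode concrete, here is a minimal sketch of the connect()-then-read() pattern that stepd_connect() boils down to. This is not Slurm's actual implementation and the socket path below is made up, but it shows how, with blocking I/O and no timeout, a thread sits in read() forever if the stepd side accepts the connection and never answers, which is exactly where the slurmd threads in your core file are parked:

/* Minimal sketch only, NOT Slurm code; the socket path is hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_un addr;
	uint16_t protocol_version;
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* Hypothetical spool path for an extern step's socket. */
	strncpy(addr.sun_path, "/var/spool/slurmd/node1_8610041.4294967294",
		sizeof(addr.sun_path) - 1);

	/* connect() succeeds as long as slurmstepd still accept()s. */
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	/* Blocking read with no timeout: if slurmstepd never writes a
	 * reply, this call never returns, just like the read() frames
	 * in the backtraces. */
	if (read(fd, &protocol_version, sizeof(protocol_version)) < 0)
		perror("read");

	close(fd);
	return 0;
}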

At this point I think we need to forget about slurmd and the other components and focus on slurmstepd and on why it is hung so hard. We should be able to attach to it, as we already discussed earlier in this bug. What can a program be doing that makes it non-attachable, aside from I/O problems?

I also think that, over the course of this ticket, you've seen some other issues and we have mixed in some other situations, but the original one is the one presented in your last comments.

One test I'd like you to do is to attach gdb to a hung slurmstepd and then try to send SIGCONT (and do a 'next' in gdb), or, if it doesn't respond at all, SIGABRT or SIGKILL. We need to see what slurmstepd is doing. You can even try attaching to the individual slurmstepd threads and extracting a bt for each of them until you find one that is non-attachable.
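
Roughly, something like this (the PID is a placeholder; you can also point -p at an individual thread's TID):

gdb /scg/slurm/current/sbin/slurmstepd -p <stepd_pid>
(gdb) thread apply all bt
(gdb) signal SIGCONT

and, if it does not respond at all, from another shell:

kill -ABRT <stepd_pid>
kill -KILL <stepd_pid>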

If slurmstepd does not end with any signal other than SIGKILL, then when slurmd tries to SIGTERM it, it indeed won't terminate, and errors will show up in the log. This was seen:
---
Aug  9 20:12:36 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092791 Nodelist=sgiuv300-srcf-d10-01
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug  9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd      D ffff96cac7dd0000     0 377005      1 0x00000080
Aug  9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***
---

Your user/program/environment/fs is able to trigger these conditions in some way...
Comment 102 Felip Moll 2019-06-03 10:08:13 MDT
Hi John,

Have you been able to do the test described in my previous comment?
I really do think that we should be able to attach to these processes if they are not in a D state waiting for I/O, at least with strace.

Otherwise I really don't know how to debug it, and I am again inclined to suspect some kind of system issue.

Just to recap: is this error happening on systems that share the same node image?
Comment 103 John Hanks 2019-06-03 11:47:44 MDT
Hi Felip,

We've identified an issue with our Isilon storage which causes access to
individual files on an NFS mount to hang indefinitely while still working
for all other files on the same NFS mount. I don't have a direct link from
this to the Slurm problem yet and I haven't been able to catch a node with
this problem recently to do any further testing.

I am talking to Isilon about it this week, but again from their side it
occurs rarely and I have no reliable way to reproduce so I expect it to
take time to sort out the issue.

Hopefully I will have something to report soon. Regardless of the outcome
though I still think that Slurm dropping job accounting records for any
reason is a Slurm bug that should be addressed.

Best,

jbh

Comment 104 Felip Moll 2019-06-04 09:28:10 MDT
A storage issue like the one you're describing would make a lot of sense given the behavior we're observing.
Maybe the NFS 'soft' mount option could help a bit with this, at the cost of possibly corrupting data.

I will wait for more information from your side, especially if you are able to correlate the two issues. I want to point out that this happened on multiple systems, not only on the UV300, so this reinforces the theory.

> Regardless of the outcome
> though I still think that Slurm dropping job accounting records for any
> reason is a Slurm bug that should be addressed.

If you're seeing this in 18.08.6, this is probably caused by a regression introduced in this specific version.

The issue was introduced in commit ade9101e95925b01005b465c5451e243578c2fd8 in 18.08.6.
Was fixed later in 18.08.7, commit 3361bf611c61de3bb90f8cadbacf58b4d1dc8707. See Bug 6697

The workaround is to specify a start time with '-S' even if you are specifying -j.
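
For example, something along these lines (using the job id from your backtraces and an arbitrary start time earlier than the job):

sacct -j 8610041 -S 2019-04-01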

Tell me if this is the case. If it is not, I'll ask for a direct query in the database to see if the jobs are really accounted.
Comment 105 John Hanks 2019-06-04 12:23:22 MDT
The sacct problem is resolved; it works fine now. I missed that it was a different bug/problem.

Will let you know how our storage investigations go.
Comment 106 Felip Moll 2019-06-05 03:52:36 MDT
(In reply to John Hanks from comment #105)
> The sacct problem is resolved, it works fine now. I missed that it was a
> different bug/problem.

Great, did you upgrade?

> Will let you know how our storage investigations go.

Please, keep me updated.
Comment 107 John Hanks 2019-06-05 10:12:59 MDT
(In reply to Felip Moll from comment #106)
> (In reply to John Hanks from comment #105)
> > The sacct problem is resolved, it works fine now. I missed that it was a
> > different bug/problem.
> 
> Great, did you upgrade?
> 

Yes, updating the ticket to show 18.08.7.
Comment 108 Felip Moll 2019-07-01 04:29:52 MDT
Hi John,

How is everything going? Did you fix your storage? Is the problem gone afterwards?
Comment 109 John Hanks 2019-07-01 09:55:33 MDT
(In reply to Felip Moll from comment #108)
> Hi John,
> 
> How is everything going? Did you fix your storage? Is the problem gone
> afterwards?

Since changing my NFS mount options from "hard,intr" to "soft,intr" we've had a few nodes enter the "Kill task failed" state, but I haven't found any hanging files on them. We've also done some kernel updates along the way. 
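
For reference, the change amounts to a mount entry roughly like this (placeholder server name and mount point, options as described):

isilon.example.com:/ifs/scg  /mnt/isilon  nfs4  soft,intr,vers=4.0  0  0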

There's not much reason to leave this open until I catch the problem happening again and can provide some debugging feedback; maybe we'll get lucky and the 'soft' change will be a sufficient workaround. I'll mark this resolved/cannotreproduce and will re-open if that status changes.
Comment 110 Felip Moll 2019-07-01 10:04:44 MDT
(In reply to John Hanks from comment #109)
> (In reply to Felip Moll from comment #108)
> > Hi John,
> > 
> > How is everything going? Did you fix your storage? Is the problem gone
> > afterwards?
> 
> Since changing my mount options for NFS from "hard,intr" to "soft,intr"
> we've had a few nodes enter "Kill task failed" state, but I haven't found
> any hanging files on them. We've also done some kernel updates along the
> way. 
> 
> There's not much reason to leave this open until I catch the problem
> happening again and can provide some debugging feedback, maybe we'll get
> lucky and the 'soft' change will be a sufficient workaround. I'll mark this
> resolved, cannotreproduce and will re-open if that status changes.

Understood, reopen as needed.

If you still want me to check on these nodes with the "Kill task failed" state, attach the logs here, but I think you definitely have some issue with this NFS.

Note that the intr / nointr option is deprecated and ignored after kernel 2.6.25; see 'man nfs'.
Comment 111 John Hanks 2019-07-12 14:21:15 MDT
Just adding a note here: we had this occur again this week with `soft` in place, and with the updated mount options Slurm seems to be able to properly kill and clean up jobs that were hung on a specific file. All signs point to an NFS issue with our Isilon OneFS storage, which is mounted NFS4 with vers=4.0. Just throwing a lot of keywords in here in case anyone else has the same problem.

I still cannot force this to reproduce, and it is still difficult to debug due to the nature of the hanging jobs and the rarity with which it occurs.
Comment 112 Felip Moll 2019-07-25 13:36:03 MDT
Thank you, John, for this valuable information.
Please don't hesitate to add whatever you think is necessary here.

Regards,
Felip