Hi, I have a special node, an SGI UV300, which recently (a couple of months ago) started having an issue with jobs completing. After a reboot, things work great for some period of time, days or in some cases a few weeks, until jobs begin to fail to complete and the slurmstepd processes leave behind zombie sleep processes.

For example, using pstree, here is a job that has exited, but its slurmstepd is left hanging:

[root@sgiuv300-srcf-d10-01 ~]# pstree -a -c -p -S 421660
slurmstepd,421660
├─sleep,421678 100000000
├─{slurmstepd},421665
├─{slurmstepd},421666
├─{slurmstepd},421667
└─{slurmstepd},421677

Here's a backtrace from that slurmstepd:

(gdb) bt
#0  0x00002b6e1f95722a in wait4 () from /usr/lib64/libc.so.6
#1  0x000000000040fbff in _spawn_job_container (job=0x1690ec0) at mgr.c:1102
#2  job_manager (job=job@entry=0x1690ec0) at mgr.c:1211
#3  0x000000000040ca57 in main (argc=1, argv=0x7ffe93055fe8) at slurmstepd.c:172

The zombies seem to appear after this message is logged:

Aug 9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***

and for the wait4() slurmstepd, the slurm logs look completely normal. Since none of our other nodes show this behavior, I'm inclined to blame the UV300 (a node which is not my friend). But the problem doesn't affect anything else on the machine, just stuff started via SLURM. Since the end result is slurmstepd in wait4() or zombie sleep processes, I'm not quite sure where to prod for more information. Suggestions for the next troubleshooting step are appreciated.

Other details of possible interest: the first instance of this problem I can recall was with 17.11.7 (it may have happened before and gone unnoticed, as the system does get rebooted more often than I'd like). The node was upgraded to CentOS 7.5 from 7.4, but worked fine for a while after that upgrade.

jbh
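For anyone chasing the same symptom: a quick, generic way (not specific to Slurm) to enumerate the zombies and see which parent PID is failing to reap them, before the node fills up:

```shell
# Show every zombie (Z-state) process with its parent PID and command;
# the header row (NR == 1) is kept so the columns stay labeled.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```

Feeding the PPID column back into `ps -p <ppid>` then identifies the process (slurmstepd or otherwise) that never called wait().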
Hi John,

This issue is very similar to your other bug 5443. Isn't it a duplicate?

In case you think it isn't: it would be useful to get a "thread apply all bt full" to see the full GDB dump of all the threads. In any case I guess the process failed to send the proper signal to slurmstepd for some reason, so the bt may be correct.

I would also need to see an "scontrol show job" for the CG job and the full slurmd+slurmctld logs. If you can attach your up-to-date slurm.conf + cgroup.conf it would be great. Finally, if you have prolog/epilog scripts, please attach them here as well.

As a quick test, can you submit an "srun /bin/true" and see if it reproduces?

Are you really running 17.11.9? Kilian commented that you downgraded to 17.11.8, but I don't know on which system.
(In reply to Felip Moll from comment #2)
> Hi John,
>
> This issue is very similar to your other bug 5443. Isn't it a duplicate?

Sorry, I should have mentioned this is a different cluster. Kilian's bug is with Sherlock; this system is the SCG cluster. You can think of us as a poor little system that lives in subsidised housing on Sherlock's street.

> In case you think it isn't:

It's possible they share underlying causes, but the behavior I'm seeing is different. My jobs either complete but leave behind the zombie sleeps, or they stay in Running state forever, even after the slurmstepd has been killed (again leaving behind an optional zombie sleep).

> It would be useful to get a "thread apply all bt full" to see the full dump
> of GDB in all the threads. In any case I guess the process failed to send
> the proper signal to slurmstepd for some reason, so the bt may be correct.

Here's a sample; I'll need to increase log levels and send more later. Job 3092804 ended, but its slurmstepd is still hanging around and it's not in completing state. Logs:

/srv/logs/messages-20180810:Aug 10 10:07:24 sgiuv300-srcf-d10-01 slurmd[36881]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job3092804/slurm_script
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: Batch JobId=3092804 missing from node 0 (not found BatchStartTime after startup), Requeuing job
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x1 NodeCnt=1 WTERMSIG 126
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x1 NodeCnt=1 cancelled by node failure
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node failure
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: JobID=3092804 State=0x8000 NodeCnt=1 done
/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: Requeuing JobID=3092804 State=0x0 NodeCnt=0
/srv/logs/messages-20180810:Aug 10 12:20:36 frankie slurmctld[121446]: sched: Allocate JobID=3092804 NodeList=sgiuv300-srcf-d10-01 #CPUs=16 Partition=nih_s10
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task_p_slurmd_batch_request: 3092804
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task/affinity: job 3092804 CPU input mask for node: 0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000FFFFFFFF000000000000000000000000000000000000000000000000000000000000000000000000000000000
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: task/affinity: job 3092804 CPU final HW mask for node: 0x0000000000000000000000000000000000000000000003FFFC00000000000000000000000000000000000000000000000000000000000000000000000000000000000003FFFC0000000000000000000000000000000000000000
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmd[36884]: _run_prolog: prolog with lock for job 3092804 ran for 0 seconds
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmstepd[377334]: task/cgroup: /slurm/uid_22292/job_3092804: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:36 sgiuv300-srcf-d10-01 slurmstepd[377334]: task/cgroup: /slurm/uid_22292/job_3092804/step_extern: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmd[36884]: Launching batch job 3092804 for UID 22292
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmstepd[377399]: task/cgroup: /slurm/uid_22292/job_3092804: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180810:Aug 10 12:20:37 sgiuv300-srcf-d10-01 slurmstepd[377399]: task/cgroup: /slurm/uid_22292/job_3092804/step_batch: alloc=65536MB mem.limit=65536MB memsw.limit=98304MB
/srv/logs/messages-20180811:Aug 11 12:39:04 frankie slurmctld[121446]: _slurm_rpc_kill_job: REQUEST_KILL_JOB job 3092804 uid 22292
/srv/logs/messages-20180811:Aug 11 12:39:04 sgiuv300-srcf-d10-01 slurmstepd[377399]: error: *** JOB 3092804 ON sgiuv300-srcf-d10-01 CANCELLED AT 2018-08-11T12:39:04 ***
/srv/logs/messages-20180811:Aug 11 12:42:07 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092804 Nodelist=sgiuv300-srcf-d10-01
/srv/logs/messages-20180811:Aug 11 12:43:35 sgiuv300-srcf-d10-01 slurmstepd[377334]: error: *** EXTERN STEP FOR 3092804 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-11T12:43:34 DUE TO JOB NOT ENDING WITH SIGNALS ***
/srv/logs/messages-20180811:Aug 11 12:43:35 sgiuv300-srcf-d10-01 slurmstepd[377399]: error: *** JOB 3092804 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-11T12:43:34 DUE TO JOB NOT ENDING WITH SIGNALS ***
/srv/logs/messages-20180811:Aug 11 12:43:35 frankie slurmctld[121446]: error: slurmd error running JobId=3092804 on node(s)=sgiuv300-srcf-d10-01: Kill task failed
/srv/logs/messages-20180811:Aug 11 12:43:35 frankie slurmctld[121446]: cleanup_completing: job 3092804 completion process took 271 seconds

When I try to attach with gdb, gdb hangs forever (or at least up to the 15 minutes I was willing to wait). Tried this on several of these with the same result.

> I would also need to see an "scontrol show job" for the CG job and the full
> slurmd+slurmctld log.

Job has finished, so nothing to show. I'll need to increase the log level and turn on logging to a file, which means I'll have this for you next time.

> If you can attach your up-to-date slurm.conf + cgroup.conf it would be great.
>
> Finally if you have prolog/epilog scripts, attach them also here please.

Will attach these shortly.

> Are you really running 17.11.9? Kilian commented that you downgraded to
> 17.11.8, but I don't know in which system.

SCG doesn't downgrade. No retreat, no surrender.

Thanks,

jbh
Created attachment 7591 [details] SCG SLURM config Current SLURM config on SCG.
At first glance your configuration is fine.

I'd also like to see your slurmd log on the node sgiuv300-srcf-d10-01.

The job was requeued for some reason:

/srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]: _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node failure

Thanks
(In reply to Felip Moll from comment #5)
> At a first glance your configuration is fine.
>
> I'd like to see also your slurmd log in the node sgiuv300-srcf-d10-01.
>
> The job was requeued by some reason:
>
> /srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]:
> _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node
> failure
>
> Thanks

Ah sorry, I see your logs are centralized.

In any case I need more history: I need to see what happened to the node before that time and see why it showed a node failure. Also, can you analyze the system logs for any problem around Aug 10th 10:00 and before?
(In reply to Felip Moll from comment #6)
> (In reply to Felip Moll from comment #5)
> > The job was requeued by some reason:
> >
> > /srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld[121446]:
> > _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node
> > failure

When the zombie processes start to accumulate, the node eventually reaches a condition where no new jobs will start. At that point it is rebooted, and any jobs still in Running state (although there are no processes associated with them) get requeued due to node failure.

> Ah sorry, I see your logs are centralized.
>
> In any case I need more history, I need to see what happened to the node
> before that time and see why it shown a node failure. Also, can you analyze
> the system logs for any problem around 10th Aug 10:00 and before?

Our workload shifted this week from lots of small jobs turning over relatively quickly to a small number of very large jobs and not much turnover. Currently the node is working fine. When I get some time, hopefully next week, I'll crank up the logging verbosity and try to reproduce with a flood of small test jobs.
> Our workload shifted this week from lots of small jobs turning over
> relatively quickly to a small number of very large jobs and not much
> turnover. Currently the node is working fine. When I get some time,
> hopefully next week, I'll crank up the logging verbosity and try to
> reproduce with a flood of small test jobs.

OK, but do you still have the older logs in the meantime?
(In reply to Felip Moll from comment #8)
> Ok, don't you have the older logs in the meantime?

Only what was logged at info level and sent to syslog. Which means I have lots of logs like the ones for the job I posted here earlier, so I can increase that sample size. Will upload logs for August. During this period the node was rebooted to clear the problem on:

1st @ 8:00 AM
5th @ 6:30 AM
9th @ 6:00 AM
9th @ 5:30 PM
10th @ 10:20 AM
13th @ 3:15 PM
Created attachment 7620 [details] logs for uv300 for August
From what I see, at some point there is a slurmstepd in uninterruptible sleep (D state) before the problem starts:

Aug 9 20:12:36 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092791 Nodelist=sgiuv300-srcf-d10-01
Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd D ffff96cac7dd0000 0 377005 1 0x00000080
Aug 9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***
Aug 9 20:14:12 frankie slurmctld[121446]: update_node: node sgiuv300-srcf-d10-01 reason set to: Kill task failed
Aug 9 20:14:12 frankie slurmctld[121446]: update_node: node sgiuv300-srcf-d10-01 state set to DRAINING

This makes me think of possible I/O issues. What filesystems do you mount on this server? Can you check that everything is OK with filesystem I/O?

I also see that the node stops responding until reboot:

Aug 9 21:56:41 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092808 Nodelist=sgiuv300-srcf-d10-01
Aug 10 09:51:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding
Aug 10 09:56:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding
Aug 10 10:01:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01 not responding

I'm also curious about the "Brain damage" reason: how do you detect that? I guess that's being done in an NHC script.

Aug 5 06:27:11 frankie slurmctld[23546]: update_node: node sgiuv300-srcf-d10-01 reason set to: REBOOTME: Brain damage.
Aug 5 06:27:11 frankie slurmctld[23546]: update_node: node sgiuv300-srcf-d10-01 state set to DRAINING

Finally, can you set this in slurm.conf?

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

Currently there are 2 mechanisms for enforcing memory: the task/cgroup and jobacctgather plugins.
The first one is enabled by defining ConstrainRAMSpace=yes or ConstrainSwapSpace=yes in cgroup.conf; it kills jobs when the cgroup memory stat exceeds the limit.

The second one is enabled by default: it kills whole jobs if MemLimitEnforce=yes, or individual steps if JobAcctGatherParams is not set to NoOverMemoryKill, by reading the /proc stats of all the pids of the job.

Having both mechanisms enabled can cause problems; we are going to address that in 18.08, but not in older versions. In your case you have both mechanisms enabled by default, so I would encourage you to set the values mentioned above and keep just the cgroup mechanism. I see some OOM kills, so you may eventually hit such a situation.
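For reference, the cgroup-only arrangement described above would look roughly like this (a sketch assembled from the description; the cgroup.conf lines are assumptions about the intended setup, not taken from the attached config):

```
# slurm.conf -- disable jobacctgather-based memory enforcement
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

# cgroup.conf -- keep only the cgroup-based enforcement
ConstrainRAMSpace=yes
```

With this, only task/cgroup kills over-limit jobs, so the two mechanisms can no longer race each other.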
(In reply to Felip Moll from comment #11)
> This could make me think about some I/O issues. What filesystems do you
> mount in this server? Can you check everything is ok about filesystem I/O?

There haven't been any I/O issues lately. The filesystems are all either local disk or NFS 3 and 4 mounts. As far as I can tell there were no mount issues, but I'll add that to the list of things for close inspection next time.

> I also see that the node stops responding until reboot:
> Aug 9 21:56:41 frankie slurmctld[121446]: Resending TERMINATE_JOB request
> JobId=3092808 Nodelist=sgiuv300-srcf-d10-01
> Aug 10 09:51:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
> Aug 10 09:56:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
> Aug 10 10:01:22 frankie slurmctld[121446]: error: Nodes sgiuv300-srcf-d10-01
> not responding
>
> Also curious about the "Brain damage", how do you detect that? I guess
> that's being done in NHC script.
>
> Aug 5 06:27:11 frankie slurmctld[23546]: update_node: node
> sgiuv300-srcf-d10-01 reason set to: REBOOTME: Brain damage.
> Aug 5 06:27:11 frankie slurmctld[23546]: update_node: node
> sgiuv300-srcf-d10-01 state set to DRAINING

As the number of zombies goes up, the node eventually reaches a state where it no longer starts jobs. But before then, if/when I notice a lot of zombies, I'll drain the node with the reason "REBOOTME: Brain damage" in the hope that some of the current jobs might finish OK. I have a cron job that looks for drained nodes matching "REBOOTME" and power cycles them. "Brain damage" is a purely subjective diagnosis, and fortunately for this batch of logs I had chosen a term that was safe for work.

> Finally, can you set this in slurm.conf?:
>
> MemLimitEnforce=no
> JobAcctGatherParams=NoOverMemoryKill

Done. OOM kills are the most frequent cause of death for jobs; it's primarily a genomics workload and it's difficult to predict memory usage for most of the codes we use.
> Done. OOM kills are the most frequent cause of death for jobs, it's
> primarily a genomics workload and it's difficult to predict memory usage for
> most codes we use.

John,

Has the situation improved in any way after changing the parameters mentioned above?

Setting the slurmctld and slurmd debug level to "debug" (and possibly some DebugFlags) could help; can you set that and check again? Tell me if you see any I/O, network (dropped packets?), or system errors at the same time, and run "sdiag" while it is happening.

Thanks
Hi Felip, It's been pretty quiet here and the workload has shifted to an assembler I don't think we were running before last week. Since the last reboot the node is completely stable. If/when this recurs, I'll try to correlate it to a specific user/job. Maybe the problem here is something some specific app and/or user is doing that kicks slurmstepd off the rails. jbh
(In reply to John Hanks from comment #14)
> Hi Felip,
>
> It's been pretty quiet here and the workload has shifted to an assembler I
> don't think we were running before last week. Since the last reboot the node
> is completely stable. If/when this recurs, I'll try to correlate it to a
> specific user/job. Maybe the problem here is something some specific app
> and/or user is doing that kicks slurmstepd off the rails.
>
> jbh

OK John, I'll keep this open for at least 2 weeks; let's see if it happens again.
I caught a zombie attack in progress. We currently have a job that is losing track of some forked R processes. Uploading the slurmd log (debug5) and ps and pstree output from the system.

jbh
Created attachment 7691 [details] slurmd log from node with zombies
Created attachment 7692 [details] pstree of processes on node
Created attachment 7693 [details] ps -aux of processes.
Here is a 'thread apply all bt full' from a slurmstepd that has lost a child to the zombie process.

(gdb) thread apply all bt full

Thread 7 (Thread 0x2ad2d158b700 (LWP 260138)):
#0  0x00002ad2d2698995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ad2d1719cce in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:227
        err = <optimized out>
        __func__ = "_watch_tasks"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x2ad2d7a72700 (LWP 260139)):
#0  0x00002ad2d2698d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002ad2d1713c91 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:204
        err = <optimized out>
        i = 4
        now = 1535060418
        diff = <optimized out>
        tvnow = {tv_sec = 1535055855, tv_usec = 835349}
        abs = {tv_sec = 1535060419, tv_nsec = 835349000}
        __func__ = "_timer_thread"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x2ad2d7b73700 (LWP 260140)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2ad2e00009f0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0xd325a0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2e00009f0
        map = 0x2ad2e0000950
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041cbd8 in _msg_thr_internal (job_arg=0xd47d30) at req.c:245
        job = 0xd47d30
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x2ad2e4d40700 (LWP 260141)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=3, pfds=0x2ad2e8000ac0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x2ad2e80008d0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2e8000ac0
        map = 0x2ad2e8000af0
        maxnfds = 2
        nfds = 3
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x00002ad2e4a2e2cc in _agent (unused=<optimized out>) at agent.c:326
        tree_listen_obj = <optimized out>
        task_obj = <optimized out>
        i = <optimized out>
        __func__ = "_agent"
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2ad2e4e41700 (LWP 260144)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2e4211962 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0xd8a3f0
        ret = <optimized out>
        res = 0
        fds = {{fd = 17, events = 3, revents = 0}, {fd = 18, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2ad2e4f42700 (LWP 260145)):
#0  0x00002ad2d299cf0d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002ad2d17a6f17 in _poll_internal (shutdown_time=<optimized out>, nfds=4, pfds=0x2ad2f0000910) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0xd4c9a0) at eio.c:328
        retval = 0
        pollfds = 0x2ad2f0000910
        map = 0x2ad2f00008d0
        maxnfds = 4
        nfds = 4
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x0000000000413bcb in _io_thr (arg=0xd47d30) at io.c:1479
        job = 0xd47d30
        set = {__val = {4097, 0 <repeats 15 times>}}
        rc = <optimized out>
#4  0x00002ad2d2694e25 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002ad2d29a7bad in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2ad2d1463000 (LWP 260097)):
#0  0x00002ad2d296e22a in wait4 () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000000000040e85f in _wait_for_any_task (job=job@entry=0xd47d30, waitflag=waitflag@entry=true) at mgr.c:2042
        t = <optimized out>
        rc = <optimized out>
        status = 0
        pid = <optimized out>
        completed = 0
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0},
          {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0},
          {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0},
          {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0},
          {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0},
          {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0},
          {ru_msgrcv = 4294836224, __ru_msgrcv_word = 4294836224},
          {ru_nsignals = 0, __ru_nsignals_word = 0},
          {ru_nvcsw = 214748364800, __ru_nvcsw_word = 214748364800},
          {ru_nivcsw = 0, __ru_nivcsw_word = 0}}
        tmp_env = <optimized out>
        task_offset = 0
        __func__ = "_wait_for_any_task"
#2  0x00000000004100b4 in _wait_for_all_tasks (job=0xd47d30) at mgr.c:2153
        rc = <optimized out>
        tasks_left = 1
        i = 0
#3  job_manager (job=job@entry=0xd47d30) at mgr.c:1350
        rc = 0
        io_initialized = true
        ckpt_type = 0xd2f170 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#4  0x000000000040ca57 in main (argc=1, argv=0x7ffc98d21018) at slurmstepd.c:172
        cli = 0xd33570
        self = 0xd402d0
        msg = 0xd3e3a0
        rc = 0
        launch_params = 0x0
        __func__ = "main"
(gdb)
Hi John, your logs have been helpful. I spent some time analyzing all your information and our code.

It seems clear that what's related to the issue here, as you already guessed, is that R processes die and remain in Z state because the parent process of R doesn't wait for them. Slurm doesn't seem to monitor these "dead" pids, but the jobacctgather plugin tries to access /proc/<pid>, and a lot (674791 matches in your slurm.log) of errors like this get logged:

slurm.log:[2018-08-23T14:13:15.848] [3360552.0] debug3: jobacct_gather_linux: unable to read requested pattern in /proc/261412/status

It accesses these status files because these pids are in the tree of the job's main pid.

I've taken as an example job 3360552 with multiple steps, one of which looks like:

|-slurmstepd,260097
| |-qiime,260146 /home/abganz/.conda/envs/qiime2/bin/qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --output-dir dada2-209 --p-n-threads 15 --p-trunc-len-f 283 --p-trunc-len-r 209
| | `-R,261285 --slave --no-restore --file=/home/abganz/.conda/envs/qiime2/bin/run_dada_paired.R --args /tmp/tmpwb3nmuzq/forward /tmp/tmpwb3nmuzq/reverse /tmp/tmpwb3nmuzq/output.tsv.biom /tmp/tmpwb3nmuzq/track.tsv /tmp/tmpwb3nmuzq/filt_f /tmp/tmpwb3nmuzq/filt_r 283 209 0 0 2.0 2 consensus 1.0 15 1000000
| |   |-(R,261412)
| |   |-{R},301848
| |   |-{R},301849
| |   |-{R},301850
| |   |-{R},301851
| |   |-{R},301852
| |   |-{R},301853
| |   |-{R},301854
| |   |-{R},301855
| |   |-{R},301856
| |   |-{R},301857
| |   |-{R},301858
| |   |-{R},301859
| |   |-{R},301860
| |   `-{R},301861
| |-{slurmstepd},260138
| |-{slurmstepd},260139
| |-{slurmstepd},260140
| |-{slurmstepd},260141
| |-{slurmstepd},260144
| `-{slurmstepd},260145

As you see, the parent of R is "qiime". Slurmstepd seems to wait for qiime and not for the sub-forked R processes. qiime doesn't seem to wait for 'R', and if one of these forked R processes dies, it becomes a zombie. Your ps aux for R process 261412 shows:

abganz 261412 83.8 0.0 0 0 ? Z 13:27 51:03 [R] <defunct>

Other processes related to this job are:

root 259214 0.0 0.0 385504 8188 ? Sl 13:24 0:02 slurmstepd: [3360552.extern]
root 259235 0.0 0.0 385904 8628 ? Sl 13:24 0:01 slurmstepd: [3360552.batch]
abganz 259254 0.0 0.0 113188 1568 ? S 13:24 0:00 /bin/bash /var/spool/slurm/slurmd/job3360552/slurm_script
root 260097 0.1 0.0 521056 8636 ? Sl 13:24 0:03 slurmstepd: [3360552.0]

One cause of your system being stuck could be that qiime doesn't finish, so the job continues and the Z processes keep increasing. At a certain point the kernel process table may become full, since the Z pids are not removed from the system if no wait call is issued, and since the qiime process continues working, the job doesn't die and the zombies are not inherited by init.

From "man 2 wait":

  A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(1), which automatically performs a wait to remove the zombies.

This is my first hypothesis; I will now try to reproduce this situation, but in any case I'd strongly suggest that your user abganz fix the qiime software. Have you tried 'qiime' outside of Slurm? Does it generate Z processes?

This theory doesn't directly explain the following logged errors:

Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd D ffff96cac7dd0000 0 377005 1 0x00000080
Aug 9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***

But given that the kernel process table seems to be filling up, we cannot rule out a relation.
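The wait(2) behavior quoted above is easy to demonstrate outside of Slurm. In this sketch, an exec'd sleep stands in for a parent (like qiime) that never calls wait, so its exited child lingers in Z state until the parent itself dies:

```shell
# The subshell backgrounds a short-lived child, then exec-replaces itself
# with a longer sleep. That sleep inherits the child but never wait()s,
# so once the child exits it stays a zombie until the parent terminates.
bash -c 'sleep 1 & exec sleep 5' &
parent=$!
sleep 2                        # give the child time to exit and become a zombie
ps -o pid,ppid,stat,comm --ppid "$parent"   # the child shows STAT "Z"
wait "$parent"                 # once the parent exits, init reaps the zombie
```

The zombie consumes a process-table slot the whole time, which is exactly why a long-lived non-waiting parent accumulating children can eventually wedge a node.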
We have cycled back to the ugly part of this. These qiime jobs (at least the ones I have been involved with) run fine on our "normal" nodes; this problem just occurs on the UV300. And qiime is just today's example: previous problems were with different applications, and the abandoned children were the sleep 1000000 processes that slurmstepd starts.

We've had problems before which stemmed from something counting up the CPUs on the UV300 and not being prepared to handle the large number. Is there anything about slurmstepd or slurmd for which 720 threads is a corner case? We ran into this with ZFS at one point and had to force it to use 128 instead of the actual CPU count. Just a thought.

jbh
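To rule a CPU-count corner case in or out (as with the ZFS case above), it may be worth comparing what the usual interfaces report on the UV300; a trivial sketch:

```shell
# Compare the CPU counts reported by different interfaces; on a healthy
# bare-metal system all three normally agree (720 on the UV300 here).
getconf _NPROCESSORS_ONLN
nproc
grep -c '^processor' /proc/cpuinfo
```

A mismatch (e.g. due to CPU affinity or offlined CPUs) would be a hint that some consumer of the count is seeing an unexpected value.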
> Is there
> anything about slurmstepd or slurmd for which 720 threads is a corner case?
> We ran into this with ZFS at one point and had to force it to use 128
> instead of the actual CPU count. Just a thought.

I will look into that, but it seems improbable. What OS/kernel version are you on?
(In reply to Felip Moll from comment #23)
>
> What OS/Kernel version are you on?

Stock CentOS 7.5, 3.10.0-862.3.2.el7.x86_64.

One other thing to note: those previous R zombies seem to have cleaned themselves up over the weekend. There are no zombie processes on the node now. I'm not sure if the previously observed sleep zombies would have done this; in those cases the node eventually ground to a halt and had to be rebooted.

jbh
(In reply to John Hanks from comment #24)
> (In reply to Felip Moll from comment #23)
> >
> > What OS/Kernel version are you on?
>
> Stock CentOS 7.5 3.10.0-862.3.2.el7.x86_64
>
> One other thing to note, those previous R zombies seem to have cleaned
> themselves up over the weekend. There are no zombie processes on the node
> now. I'm not sure if the previously observed sleep zombies would have done
> this, in those cases the node eventually ground to a halt and had to be
> rebooted.
>
> jbh

I am starting to think that the zombies are just a side effect of another issue and may misguide us in the diagnosis. A zombie process should not be harmful as long as the main process ends correctly: if the main process does a wait() this is fine, and if not, init will do the wait and clean up the zombies.

It seems the zombies are being created for some bad reason, but the real problem is that slurmstepd doesn't end. This is still under investigation, but bug 5103 looks very similar to what's happening to you regarding the hung slurmstepd processes, and there are a lot of related bugs, all of them happening under RHEL 7.[4-5] and glibc 2.17-xxx. I guess your glibc is 2.17-222.
Yes, our glibc is glibc-2.17-222.el7.x86_64.
After looking deeper into your logs, I see that there does not seem to be any real hung slurmstepd.

I correlated the pstree slurmstepd processes (not LWPs) with ps aux, which gave me 145 slurmstepd processes corresponding to extern, batch and .0 steps. After that I grabbed the jobid of each extern step and correlated it with the job_script processes found in ps aux. All of them had one running script, except for job 3360118, which pstree showed to be an interactive bash that was running "less". There were 71 jobs.

This means that all jobs in your output were running. If you think it isn't this way, please show me the exact hung process and job id, because I am not seeing it.

Also, the bt that you show in comment 20 is normal for a .0 step. In your first comment in this bug you do a "pstree -a -c -p -S 421660", but the logged error "NOT ENDING WITH SIGNALS" refers to an extern step with pid 377480. Can you please define what you mean by "a slurmstepd that has lost a child to the zombie process"?

Note that a job may have 3 or more slurmstepds. For example, this corresponds to one single job in my system, and to the extern, batch and .0 steps:

|-slurmstepd(10412)-+-sleep(10416)
|                   |-{acctg_prof}(10414)
|                   |-{acctg}(10413)
|                   `-{slurmstepd}(10415)
|-slurmstepd(10419)-+-slurm_script(10423)---srun(10424)-+-srun(10427)
|                   |                                   |-{srun}(10428)
|                   |                                   |-{srun}(10429)
|                   |                                   |-{srun}(10430)
|                   |                                   `-{srun}(10431)
|                   |-{acctg_prof}(10421)
|                   |-{acctg}(10420)
|                   `-{slurmstepd}(10422)
|-slurmstepd(10435)-+-sleep(10441)
|                   |-{acctg_prof}(10437)
|                   |-{acctg}(10436)
|                   |-{slurmstepd}(10438)
|                   |-{slurmstepd}(10439)
|                   `-{slurmstepd}(10440)

Sorry for starting again from the beginning, but I need to have everything clearly identified.
In the meantime, you may do these checks:

a) Please do a: 'grep -r "possible file system problem" slurm.log'; you will see lots of:

[2018-08-23T08:46:52.792] Warning: slurmstepd startup took XX sec, possible file system problem or full memory

which reinforces the theory of I/O problems. Maybe it is just a warning, but there's a considerable slowdown at some point, as you will see.

b) Are the system limits tuned properly, especially nr_open_files? With systemd the slurmd limits can change; take a look at /proc/$(pidof slurmd)/limits

Check especially these pages if you haven't done so:
https://slurm.schedmd.com/high_throughput.html
https://slurm.schedmd.com/big_sys.html

c) In comment 3 you said:

"but leave behind the zombie sleeps or they stay in Running state forever, even after the slurmstepd has been killed (again leaving behind an optional zombie sleep.)"

Do you always see them in squeue? I need a clearer case; when you identify one I need:

- jobid
- squeue
- pstree -p
- ps aux
- thread apply all bt full of the related slurmstepd cases, meaning the processes found by checking ps aux | grep $jobid corresponding to the extern, batch and .0 steps, provided that the user task is still not running.

Thanks,
John
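As a footnote to check (b), the /proc/<pid>/limits file can also be parsed programmatically if you want to snapshot it regularly. A sketch (demonstrated on the current process, since the slurmd pid is node-specific; the kernel prints each row as a fixed-width record with a 25-character limit-name column):

```python
import os

# Parse /proc/<pid>/limits, the same data as the
# /proc/$(pidof slurmd)/limits check above. The kernel emits each row
# as "%-25s %-20s %-20s %-10s": name, soft limit, hard limit, units.
def read_limits(pid):
    limits = {}
    with open("/proc/%d/limits" % pid) as f:
        next(f)                          # skip the header row
        for line in f:
            name = line[:26].strip()     # fixed-width name column
            soft, hard = line[26:].split()[:2]
            limits[name] = (soft, hard)
    return limits

lim = read_limits(os.getpid())
print("Max open files:", lim["Max open files"])
```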
(In reply to Felip Moll from comment #28)
> This means that all jobs in your output were running. If you think it isn't
> this way, please, show me the exact hung process and job id, because I am
> not seeing it.

This is expanding my understanding of zombies. I'm still not sure how the slurmstepd losing track of its children is OK, even if they do eventually terminate successfully, but since these jobs did finish, that makes them unlike the previous jobs, which left behind the sleep processes and slurmstepds in wait() or in a state where I couldn't attach with gdb.

> Can you please define what do you mean by "a slurmstepd that has lost a
> child to the zombie process"?

In the originally observed cases with sleep, the processes did not eventually clear up, and the associated jobs remained in Running state with a sleep zombie and sometimes a hung slurmstepd. Eventually the node would report it was fully utilized but there wouldn't be any user processes running on it, at which point we'd reboot it; jobs would requeue and run/complete normally until the next time it happened.

> Sorry for starting again from the beginning, but I need to have everything
> clearly identified.

No worries. I may end up having to look back in time to correlate this with a specific user's jobs if it doesn't appear again.

> In the meantime, you may do this checks:
>
> a) Please, do a: 'grep -r "possible file system problem" slurm.log', you
> will see lots of:
>
> [2018-08-23T08:46:52.792] Warning: slurmstepd startup took XX sec, possible
> file system problem or full memory
>
> which reinforces the theory of I/O problems, maybe is just a warning but
> there's a considerable slowdown at some point as you will see.

I see that message logged a lot on this node but not necessarily correlated with any of its reboots or kernel panics. I don't see it logged much around previous reboots associated with this problem.
It's extremely rare on any other node, so I'm not sure I can actually blame the storage. I'll do some testing to see if there really is an I/O problem here, but since I recently used this node specifically to do a bunch of I/O-heavy tasks without any problems, again, I'm not sure I can blame storage for the slowdown in slurmstepd startup. I don't mind blaming SGI, err, I mean, HP.

> b) Are the system limits tuned properly, specially nr_open_files? With
> systemd it can change the slurmd limits, take a look at /proc/$(pidof
> slurmd)/limits

[root@sgiuv300-srcf-d10-01 ~]# cat /proc/sys/fs/file-max
1013864338
[root@sgiuv300-srcf-d10-01 ~]# cat /etc/systemd/system/slurmd.service | grep Limit
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             39993724             39993724             processes
Max open files            51200                51200                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       39993724             39993724             signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Just a few days ago we ran some jobs that produced > 14000 processes/threads and load > 3000 without hitting any limit problems (or having those jobs produce any zombies.)

> Check specially these pages if you haven't done so:
> https://slurm.schedmd.com/high_throughput.html
> https://slurm.schedmd.com/big_sys.html

It's been a while since I went through these, but I will revisit them in case something has changed or I have missed a critical bit.
> c) In comment 3 you said:
>
> "but leave behind the zombie sleeps or they stay in Running state forever,
> even after the slurmstepd has been killed (again leaving behind an optional
> zombie sleep.)"
>
> Do you always see them in squeue? I need a more clear case, when you
> identify one I need:
>
> - jobid
> - squeue
> - pstree -p
> - ps aux
> - thread apply all bt full of the related slurmstepd cases, this means the
> processes of checking ps aux|grep $jobid corresponding to extern, batch, .0
> steps, provided that the user task is still not running.

It looks like I need to either look back in time, find a specific set of jobs this is correlated with, and build a reproducer, or wait until the problem re-occurs and let it reach the point of stopping the node before collecting more info. If you happen to have an epiphany, please send it our way; otherwise, consider the ball in my court and leave this on the back burner until I have found a way to reproduce.

Thanks, and sorry for the wild goose chase with the R processes, although I have enjoyed learning more about how the kernel handles orphaned processes.

jbh
(In reply to John Hanks from comment #29)
> (In reply to Felip Moll from comment #28)
> > This means that all jobs in your output were running. If you think it isn't
> > this way, please, show me the exact hung process and job id, because I am
> > not seeing it.
>
> This is expanding my understanding of zombies. I'm still not sure how the
> slurmstep losing track of it's children is ok, even if they do eventually
> terminate successfully, but since these jobs did finish that makes them
> unlike the previous jobs which left behind the sleep processes and
> slurmstepd's in wait() or in a state where I couldn't attach with gdb.

Keep in mind that slurmstepd will only wait4() for the processes that are immediately created by it. This means that if a process does fork(), the subprocesses will not be waited for by slurmstepd. This was the case in comment 21, where the stepd would wait4() for qiime but not for R and its threads. So zombies can certainly occur.

|-slurmstepd,260097
|   |-qiime,260146 /home/abganz/.conda/envs/qiime2/bin/qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --output-dir dada2-209 --p-n-threads 15 ...
|   |   `-R,261285 --slave --no-restore --file=/home/abganz/.conda/envs/qiime2/bin/run_dada_paired.R --args /tmp/tmpwb3nmuzq/forward /tmp/tmpwb3nmuzq/reverse /tmp/tmpwb...
|   |   |-(R,261412)
|   |   |-{R},301848
|   |   |-{R},301849

Of course, the sleep process must be waited for, so if you tell me you saw zombie sleeps around, that could be because slurmstepd died at some point, i.e. related to something like this, which should be investigated:

Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds.
Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd D ffff96cac7dd0000 0 377005 1 0x00000080

Maybe it was a particular situation?
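The "only immediate children" rule can be demonstrated with a small Python sketch (hypothetical demo, not Slurm code): a grandparent can reap its own child, but a grandchild forked by that child is simply not waitable from the grandparent.

```python
import os
import time

# A "stepd"-like parent can only wait4() its immediate child (the
# task). If the task fork()s a helper of its own -- like the sleep or
# gzip processes in this thread -- the helper is invisible to the
# grandparent's wait(): there is no pid it is allowed to reap.
task = os.fork()
if task == 0:                    # the task, immediate child
    helper = os.fork()
    if helper == 0:              # the helper (grandchild)
        os._exit(0)
    time.sleep(0.5)              # task stays alive WITHOUT waiting,
    os._exit(0)                  # so the dead helper is a zombie meanwhile

# the grandparent can reap the task...
pid, status = os.waitpid(task, 0)
assert pid == task

# ...but has no further child to wait for: the helper was never ours
try:
    os.waitpid(-1, 0)
    reaped_grandchild = True
except ChildProcessError:
    reaped_grandchild = False

print("grandchild reapable by grandparent:", reaped_grandchild)  # False
```

Once the intermediate task exits, the helper is re-parented and reaped by init, never by the original waiter.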
> Eventually the node would be
> reporting it was fully utilized but there wouldn't be any user processes
> running on it, at which point we'd reboot it, jobs would requeue and
> run/complete normally until the next time it happened.

If a lot of processes created a lot of zombies, the kernel's process table could fill up and leave the OS unable to create new processes. Having a lot of jobs that create a lot of threads that generate zombies could have this effect. The question is: if the user application is waiting correctly, why are the zombies created? It may be worth taking a look at the user's code. Don't discard a kernel bug either.

> It looks like I need to either look back in time and find a specific set of
> jobs this is correlated with and build a reproducer or wait until the
> problem re-occurs and let it reach the point of stopping the node before
> collecting more info. If you happen to have an epiphany, please send it our
> way otherwise consider the ball in my court and leave this on the back
> burner until I have found a way to reproduce.

Cool, I agree this is the correct way forward for now. I'll be waiting for fresh news. Btw, from Thursday I am on vacation until 9th Sept, but my colleagues will be looking at the bug in the meantime.

> Thanks and sorry for the wild goose chase with the R processes although I
> have enjoyed learning more about how the kernel handles orphaned processes.

No worries! I am also enjoying this case; it forced me to refresh my knowledge on the matter :)
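Coming back to the process-table point above: a quick way to watch for zombie accumulation on a node is to scan /proc directly and compare against pid_max. A sketch (assumes a Linux /proc filesystem):

```python
import os

# Every zombie still occupies a slot in the kernel's process table, so
# enough of them can exhaust pid_max and make fork() fail for everyone.
# A quick census of Z-state processes, read straight from /proc:
with open("/proc/sys/kernel/pid_max") as f:
    pid_max = int(f.read())

zombies = 0
for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue                     # not a pid directory
    try:
        with open("/proc/%s/stat" % entry) as f:
            # state field follows the parenthesized command name
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state == "Z":
            zombies += 1
    except OSError:
        pass                         # process exited mid-scan

print("pid_max=%d zombies=%d" % (pid_max, zombies))
```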
Hi John, I assume you still haven't found a reproducer. I think that at this point, and after more than two weeks, the best option is to mark the bug as INFOGIVEN, and let you reopen it and mark it as UNCONFIRMED again if at some point you are able to get fresh info. What do you think?
Hi Felip, We've moved on to 18.08.0 now and upgraded to the latest CentOS kernel, and so far haven't seen this reappear. It's safe to close this, and if I ever find a reproducer for it, I know where to find you. Thanks, jbh
Very good, I am closing this issue now. Regards!
And, we're back. Another recurrence of the slurm(step)d losing track of its children; this time a gzip process went astray. The view from top:

top - 08:51:48 up 16:45,  2 users,  load average: 84.28, 84.62, 86.74
Tasks: 3581 total,   6 running, 3574 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.9 us,  0.1 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 10238839+total, 82755717+free, 38465574+used, 15786117+buff/cache
KiB Swap: 13674601+total, 13674505+free,    95744 used. 98321561+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
173479 kameronr  20   0  113180   1560   1312 S   0.0  0.0   0:00.03 slurm_script
173733 kameronr  20   0 3017240   2.7g   2396 D   0.0  0.0  11:00.28 perl
174988 kameronr  20   0       0      0      0 Z   0.0  0.0   0:00.62 gzip
175000 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:38.84 perl
175001 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:20.54 perl
175003 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:04.20 perl
175006 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:12.28 perl
175007 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  11:13.32 perl
175010 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:08.07 perl
175013 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:21.57 perl
175015 kameronr  20   0 3017240   2.7g    548 D   0.0  0.0  10:41.05 perl
175016 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:26.48 perl
175019 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:36.88 perl
175024 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:27.69 perl

This job thinks it is in Running state, but I believe it will never finish because gzip is still hanging around:

[root@sgiuv300-srcf-d10-01 foundation-2.19]# pstree -u kameronr
slurm_script───perl─┬─gzip
                    └─11*[perl]

In dmesg I see this, but I'm pretty sure it happened before the job above was started.
Just putting it here for the sake of completeness: [Tue Nov 13 16:18:05 2018] ------------[ cut here ]------------ [Tue Nov 13 16:18:05 2018] WARNING: CPU: 350 PID: 42860 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10 [Tue Nov 13 16:18:05 2018] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s ys_fops ipmi_devintf vfio_iommu_type1 vfio nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_c onntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_serv ice_time dm_multipath ipmi_si ipmi_msghandler sunrpc dm_mirror dm_region_hash dm_log dm_mod sr_mod cdrom intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kv m iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcs pkr joydev [Tue Nov 13 16:18:05 2018] mxm_wmi video ttm uas usb_storage nvme drm sg nvme_core i2c_i801 lpc_ich ioatdma shpchp wmi binfmt_misc ip_tables mpt3sas raid_class sc si_transport_sas i40e igb i2c_algo_bit dca ptp pps_core i2c_core megaraid_sas ahci libahci libata sd_mod crc_t10dif crct10dif_common [last unloaded: ipmi_devintf] [Tue Nov 13 16:18:05 2018] CPU: 350 PID: 42860 Comm: slurmstepd Tainted: P OE ------------ 3.10.0-862.14.4.el7.x86_64 #1 [Tue Nov 13 16:18:05 2018] Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 06/15/2018 [Tue Nov 13 16:18:05 2018] Call Trace: [Tue Nov 13 16:18:05 2018] [<ffffffffa3d13754>] dump_stack+0x19/0x1b [Tue Nov 13 16:18:05 2018] [<ffffffffa36945d8>] __warn+0xd8/0x100 [Tue Nov 13 16:18:05 2018] [<ffffffffa369471d>] warn_slowpath_null+0x1d/0x20 [Tue Nov 13 16:18:05 2018] [<ffffffffa3725f0a>] cpuset_write_resmask+0x80a/0xa10 [Tue Nov 13 16:18:05 2018] [<ffffffffa371b020>] ? 
cgroup_rightmost_descendant+0x80/0x80 [Tue Nov 13 16:18:05 2018] [<ffffffffa3725700>] ? update_cpumasks_hier+0x410/0x410 [Tue Nov 13 16:18:05 2018] [<ffffffffa371d411>] cgroup_file_write+0x1d1/0x2d0 [Tue Nov 13 16:18:05 2018] [<ffffffffa38221c8>] ? __sb_start_write+0x58/0x110 [Tue Nov 13 16:18:05 2018] [<ffffffffa38d5437>] ? security_file_permission+0x27/0xa0 [Tue Nov 13 16:18:05 2018] [<ffffffffa381f240>] vfs_write+0xc0/0x1f0 [Tue Nov 13 16:18:05 2018] [<ffffffffa382006f>] SyS_write+0x7f/0xf0 [Tue Nov 13 16:18:05 2018] [<ffffffffa3d2579b>] system_call_fastpath+0x22/0x27 [Tue Nov 13 16:18:05 2018] ---[ end trace 1a3d8cbb1a4bc6dd ]--- And a bt from the guilty slurmstepd: (gdb) thread apply all bt full Thread 5 (Thread 0x2b02a998b700 (LWP 173420)): #0 0x00002b02aaacb995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b02a9b46433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369 err = <optimized out> __func__ = "_watch_tasks" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2b02b08b4700 (LWP 173421)): #0 0x00002b02aaacbd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b02a9b407e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 err = <optimized out> i = 4 now = 1542214614 diff = <optimized out> tvnow = {tv_sec = 1542162036, tv_usec = 11681} abs = {tv_sec = 1542214615, tv_nsec = 11681000} __func__ = "_timer_thread" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. 
Thread 3 (Thread 0x2b02b09b5700 (LWP 173422)): #0 0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b02a9bd99d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b02b40008d0) at eio.c:364 n = <optimized out> timeout = -1 #2 eio_handle_mainloop (eio=0x1fc9350) at eio.c:328 retval = 0 pollfds = 0x2b02b40008d0 map = 0x2b02b4000900 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #3 0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1fe2e20) at req.c:273 job = 0x1fe2e20 #4 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #5 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2b02b18e8700 (LWP 173424)): #0 0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b02b0fd0942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478 args = 0x20f0b70 ret = <optimized out> res = 0 fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}} __func__ = "_oom_event_monitor" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. Hopefully something in all that has a useful clue. Checking with user to see if it's ok to leave this job hanging around for troubleshooting. Best, jbh
Sorry, bt got truncated in last comment, full is here: (gdb) thread apply all bt full Thread 5 (Thread 0x2b02a998b700 (LWP 173420)): #0 0x00002b02aaacb995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b02a9b46433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369 err = <optimized out> __func__ = "_watch_tasks" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2b02b08b4700 (LWP 173421)): #0 0x00002b02aaacbd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b02a9b407e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 err = <optimized out> i = 4 now = 1542214614 diff = <optimized out> tvnow = {tv_sec = 1542162036, tv_usec = 11681} abs = {tv_sec = 1542214615, tv_nsec = 11681000} __func__ = "_timer_thread" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2b02b09b5700 (LWP 173422)): #0 0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b02a9bd99d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b02b40008d0) at eio.c:364 n = <optimized out> timeout = -1 #2 eio_handle_mainloop (eio=0x1fc9350) at eio.c:328 retval = 0 pollfds = 0x2b02b40008d0 map = 0x2b02b4000900 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #3 0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1fe2e20) at req.c:273 job = 0x1fe2e20 #4 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#5 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2b02b18e8700 (LWP 173424)): #0 0x00002b02aadcff0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b02b0fd0942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478 args = 0x20f0b70 ret = <optimized out> res = 0 fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}} __func__ = "_oom_event_monitor" #2 0x00002b02aaac7e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. ---Type <return> to continue, or q <return> to quit--- #3 0x00002b02aaddabad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2b02a9889000 (LWP 173411)): #0 0x00002b02aada122a in wait4 () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x000000000040f246 in _wait_for_any_task (job=job@entry=0x1fe2e20, waitflag=waitflag@entry=true) at mgr.c:2116 t = <optimized out> rc = <optimized out> status = 0 pid = <optimized out> completed = 0 jobacct = <optimized out> rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, { ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, { ru_nvcsw = 6151706, __ru_nvcsw_word = 6151706}, {ru_nivcsw = 47290437680538, __ru_nivcsw_word = 47290437680538}} tmp_env = <optimized out> task_offset = 0 __func__ = "_wait_for_any_task" #2 0x0000000000410cb3 in _wait_for_all_tasks (job=0x1fe2e20) at mgr.c:2227 rc = <optimized out> tasks_left = 1 i = 0 #3 job_manager 
(job=job@entry=0x1fe2e20) at mgr.c:1424 rc = 0 io_initialized = true ckpt_type = 0x1fc93c0 "checkpoint/none" err_msg = 0x0 __func__ = "job_manager" #4 0x000000000040d3ea in main (argc=1, argv=0x7ffc18229038) at slurmstepd.c:174 cli = 0x1fc9ff0 self = 0x0 msg = 0x1fc6bc0 rc = 0 launch_params = 0x0 __func__ = "main" (gdb)
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>
> 173479 kameronr  20   0  113180   1560   1312 S   0.0  0.0   0:00.03 slurm_script
>
> 173733 kameronr  20   0 3017240   2.7g   2396 D   0.0  0.0  11:00.28 perl
>
> 174988 kameronr  20   0       0      0      0 Z   0.0  0.0   0:00.62 gzip
>
> 175000 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:38.84 perl
>
> 175001 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:20.54 perl
>
> 175003 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:04.20 perl
>
> 175006 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:12.28 perl
>
> 175007 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  11:13.32 perl
>
> 175010 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:08.07 perl
>
> 175013 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:21.57 perl
>
> 175015 kameronr  20   0 3017240   2.7g    548 D   0.0  0.0  10:41.05 perl
>
> 175016 kameronr  20   0 3017240   2.7g    536 D   0.0  0.0  10:26.48 perl
>
> 175019 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:36.88 perl
>
> 175024 kameronr  20   0 3017240   2.7g    544 D   0.0  0.0  10:27.69 perl

Wow, that's absolutely interesting. The slurm_script is run from slurmstepd (or it should be; a complete pstree -psal as root would help even more), and it hasn't finished. This means that perl hasn't finished correctly, and we see it in D state. Gzip is in Z state, which indicates that perl hasn't waited for it.

> [root@sgiuv300-srcf-d10-01 foundation-2.19]# pstree -u kameronr
> slurm_script───perl─┬─gzip
>                     └─11*[perl]

It seems the perl script has had some problem during its execution... it is curious that now it is perl, where before it was R software. I am pretty sure the provided kernel stack trace is somewhat related. There may be a bug in kernel cgroups or similar: cpuset_write_resmask() is the kernel function used to modify a cpu or mem cgroup file, and it is indirectly used by Slurm when adding/removing tasks to the cgroup or setting limits.

Does the kernel dump coincide with the start time of the job, or similar? I think we should investigate this path.
Created attachment 8302 [details] pstree -psal
pstree -psal has been attached.

That kernel stack trace happened two hours before the job in question started. The user cancelled the job and it got stuck in CG state for a while; eventually SLURM cleared it and marked the node "draining Kill task failed". The processes are still hanging around, though, if you want anything else collected from them.

jbh
Rebooted the node and it looks like the kernel backtrace is consistently happening (once?) at every boot when a job starts:: [ 4157.609389] ------------[ cut here ]------------ [ 4157.609428] WARNING: CPU: 310 PID: 232813 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10 [ 4157.609431] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ipmi_devintf vfio_iommu_type1 vfio nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_service_time dm_multipath ipmi_si ipmi_msghandler sunrpc dm_mirror dm_region_hash dm_log dm_mod sr_mod cdrom intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr joydev [ 4157.609517] mxm_wmi video ttm uas usb_storage drm nvme sg nvme_core i2c_i801 lpc_ich ioatdma shpchp wmi binfmt_misc ip_tables mpt3sas raid_class scsi_transport_sas i40e igb i2c_algo_bit dca ptp pps_core i2c_core megaraid_sas ahci libahci libata sd_mod crc_t10dif crct10dif_common [last unloaded: ipmi_devintf] [ 4157.609564] CPU: 310 PID: 232813 Comm: slurmstepd Tainted: P OE ------------ 3.10.0-862.14.4.el7.x86_64 #1 [ 4157.609567] Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 06/15/2018 [ 4157.609570] Call Trace: [ 4157.609590] [<ffffffff99313754>] dump_stack+0x19/0x1b [ 4157.609609] [<ffffffff98c945d8>] __warn+0xd8/0x100 [ 4157.609613] [<ffffffff98c9471d>] warn_slowpath_null+0x1d/0x20 [ 4157.609616] [<ffffffff98d25f0a>] cpuset_write_resmask+0x80a/0xa10 [ 4157.609634] [<ffffffff98d1b020>] ? cgroup_rightmost_descendant+0x80/0x80 [ 4157.609668] [<ffffffff98d25700>] ? 
update_cpumasks_hier+0x410/0x410 [ 4157.609672] [<ffffffff98d1d411>] cgroup_file_write+0x1d1/0x2d0 [ 4157.609689] [<ffffffff98e221c8>] ? __sb_start_write+0x58/0x110 [ 4157.609729] [<ffffffff98ed5437>] ? security_file_permission+0x27/0xa0 [ 4157.609758] [<ffffffff98e1f240>] vfs_write+0xc0/0x1f0 [ 4157.609779] [<ffffffff98e2006f>] SyS_write+0x7f/0xf0 [ 4157.609811] [<ffffffff9932579b>] system_call_fastpath+0x22/0x27 [ 4157.609830] ---[ end trace f3345453c14bf075 ]---
Another job with some zombies, bt: (gdb) thread apply all bt full Thread 5 (Thread 0x2afcac39c700 (LWP 233196)): #0 0x00002afcad4dc995 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002afcac557433 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:369 err = <optimized out> __func__ = "_watch_tasks" #2 0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2afcb32c5700 (LWP 233197)): #0 0x00002afcad4dcd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002afcac5517e6 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 err = <optimized out> i = 4 now = 1542298269 diff = <optimized out> tvnow = {tv_sec = 1542240723, tv_usec = 671182} abs = {tv_sec = 1542298270, tv_nsec = 671182000} __func__ = "_timer_thread" #2 0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2afcb33c6700 (LWP 233198)): #0 0x00002afcad7e0f0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002afcac5ea9d7 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2afcb40008d0) at eio.c:364 n = <optimized out> timeout = -1 #2 eio_handle_mainloop (eio=0x1277a30) at eio.c:328 retval = 0 pollfds = 0x2afcb40008d0 map = 0x2afcb4000900 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #3 0x000000000041d7a0 in _msg_thr_internal (job_arg=0x1179e20) at req.c:273 job = 0x1179e20 #4 0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#5 0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2afcb8302700 (LWP 233213)): #0 0x00002afcad7e0f0d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002afcb39e1942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478 args = 0x11ba950 ret = <optimized out> res = 0 fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}} __func__ = "_oom_event_monitor" #2 0x00002afcad4d8e25 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002afcad7ebbad in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2afcac29a000 (LWP 233191)): #0 0x00002afcad7b222a in wait4 () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x000000000040f246 in _wait_for_any_task (job=job@entry=0x1179e20, waitflag=waitflag@entry=true) at mgr.c:2116 ---Type <return> to continue, or q <return> to quit--- t = <optimized out> rc = <optimized out> status = 0 pid = <optimized out> completed = 0 jobacct = <optimized out> rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, { ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 18325024, __ru_msgrcv_word = 18325024}, { ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 6152281, __ru_nvcsw_word = 6152281}, {ru_nivcsw = 47264711986586, __ru_nivcsw_word = 47264711986586}} tmp_env = <optimized out> task_offset = 0 __func__ = "_wait_for_any_task" #2 0x0000000000410cb3 in _wait_for_all_tasks (job=0x1179e20) at mgr.c:2227 rc = <optimized out> tasks_left = 1 i = 0 #3 job_manager 
(job=job@entry=0x1179e20) at mgr.c:1424 rc = 0 io_initialized = true ckpt_type = 0x1154320 "checkpoint/none" err_msg = 0x0 __func__ = "job_manager" #4 0x000000000040d3ea in main (argc=1, argv=0x7ffc763ac398) at slurmstepd.c:174 cli = 0x1160ff0 self = 0x0 msg = 0x115dbc0 rc = 0 launch_params = 0x0 __func__ = "main" (gdb)
Created attachment 8309 [details] pstree -psal
John,

After looking at the kernel code I see that the kernel stack trace is dumped in a call to:

WARN_ON(!is_in_v2_mode() && !nodes_equal(cp->mems_allowed, cp->effective_mems));

This may indicate one of two failures: either the nodes_equal comparison is failing, which could mean a corrupted bitmap, or your cpuset cgroup is not operating in v2 mode. Operating in v1 mode shouldn't be a concern on normal systems, but since this is a special SGI UV300 machine, I don't know how it manages cpus.

Quoting the RHEL Release Notes:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.5_release_notes/index

Cpuset is now able to restore the effective CPU mask after a pair of offline and online events

Prior to this update, the cpuset filesystem, which confines processes to processor and memory node subsets, had one bitmap set enabled for CPUs used in a cpuset. As a consequence, a CPU offline event followed by a CPU online event caused the affected CPU to be removed from all non-root cpusets. With this update, cpuset has two bitmap sets enabled. As a result, cpuset is now able to properly track CPU online or offline events to restore the effective CPU mask as long as the -o cpuset_v2_mode mount flag is used when mounting cpuset cgroup. (BZ#947004)

I am not saying this is the exact cause, but it could be, given your high number of cpus (720), how SGI deals with them, and the usage of cpusets by Slurm:

[2018-08-23T08:47:43.950] [3358358.extern] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '32,392,0-23,25-31,33-42,44,46-51,360-383,385-391,393-402,404,406-411' for '/sys/fs/cgroup/cpuset/slurm/uid_327881'

Would you be able to do the test and try mounting the cpuset cgroup with this option, to see whether after a reboot you still get the kernel dump and/or the issue?

Thanks,
Felip M
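As a quick way to verify the mount options from the node itself, something like this could be used (a sketch; the cpuset_v2_mode option name is taken from the RHEL notes above, and the script simply prints nothing on a machine without a cgroup-v1 cpuset mount):

```python
# Read /proc/mounts to see how the cpuset controller is mounted and
# whether the cpuset_v2_mode mount option is active.
cpuset_mounts = []
with open("/proc/mounts") as f:
    for line in f:
        dev, mountpoint, fstype, opts = line.split()[:4]
        optset = opts.split(",")
        if fstype == "cgroup" and "cpuset" in optset:
            cpuset_mounts.append((mountpoint, "cpuset_v2_mode" in optset))

for mountpoint, v2_mode in cpuset_mounts:
    print(mountpoint, "cpuset_v2_mode" if v2_mode else "legacy v1 mode")
```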
In addition to my last comment, it would also be good to know whether you have any particular configuration for SGI cpusets. I don't know these machines in depth, but I remember that you can configure different cpusets for different purposes. Can you check this machine's cpuset configuration, if any?
Our cgroups config is whatever CentOS 7.5 does out of the box, which appears to be v1 configured via systemd. As far as I can tell, v2 is not an option:

[root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
[root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
mount: unknown filesystem type 'cgroup2'

The related mounts are:

[root@dper7425-srcf-d10-37 ~]# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)

We are not tied to this config, so if it needs to be modified to test anything, happy to do it.

We've also reproduced this outside of SLURM, on a different node (a Dell PE 7425 with AMD EPYC CPUs), at least for this particular application. It doesn't look like the app itself cares that the gzip process is a zombie, but we'll need to let it run another day or two to be sure it finishes correctly.
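For what it's worth, the "unknown filesystem type" error is consistent with the kernel simply lacking cgroup2 support (the unified hierarchy landed upstream in kernel 4.5; RHEL/CentOS 7 runs 3.10). A quick, hedged way to confirm what a given kernel offers:

```shell
# Filesystem types the running kernel can mount. On a RHEL 7 (3.10) kernel
# you should see 'cgroup' but not 'cgroup2', which is why the mount above
# fails; on a modern kernel both appear.
grep cgroup /proc/filesystems
```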
Since previous occurrences involved both the sleep processes started by SLURM and application processes from jobs, maybe we have a generic problem here that SLURM just highlights, because things go off the rails at job completion when SLURM attempts to clean up the job?
For the sake of completeness/comparison, a `pstree -u kameronr -psal` for the non-SLURM zombie gzip incident: screen,16735 -S analysis └─bash,16737 └─perl,17085 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 ├─(gzip,17168) ├─perl,17127 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq │ └─bowtie2-align-s,18529 --wrapper basic-0 -q -N 0 -L 20 
--score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq │ └─{bowtie2-align-s},21283 ├─perl,21347 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq │ └─bowtie2-align-s,21500 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.8_G_to_A.fastq │ └─{bowtie2-align-s},22684 ├─perl,79980 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 
0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ ├─perl,21026 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq │ │ └─bowtie2-align-s,21272 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq │ │ └─{bowtie2-align-s},22222 │ ├─perl,22264 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq │ │ └─bowtie2-align-s,22579 --wrapper basic-0 -q -N 
0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.1_G_to_A.fastq │ │ └─{bowtie2-align-s},24946 │ └─sh,25007 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.1_bismark_bt2_pe.bam │ └─samtools,25022 view -bSh - ├─perl,79981 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ 
├─perl,21830 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq │ │ └─bowtie2-align-s,21902 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq │ │ └─{bowtie2-align-s},23746 │ ├─perl,23807 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq │ │ └─bowtie2-align-s,24063 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.2_G_to_A.fastq │ │ └─{bowtie2-align-s},26016 │ └─sh,26064 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.2_bismark_bt2_pe.bam │ └─samtools,26067 view -bSh - ├─perl,79983 
/scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ ├─perl,21925 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq │ │ └─bowtie2-align-s,22000 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 
/tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq │ │ └─{bowtie2-align-s},23813 │ ├─perl,23883 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq │ │ └─bowtie2-align-s,24114 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.3_G_to_A.fastq │ │ └─{bowtie2-align-s},25810 │ └─sh,25865 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.3_bismark_bt2_pe.bam │ └─samtools,25873 view -bSh - ├─perl,79985 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 
0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ ├─perl,18898 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq │ │ └─bowtie2-align-s,19757 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq │ │ └─{bowtie2-align-s},21566 │ ├─perl,21643 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq │ │ └─bowtie2-align-s,21750 --wrapper basic-0 -q -N 
0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.4_G_to_A.fastq │ │ └─{bowtie2-align-s},23263 │ └─sh,23404 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.4_bismark_bt2_pe.bam │ └─samtools,23444 view -bSh - ├─perl,79988 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ 
├─perl,15326 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq │ │ └─bowtie2-align-s,16580 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq │ │ └─{bowtie2-align-s},20983 │ ├─perl,21018 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq │ │ └─bowtie2-align-s,21298 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.5_G_to_A.fastq │ │ └─{bowtie2-align-s},22785 │ └─sh,22910 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.5_bismark_bt2_pe.bam │ └─samtools,22939 view -bSh - ├─perl,79989 
/scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ ├─perl,22508 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq │ │ └─bowtie2-align-s,22664 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 
/tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq │ │ └─{bowtie2-align-s},24877 │ ├─perl,24928 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq │ │ └─bowtie2-align-s,25110 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.6_G_to_A.fastq │ │ └─{bowtie2-align-s},27069 │ └─sh,27169 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.6_bismark_bt2_pe.bam │ └─samtools,27196 view -bSh - ├─perl,79991 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bismark --genome /ifs/labs/sjaiswal/genomes/mm9_fasta --fastq 0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz,0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz -1 
0hr_noBS_S38_R1_001.trimmed.fastq.gz,0hr_S31_R1_001.trimmed.fastq.gz,1hr_S32_R1_001.trimmed.fastq.gz,2hr_S33_R1_001.trimmed.fastq.gz,3hr_S34_R1_001.trimmed.fastq.gz,4hr_S35_R1_001.trimmed.fastq.gz,5hr_S36_R1_001.trimmed.fastq.gz,6hr_S37_R1_001.trimmed.fastq.gz -2 0hr_noBS_S38_R2_001.trimmed.fastq.gz,0hr_S31_R2_001.trimmed.fastq.gz,1hr_S32_R2_001.trimmed.fastq.gz,2hr_S33_R2_001.trimmed.fastq.gz,3hr_S34_R2_001.trimmed.fastq.gz,4hr_S35_R2_001.trimmed.fastq.gz,5hr_S36_R2_001.trimmed.fastq.gz,6hr_S37_R2_001.trimmed.fastq.gz --multicore 8 --un --ambiguous --output_dir /ifs/labs/sjaiswal/kameron/bisulfite/run180925_analysis4 --temp_dir /tmp --prefix run180925_mm9 -N 0 -L 20 │ ├─perl,21307 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq │ │ └─bowtie2-align-s,21473 --wrapper basic-0 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/CT_conversion/BS_CT -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq │ │ └─{bowtie2-align-s},22774 │ ├─perl,22895 /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/bowtie2 -q -N 0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq │ │ └─bowtie2-align-s,23183 --wrapper basic-0 -q -N 
0 -L 20 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw -x /ifs/scg/labs/sjaiswal/genomes/mm9_fasta/Bisulfite_Genome/GA_conversion/BS_GA -1 /tmp/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_C_to_T.fastq -2 /tmp/run180925_mm9.0hr_S31_R2_001.trimmed.fastq.gz.temp.7_G_to_A.fastq │ │ └─{bowtie2-align-s},24641 │ └─sh,24734 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.7_bismark_bt2_pe.bam │ └─samtools,24775 view -bSh - └─sh,22753 -c /scg/apps/software/anaconda/3_5.0.1_20180125/envs/bismark_0.19.1/bin/samtools view -bSh 2>/dev/null - > /ifs/scg/labs/sjaiswal/kameron/bisulfite/run180925_analysis4/run180925_mm9.0hr_S31_R1_001.trimmed.fastq.gz.temp.8_bismark_bt2_pe.bam └─samtools,22756 view -bSh -
(In reply to John Hanks from comment #44)
> Our cgroups config is whatever CentOS 7.5 does out-of-the-box, which appears
> to be v1 configured via systemd. As far as I can tell v2 is not an option:
>
> [root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
> [root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
> mount: unknown filesystem type 'cgroup2'

That is expected: this is not a separate cgroup2 filesystem, but a mount option that makes the cgroup v1 cpuset hierarchy behave like cgroup2. The option is '-o cpuset_v2_mode', as quoted from the RHEL release note.

> We are not tied to this config, so if it needs to be modified to test
> anything, happy to do it.

I think it is worth trying, at least to see whether it removes this kernel warning.

> We've reproduced this outside of SLURM on a different node now (Dell PE 7425
> with AMD EPYC CPUs), at least for this particular application and it doesn't
> look like the app itself cares that the gzip process is a zombie, but we'll
> need to let it run another day or two to be sure it finishes correctly.
> Since previous occurrences were the sleep processes started by SLURM as well
> as application processes from jobs, maybe we have a generic problem here
> that SLURM just highlights because things go off the rails at job completion
> when SLURM attempts to clean up the job?

This is not good at all and breaks our previous assumptions:
1. it is not due to the SGI UV300 architecture
2. it is not due to the application

And it suggests new ones:
3. it is not Slurm's fault
4. it may be a kernel bug

Does this other machine have the same kernel? Can we discard a kernel bug, or is that theory still valid? Does dmesg also show the kernel warning on this other node?

The application could be ignoring child processes and masking signals, which are two very bad things. We should verify that in the perl and qiime cases. Was qiime a perl process?

Are we in sync with these ideas?
(In reply to Felip Moll from comment #46)
> (In reply to John Hanks from comment #44)
> > Our cgroups config is whatever CentOS 7.5 does out-of-the-box, which appears
> > to be v1 configured via systemd. As far as I can tell v2 is not an option:
> >
> > [root@sgiuv300-srcf-d10-01 /]# mkdir /mnt/cgroup2
> > [root@sgiuv300-srcf-d10-01 /]# mount -t cgroup2 none /mnt/cgroup2
> > mount: unknown filesystem type 'cgroup2'
>
> The mount is a mount option for cgroup version1 to behave like cgroup2.
> mount option is '-o cpuset_v2_mode' as I quoted from RHEL release note.
>
> > We are not tied to this config, so if it needs to be modified to test
> > anything, happy to do it.
>
> I think is worth to try at least to see if it removes this kernel warning.
>
> > We've reproduced this outside of SLURM on a different node now (Dell PE 7425
> > with AMD EPYC CPUs), at least for this particular application and it doesn't
> > look like the app itself cares that the gzip process is a zombie, but we'll
> > need to let it run another day or two to be sure it finishes correctly.
> > Since previous occurrences were the sleep processes started by SLURM as well
> > as application processes from jobs, maybe we have a generic problem here
> > that SLURM just highlights because things go off the rails at job completion
> > when SLURM attempts to clean up the job?
>
> This is not good at all and brokes our previous assumptions:
> 1. it is not due to SGI UV300 architecture
> 2. it is not due to the application
>
> And shows a possible ones:
> 3. it is not Slurm's fault
> 4. it may be a kernel bug?

This is what I am concluding as well.

> Does this other machine have the same kernel? May we discard a kernel bug or
> may this theory still be valid? Does dmesg also show the kernel dump warning
> in this other node?

Both nodes boot disklessly from the same image, same kernel. From an OS perspective, they are identical.
> The application could be ignoring child processes and masking > signals, which are two very bad things. We should verify that in > perl and qiime cases. Was qiime a perl process? qiime is a python wrapper that calls a lot of other stuff. bismark is a perl wrapper that calls a bunch of other tools. The R job was just R, but in its case it was the sleep commands started by slurmstepd which zombied. The bismark run outside of SLURM finished successfully (as far as the app's output is concerned), leaving behind a zombie gzip and two zombie perl processes. > Are we in the sync with these ideas? I think we are. systemd rants in next comment :)
I've tried this: [root@sgiuv300-srcf-d10-01 ~]# mount | grep cpuset cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset) [root@sgiuv300-srcf-d10-01 ~]# mount -o remount,cpuset_v2_mode /sys/fs/cgroup/cpuset mount: /sys/fs/cgroup/cpuset not mounted or bad option In some cases useful info is found in syslog - try dmesg | tail or so. Of course nothing in dmesg with any clues. Between google, looking in my /etc and probing numerous man pages, I have not yet found where I would actually configure systemd to use this mount option for mounting cpuset. Still looking, only stopping to occasionally stick pins in my Poettering voodoo doll or delicately roast it over an open candle flame.
> Does dmesg also show the kernel dump warning in this other node? Checked all nodes; at least 42 (out of 77) have the kernel stack trace at boot time. Sample from a Dell R730xd: [24259.423477] ------------[ cut here ]------------ [24259.423493] WARNING: CPU: 11 PID: 38788 at kernel/cpuset.c:1296 cpuset_write_resmask+0x80a/0xa10 [24259.423496] Modules linked in: nfsd nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sunrpc dm_mirror dm_region_hash dm_log dm_mod dcdbas iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit ttm drm_kms_helper pcspkr syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core mei_me lpc_ich mei sg shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter binfmt_misc ip_tables ixgbe [24259.423568] dca mdio megaraid_sas tg3 ptp pps_core ahci libahci libata sd_mod crc_t10dif crct10dif_common [24259.423584] CPU: 11 PID: 38788 Comm: slurmstepd Tainted: P OE ------------ 3.10.0-862.14.4.el7.x86_64 #1 [24259.423587] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.6.0 10/26/2017 [24259.423589] Call Trace: [24259.423598] [<ffffffffaa713754>] dump_stack+0x19/0x1b [24259.423606] [<ffffffffaa0945d8>] __warn+0xd8/0x100 [24259.423611] [<ffffffffaa09471d>] warn_slowpath_null+0x1d/0x20 [24259.423615] [<ffffffffaa125f0a>] cpuset_write_resmask+0x80a/0xa10 [24259.423623] [<ffffffffaa11b020>] ? cgroup_rightmost_descendant+0x80/0x80 [24259.423628] [<ffffffffaa125700>] ? update_cpumasks_hier+0x410/0x410 [24259.423632] [<ffffffffaa11d411>] cgroup_file_write+0x1d1/0x2d0 [24259.423639] [<ffffffffaa2221c8>] ?
__sb_start_write+0x58/0x110 [24259.423646] [<ffffffffaa2d5437>] ? security_file_permission+0x27/0xa0 [24259.423651] [<ffffffffaa21f240>] vfs_write+0xc0/0x1f0 [24259.423656] [<ffffffffaa22006f>] SyS_write+0x7f/0xf0 [24259.423663] [<ffffffffaa72579b>] system_call_fastpath+0x22/0x27 [24259.423667] ---[ end trace 2f4806cb5ac5b506 ]---
Well, the config file of interest seems to be: https://github.com/systemd/systemd/blob/master/src/core/mount-setup.c Any suggestions welcome. My Poettering voodoo doll can't take much more abuse, though, and it doesn't seem to help anyway. Can I revert to having SLURM mount its own view of these and control them as we did back in the CentOS 6 days of old?
It seems that it is possible to mount cgroups with the v2 option at another location, [root@sgiuv300-srcf-d10-01 ~]# mount -t cgroup -o cpuset,cpuset_v2_mode none /mnt/cgroup2/ [root@sgiuv300-srcf-d10-01 ~]# ls /mnt/cgroup2/ cgroup.clone_children cpuset.cpu_exclusive cpuset.mem_exclusive cpuset.memory_pressure_enabled cpuset.sched_load_balance slurm cgroup.event_control cpuset.cpus cpuset.mem_hardwall cpuset.memory_spread_page cpuset.sched_relax_domain_level tasks cgroup.procs cpuset.effective_cpus cpuset.memory_migrate cpuset.memory_spread_slab notify_on_release cgroup.sane_behavior cpuset.effective_mems cpuset.memory_pressure cpuset.mems release_agent Based on that I'm going to test building a separate CgroupMountpoint= location for SLURM to use and if it works, will split the cluster between using v1 and v2 and see if there is a difference in outcomes.
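If that mount behaves, pointing Slurm at it should only need a cgroup.conf change; a sketch only, untested, using the example mountpoint from above and leaving every other cgroup parameter at whatever the cluster already uses:

```
# cgroup.conf sketch (/mnt/cgroup2 is the example path mounted above)
CgroupMountpoint=/mnt/cgroup2
CgroupAutomount=yes
```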
(In reply to John Hanks from comment #51) > It seems that it is possible to mount cgroups with the v2 option at another > location, > > [root@sgiuv300-srcf-d10-01 ~]# mount -t cgroup -o cpuset,cpuset_v2_mode none > /mnt/cgroup2/ > [root@sgiuv300-srcf-d10-01 ~]# ls /mnt/cgroup2/ > cgroup.clone_children cpuset.cpu_exclusive cpuset.mem_exclusive > cpuset.memory_pressure_enabled cpuset.sched_load_balance slurm > cgroup.event_control cpuset.cpus cpuset.mem_hardwall > cpuset.memory_spread_page cpuset.sched_relax_domain_level tasks > cgroup.procs cpuset.effective_cpus cpuset.memory_migrate > cpuset.memory_spread_slab notify_on_release > cgroup.sane_behavior cpuset.effective_mems cpuset.memory_pressure > cpuset.mems release_agent > > Based on that I'm going to test building a separate CgroupMountpoint= > location for SLURM to use and if it works, will split the cluster between > using v1 and v2 and see if there is a difference in outcomes. Hi John, Were you able to test the black magic here? Any news?
I haven't had a chance to dig deeper lately and it'll probably be a few weeks before I have a chance to make any real progress, unless of course this begins increasing in frequency and becomes the squeakiest wheel.
(In reply to John Hanks from comment #53) > I haven't had a chance to dig deeper lately and it'll probably be a few > weeks before I have a chance to make any real progress, unless of course > this begins increasing in frequency and becomes the squeakiest wheel. Hi, It's been a bit since your last comment. Do you have any news? Were you able to test the cgroup mount option? Thanks! Felip
Hi Felip, In the interim we've upgraded to 18.08.4 and CentOS 7.6, and the problem hasn't shown up since those updates. Things are also winding down here for the winter break, so it's unlikely this will see any more activity this year unless it suddenly becomes a chronic problem. jbh
(In reply to John Hanks from comment #55) > Hi Felip, > > In the interim we've upgraded to 18.08.4 and CentOS 7.6 and since then, the > problem hasn't been showing up since those updates. Things are also winding > down here for the winter break, so it's unlikely this will see any more > activity this year unless it suddenly becomes a chronic problem. > > jbh Ok, I will keep this one open until, let's say, 17th January, and if it doesn't show up again we will close it. I am not sure the Slurm upgrade has fixed the issue, but since you also upgraded the OS, it could have. Do you agree?
(In reply to Felip Moll from comment #56) > I will keep this one open until, let's say, 17th January, and if it > doesn't show up again we will close it. > > I am not sure the upgrade have fixed the issue, but if you also upgraded > the OS it could. > > Do you agree? Yes, it seems like the problem is some corner case with the kernel and the way Slurm uses cgroups, so it's entirely possible the kernel update fixed it. It's also possible the user(s) with the jobs that triggered it are already on break :) It's all speculation until I track down a repeatable reproducer test case.
(In reply to John Hanks from comment #57) > (In reply to Felip Moll from comment #56) > > I will keep this one open until, let's say, 17th January, and if it > > doesn't show up again we will close it. > > > > I am not sure the upgrade have fixed the issue, but if you also upgraded > > the OS it could. > > > > Do you agree? > > Yes, it seems like the problem is some corner case with the kernel and the > way Slurm uses cgroups, so it's entirely possible the kernel update fixed > it. It's also possible the user(s) with the jobs that triggered it are > already on break :) It's all speculation until I track down a repeatable > reproducer test case. Hey John, First of all, happy new year! I revisited this bug because it turns out we have another internal one where slurmstepd processes are stuck, which then are killed by Slurm after a timeout, and may leave sleep processes behind. It's similar in some way to your issue but we haven't managed to diagnose yet. It seems related to the PrologFlags=contain parameter. Have you seen any other occurrence of this issue?
> Hey John, > > First of all, happy new year! > > I revisited this bug because it turns out we have another internal one > where slurmstepd processes are stuck, which then are killed by Slurm after > a timeout, and may leave sleep processes behind. It's similar in some way > to your issue but we haven't managed to diagnose yet. It seems related to > the PrologFlags=contain parameter. > > Have you seen any other occurrence of this issue? Hi Felip, I must shamefully admit that this did recur over the holiday break and I just rebooted the affected node and quickly went back to gluttonous overeating, pretending like there was no problem that couldn't be solved by more mashed potatoes. Our PrologFlags are: PrologFlags=Alloc,Contain,X11 so these may be the same issue. jbh
(In reply to John Hanks from comment #59) > > Hey John, > > > > First of all, happy new year! > > > > I revisited this bug because it turns out we have another internal one > > where slurmstepd processes are stuck, which then are killed by Slurm after > > a timeout, and may leave sleep processes behind. It's similar in some way > > to your issue but we haven't managed to diagnose yet. It seems related to > > the PrologFlags=contain parameter. > > > > Have you seen any other occurrence of this issue? > > Hi Felip, > > I must shamefully admit that this did recur over the holiday break and I > just rebooted the affected node and quickly went back to gluttonous > overeating, pretending like there was no problem that couldn't be solved by > more mashed potatoes. > > Our PrologFlags are: > > PrologFlags=Alloc,Contain,X11 > > so these may be the same issue. > > jbh Hi John, I escalated this issue to my team and took a look at it. We all think this must be a kernel or a zfs bug, because in your traces we see processes hung in D state. If there are D processes, as I commented before, there is no way for Slurm to kill them. I see you have the zfs module loaded; are you using zfs filesystems? Are the affected processes touching files on these filesystems? We have some concerns about the scalability of ZFS with a large number of cores.
This is a valid theory, we do use ZFS on all compute nodes and the /tmp space for each user is on a zfs filesystem. That doesn't guarantee that affected processes are using ZFS, but it does make it possible/likely. And regardless of how it's triggered it does seem like a kernel bug at this point. I'm still waiting for a re-occurrence or a repeatable reproducer to appear and will add this to the list of things to check if/when the problem appears again. jbh
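When it recurs, a quick scan for D-state tasks and their kernel wait channels would show directly whether the stuck processes are sitting in zfs or nfs code; a generic sketch, nothing Slurm-specific:

```shell
# Walk /proc and print every task in uninterruptible sleep (state D)
# with its name and the kernel symbol it is currently blocked in.
for p in /proc/[0-9]*; do
    state=$(awk '{print $3}' "$p/stat" 2>/dev/null)
    if [ "$state" = "D" ]; then
        printf '%s\t%s\t%s\n' "${p#/proc/}" \
            "$(cat "$p/comm" 2>/dev/null)" \
            "$(cat "$p/wchan" 2>/dev/null)"
    fi
done
```

On a healthy node this prints nothing (or only transient I/O); on a wedged one the wchan column tends to name the guilty subsystem.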
(In reply to John Hanks from comment #61) > This is a valid theory, we do use ZFS on all compute nodes and the /tmp > space for each user is on a zfs filesystem. That doesn't guarantee that > affected processes are using ZFS, but it does make it possible/likely. And > regardless of how it's triggered it does seem like a kernel bug at this > point. > > I'm still waiting for a re-occurrence or a repeatable reproducer to appear > and will add this to the list of things to check if/when the problem appears > again. > > jbh One option would be to get a kernel dump and send it to SGI, though you also reproduced it on other (Dell) machines. Unfortunately I think that analyzing a kernel dump is out of SchedMD's support scope. At this point, do you want to keep this bug open until it reproduces again, or do you feel comfortable closing it now? Thanks, and sorry for the late response.
(In reply to Felip Moll from comment #62) > > One option would be to get a kernel dump and sent it to SGI, though you > reproduced it in other (Dell) machines. > > Unfortunately I think that analyzing a kernel dump is out of SchedMD's > support scope. > > At this point, do you want to keep this bug open until it reproduces again, > or otherwise you feel comfortable closing it now? > > Thanks and sorry for the late response. Sadly it's HPE now, not SGI and they already beat you to the "unsupported configuration" punch line, on both CentOS and ZFS. I think it's fine to close this bug, if/when it pops up again I'll need to recheck everything anyway since we've updated both SLURM and kernels since it happened last.
> Sadly it's HPE now, not SGI and they already beat you to the "unsupported > configuration" punch line, on both CentOS and ZFS. Yeah, I know :( , this is unfortunate. > I think it's fine to > close this bug, if/when it pops up again I'll need to recheck everything > anyway since we've updated both SLURM and kernels since it happened last. That's ok. Our other recommendation is to try moving /tmp from ZFS to another filesystem, at least just to test and discard things. I know this sounds like trial and error, but given the D states I think it is worth trying. Also, an lsof of the stuck processes could help. As you know, just reopen if it reproduces again.
Like any good zombie story, there's always another season. This morning I have several nodes on which slurmd has died and fails to restart: [root@dper7425-srcf-d15-19 ~]# slurmd -D -v -v -v -v -v -v -v slurmd: debug: Log file re-opened slurmd: debug4: found jobid = 7273685, stepid = 4294967294 slurmd: debug4: found jobid = 7271593, stepid = 4294967294 slurmd: debug4: found jobid = 7271594, stepid = 4294967294 slurmd: debug4: found jobid = 7271543, stepid = 4294967294 slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.5/lib/slurm/auth_munge.so slurmd: debug: Munge authentication plugin loaded slurmd: debug3: Success. it never gets any further than that. gdb and a bt: [root@dper7425-srcf-d15-19 ~]# gdb `which slurmd` GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /scg/slurm/installed/slurm-18.08.5/sbin/slurmd...done. (gdb) set args -D -v -v -v -v -v -v -v (gdb) run Starting program: /scg/slurm/current/sbin/slurmd -D -v -v -v -v -v -v -v [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". slurmd: debug: Log file re-opened slurmd: debug4: found jobid = 7273685, stepid = 4294967294 slurmd: debug4: found jobid = 7271593, stepid = 4294967294 slurmd: debug4: found jobid = 7271594, stepid = 4294967294 slurmd: debug4: found jobid = 7271543, stepid = 4294967294 slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.5/lib/slurm/auth_munge.so slurmd: debug: Munge authentication plugin loaded slurmd: debug3: Success. 
^C Program received signal SIGINT, Interrupt. 0x00002aaaac3569c0 in __connect_nocancel () from /usr/lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 glibc-2.17-260.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libssh2-1.4.3-12.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.7.5-2.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pam-1.1.8-22.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.2-13.el7.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) thread apply all bt full Thread 1 (Thread 0x2aaaaaadb700 (LWP 126544)): #0 0x00002aaaac3569c0 in __connect_nocancel () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002aaaab24437c in _step_connect (stepid=4294967294, jobid=7273685, nodename=0x692fa0 "dper7425-srcf-d15-19", directory=0x692f70 "/var/spool/slurm/slurmd") at stepd_api.c:173 fd = 5 len = <optimized out> addr = {sun_family = 1, sun_path = "/var/spool/slurm/slurmd/dper7425-srcf-d15-19_7273685.4294967294", '\000' <repeats 44 times>} name = 0x693a60 "/var/spool/slurm/slurmd/dper7425-srcf-d15-19_7273685.4294967294" #2 stepd_connect (directory=0x692f70 "/var/spool/slurm/slurmd", nodename=0x692fa0 "dper7425-srcf-d15-19", jobid=7273685, stepid=4294967294, protocol_version=protocol_version@entry=0x68ff58) at stepd_api.c:273 req = 0 fd = -1 rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x693c80 len = 0 __func__ = "stepd_connect" #3 0x000000000040b955 in _update_logging () at slurmd.c:2027 fd = <optimized out> steps = 0x638ec0 i = 0x669830 stepd = 0x68ff40 o = <optimized out> cf = <optimized out> #4 0x000000000040d0d0 in _read_config () at slurmd.c:921 path_pubkey = 0x0 cf = 0x633f40 <slurmctld_conf> cc = <optimized out> cr_flag = 
true gang_flag = false tok = <optimized out> save_ptr = 0x0 over_memory_kill = false __func__ = "_read_config" #5 0x000000000040eb06 in _slurmd_init () at slurmd.c:1545 rlim = {rlim_cur = 0, rlim_max = 0} stat_buf = {st_dev = 3833183645592809572, st_ino = 3559019523307893549, st_nlink = 18446742975157447989, ---Type <return> to continue, or q <return> to quit--- st_mode = 4294967295, st_uid = 4294967295, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}} cpu_cnt = <optimized out> #6 main (argc=9, argv=0x7fffffffdd98) at slurmd.c:261 i = <optimized out> pidfd = <optimized out> blocked_signals = {13, 0} cc = <optimized out> oom_value = <optimized out> slurmd_uid = 0 curr_uid = 0 time_stamp = '\000' <repeats 32 times>, "\377\377\377\377\377\377\377\377\020\211~\\\000\000\000\000\345\332R\031\000\000\000\000\374\366\377\377\377\037\000\000\337/l\253\252*\000\000\000\000\000\000\000\000\000\000\300\334\377\377\377\177\000\000Ѕc\000\000\000\000\000\360\333\377\377\377\177\000\000\065&l\253\252*\000\000\377\000\000\000\000\000\000\000\360\001\000\000\000\000\000\000\002\000\000\000\000\000\000\000\006\000\000\000\000\000\000\200", '\000' <repeats 56 times>, "\200"... 
__func__ = "main" A pstree shows the slurmstepds have lost the children: ├─slurmstepd(6678)─┬─{slurmstepd}(6679) │ ├─{slurmstepd}(6680) │ ├─{slurmstepd}(6681) │ ├─{slurmstepd}(6684) │ └─{slurmstepd}(3829) ├─slurmstepd(6683)─┬─{slurmstepd}(6687) │ └─{slurmstepd}(3842) ├─slurmstepd(107340)─┬─{slurmstepd}(107343) │ └─{slurmstepd}(3836) ├─slurmstepd(531)─┬─{slurmstepd}(540) │ ├─{slurmstepd}(541) │ ├─{slurmstepd}(543) │ ├─{slurmstepd}(546) │ ├─{slurmstepd}(126253) │ └─{slurmstepd}(71084) ├─slurmstepd(2977)─┬─{slurmstepd}(2986) │ ├─{slurmstepd}(2987) │ ├─{slurmstepd}(2989) │ ├─{slurmstepd}(2992) │ ├─{slurmstepd}(6467) │ └─{slurmstepd}(81942) ├─smartd(14458) ├─sshd(20298)───sshd(6038)───bash(6087)───pstree(100405) ├─sssd(20290)─┬─sssd_autofs(20294) │ ├─sssd_be(20291) │ ├─sssd_nss(20292) │ └─sssd_pam(20293) ├─starter-suid(6952)─┬─call_vars_deepv(7102)───call_variants(51944)───python(52011)───python(52203)─┬─{python}(76410) │ │ ├─{python}(76411) │ │ ├─{python}(76412) │ │ ├─{python}(76413) │ │ ├─{python}(76414) │ │ ├─{python}(76415) │ │ ├─{python}(76416) │ │ ├─{python}(76417) │ │ ├─{python}(76418) │ │ ├─{python}(76419) │ │ └─{python}(76420) │ ├─{starter-suid}(7105) │ ├─{starter-suid}(7106) │ ├─{starter-suid}(7107) │ ├─{starter-suid}(7108) │ ├─{starter-suid}(7110) │ └─{starter-suid}(7115) ├─starter-suid(6953)─┬─call_vars_deepv(7081)───call_variants(51946)───python(52096)───python(52204) │ └─{starter-suid}(7087) ├─starter-suid(107349)─┬─call_vars_deepv(107369)───postprocess_var(54867)───python(56259)───python(57586)───{python}(76641) │ └─{starter-suid}(107373) strace of a slurmstepd shows: [root@dper7425-srcf-d15-19 ~]# strace -ff -p 6678 strace: Process 6678 attached with 6 threads [pid 6684] restart_syscall(<... 
resuming interrupted restart_syscall ...> <unfinished ...> [pid 6681] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 6680] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 6679] futex(0x2b547634a520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 6678] futex(0x2b547634d520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 6684] <... restart_syscall resumed> ) = -1 ENOSYS (Function not implemented) [pid 6681] <... futex resumed> ) = -1 ENOSYS (Function not implemented) [pid 6680] <... futex resumed> ) = -1 ENOSYS (Function not implemented) [pid 6679] <... futex resumed> ) = -1 ENOSYS (Function not implemented) [pid 6678] <... futex resumed> ) = -1 ENOSYS (Function not implemented) Attaching to those slurmstepd processes results in a hung gdb (killable with kill -9). Looking into the /proc/PID of these and trying to ls /proc/PID/fd results in a hung ls process (not killable with kill -9). There are no recent dmesg entries, at least none since these jobs started. The last dmesg entry is 02/28 4 am. The first of these jobs started later that day around 9 am. Every hung node I looked at so far is running a similar job from the same user, so I'll follow up on what the job is doing and maybe we'll get a reproducer out of this. Since I can't attach successfully with gdb, and lsof, /proc, etc. are non-responsive, I'm rebooting these to focus on reproducing it. jbh
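For next time: /proc/PID/stat and /proc/PID/wchan are usually still readable even when ls on /proc/PID/fd wedges, because they read the task struct rather than the task's (stuck) open files. A generic sketch, where the PID is just an example from the pstree above, and /proc/PID/stack needs root plus a kernel built with stack tracing:

```shell
# Read scheduler state and kernel wait channel without touching the
# task's open files; 6678 is an example PID from the pstree above.
pid=6678
awk '{print "state:", $3}' "/proc/$pid/stat"
echo "wchan: $(cat /proc/$pid/wchan)"
# Root only, if enabled: the task's full in-kernel stack; timeout(1)
# keeps the shell from blocking forever if this probe also sticks.
timeout 5 cat "/proc/$pid/stack" || echo "stack not readable"
```

Caveat: if a probe itself enters uninterruptible sleep, even timeout can't reap it, so this only helps for the milder hangs.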
The jobs that are associated with these zombies all run a container using singularity which binds to several NFS filesystems. Looking across several of these I can see that some of the earlier starting jobs do overlap with [Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying [Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying [Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK which is one of the locations singularity is binding to: singularity exec --bind /srv/gsfs0 --bind /ifs/lab/... There is a decidedly p > 0 probability that previous jobs were also using singularity, at least one prior occurrence of this was with a user who runs a pipeline dependent on a couple of container images. Other data: - a mix of Intel and AMD EPYC nodes were affected. - these jobs are part of a batch of 600 jobs, about 8 or so triggered zombies so ~ 1.3% failed. I'm not hopeful about getting a reproducer out of this. - the nfs chatter in dmesg was common in all cases. - around the time of at least some of these messages we did have a controller failure in the DDN/GPFS behind /srv/gsfs0, but it seemed to recover ok at the time and works fine on the other nodes. Sorry for the unfiltered dump, just trying to not overlook something that might be relevant. Is it possible a trigger for this could be if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and NFS chokes) then hang/zombies? The number of things that would have to come together there at the same time make the low incident rate for any set of jobs that potentially meet these conditions seem reasonable.
Another possible clue: this message seems to be suspiciously present around the times we've seen this problem and shows up in old logs with a similar occurrence rate, which is to say weeks go by without it then a few jobs get hit: Mar 1 15:06:32 frankie slurmctld[176615]: error: A non superuser 324759 tried to complete batch JobId=7271361 It's happening right now to two jobs that look like this on the node: [root@sgisummit-frcf-111-20 ~]# pstree -u dwyman slurmstepd───6*[{slurmstepd}] slurmstepd───6*[{slurmstepd}] If I attach with gdb, it hangs and I can't interrupt to get a bt. strace shows (in a repeating way): [root@sgisummit-frcf-111-20 proc]# strace -ff -p 221000 strace: Process 221000 attached with 6 threads [pid 42795] futex(0x6372c0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 129792] restart_syscall(<... resuming interrupted nanosleep ...> <unfinished ...> [pid 221027] restart_syscall(<... resuming interrupted poll ...> <unfinished ...> [pid 221018] restart_syscall(<... resuming interrupted poll ...> <unfinished ...> [pid 221015] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 221014] futex(0x2b005bab53fc, FUTEX_WAIT_PRIVATE, 503, NULL <unfinished ...> [pid 221015] <...
restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15019, {1551814838, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15021, {1551814839, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15023, {1551814840, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15025, {1551814841, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15027, {1551814842, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15029, {1551814843, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15031, {1551814844, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15033, {1551814845, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 221015] futex(0x2b005bab3124, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15035, {1551814846, 782476000}, ffffffff <unfinished ...> [pid 129792] <... restart_syscall resumed> ) = 0 [pid 129792] stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=74, ...}) = 0 [pid 129792] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 17 [pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR) [pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 129792] connect(17, {sa_family=AF_INET, sin_port=htons(6923), sin_addr=inet_addr("10.210.15.254")}, 16) = -1 EINPROGRESS (Operation now in progress) [pid 129792] poll([{fd=17, events=POLLIN|POLLOUT}], 1, 2000) = 1 ([{fd=17, revents=POLLOUT}]) [pid 129792] getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 [pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0 [pid 129792] fcntl(17, F_SETFD, FD_CLOEXEC) = 0 [pid 129792] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, 8) = 0 [pid 129792] stat("/var/run/munge/munge.socket.2", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 [pid 129792] socket(AF_LOCAL, SOCK_STREAM, 0) = 18 [pid 129792] fcntl(18, F_GETFL) = 0x2 (flags O_RDWR) [pid 129792] fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 129792] connect(18, {sa_family=AF_LOCAL, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0 [pid 129792] writev(18, [{"\0`mK\4\2\0\0\0\0\24", 11}, {"\1\1\1\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 20}], 2) = 31 [pid 129792] read(18, "\0`mK\4\3\0\0\0\0\206", 11) = 11 [pid 129792] read(18, "\0\0\0\0\0\200MUNGE:AwQDAACx4qJScnazKTt7"..., 134) = 134 [pid 129792] close(18) = 0 [pid 129792] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, {SIG_DFL, [ALRM], SA_RESTORER, 0x2b005c7465d0}, 8) = 0 [pid 129792] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, 8) = 0 [pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR) [pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR) [pid 129792] fcntl(17, 
F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLOUT}], 1, 60000) = 1 ([{fd=17, revents=POLLOUT}])
[pid 129792] recvfrom(17, 0x2b0068e20950, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 129792] sendto(17, "\0\0\5H", 4, 0, NULL, 0) = 4
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLOUT}], 1, 60000) = 1 ([{fd=17, revents=POLLOUT}])
[pid 129792] recvfrom(17, 0x2b0068e20950, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 129792] sendto(17, "!\0\0\0\0\0\23\232\0\0\4\233\0\0\0\0\0\0\0\0\0\0\0\0\0\vauth/m"..., 1352, 0, NULL, 0) = 1352
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, {SIG_IGN, [PIPE], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_GETFL) = 0x2 (flags O_RDWR)
[pid 129792] fcntl(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 129792] poll([{fd=17, events=POLLIN}], 1, 60000) = 1 ([{fd=17, revents=POLLIN}])
[pid 129792] recvfrom(17, "", 4, 0, NULL, NULL) = 0
[pid 129792] fcntl(17, F_SETFL, O_RDWR) = 0
[pid 129792] close(17) = 0
[pid 129792] socket(AF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 17
[pid 129792] connect(17, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = 0
[pid 129792] sendto(17, "<14>Mar 5 11:40:46 slurmstepd[2"..., 88, MSG_NOSIGNAL, NULL, 0) = 88
[pid 129792] close(17) = 0
[pid 129792] rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT USR1 USR2 PIPE ALRM TERM TSTP], 8) = 0
[pid 129792] rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x2b005c7465d0}, 8) = 0
[pid 129792] rt_sigprocmask(SIG_SETMASK, [HUP INT QUIT USR1 USR2 PIPE ALRM TERM TSTP], NULL, 8) = 0
[pid 129792] nanosleep({15, 0}, <unfinished ...>
[pid 221015] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15037, {1551814847, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15039, {1551814848, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 221015] futex(0x2b005bab3124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 15041, {1551814849, 782476000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 221015] futex(0x2b005bab3160, FUTEX_WAKE_PRIVATE, 1) = 0

lsof on the node doesn't work, but /proc/PID/fd for the slurmstepd and children look (in general) like:

[root@sgisummit-frcf-111-20 fd]# ls -al
total 0
dr-x------ 2 root   root        0 Mar 5 11:37 .
dr-xr-xr-x 9 dwyman upg_dwyman  0 Mar 5 09:35 ..
lrwx------ 1 root   root       64 Mar 5 11:37 0 -> /dev/null
lrwx------ 1 root   root       64 Mar 5 11:37 1 -> /dev/null
lr-x------ 1 root   root       64 Mar 5 11:37 10 -> /sys/fs/cgroup/memory/slurm/uid_34470/job_7358408/step_batch/memory.oom_control
l-wx------ 1 root   root       64 Mar 5 11:37 11 -> /sys/fs/cgroup/memory/slurm/uid_34470/job_7358408/step_batch/cgroup.event_control
lrwx------ 1 root   root       64 Mar 5 11:37 12 -> anon_inode:[eventfd]
lr-x------ 1 root   root       64 Mar 5 11:37 13 -> pipe:[31120943]
l-wx------ 1 root   root       64 Mar 5 11:37 14 -> pipe:[31120943]
lr-x------ 1 root   root       64 Mar 5 11:37 15 -> /dev/null
lrwx------ 1 root   root       64 Mar 5 11:37 2 -> /dev/null
lr-x------ 1 root   root       64 Mar 5 11:37 3 -> /var/lib/sss/mc/passwd
lr-x------ 1 root   root       64 Mar 5 11:37 4 -> pipe:[31111662]
l-wx------ 1 root   root       64 Mar 5 11:37 5 -> pipe:[31111662]
lrwx------ 1 root   root       64 Mar 5 11:37 6 -> socket:[31111663]
lr-x------ 1 root   root       64 Mar 5 11:37 7 -> pipe:[31111664]
l-wx------ 1 root   root       64 Mar 5 11:37 8 -> pipe:[31111664]
lrwx------ 1 root   root       64 Mar 5 11:37 9 -> socket:[31121150]

This node has several jobs that are running normally on it, so I'm going to drain it and watch this to see what happens.
The two jobs from the previous comment eventually finished. The slurm logs which mention the job ids in question look like:

Mar 5 11:44:32 frankie slurmctld[69336]: Resending TERMINATE_JOB request JobId=7358407 Nodelist=sgisummit-frcf-111-20
Mar 5 11:44:32 frankie slurmctld[69336]: Resending TERMINATE_JOB request JobId=7358408 Nodelist=sgisummit-frcf-111-20
Mar 5 11:44:46 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:44:46 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:45:01 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:45:01 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:45:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:45:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:45:16 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:45:16 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:45:16 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:45:16 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:45:31 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:45:31 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:45:31 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:45:31 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:45:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:45:46 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:45:46 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:45:46 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:46:01 sgisummit-frcf-111-20 slurmstepd[221000]: Retrying job complete RPC for 7358408.4294967294
Mar 5 11:46:01 sgisummit-frcf-111-20 slurmstepd[221023]: Retrying job complete RPC for 7358407.4294967294
Mar 5 11:46:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358408
Mar 5 11:46:01 frankie slurmctld[69336]: error: A non superuser 34470 tried to complete batch JobId=7358407
Mar 5 11:46:16 sgisummit-frcf-111-20 slurmstepd[221000]: error: rmdir(/var/spool/slurm/slurmd/job7358408): Permission denied
Mar 5 11:46:16 sgisummit-frcf-111-20 slurmstepd[221023]: error: rmdir(/var/spool/slurm/slurmd/job7358407): Permission denied
Mar 5 11:46:20 sgisummit-frcf-111-20 slurmd[94054]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job7358407/slurm_script
Mar 5 11:46:20 sgisummit-frcf-111-20 slurmd[94054]: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job7358408/slurm_script
Mar 5 11:46:20 frankie slurmctld[69336]: cleanup_completing: JobId=7358407 completion process took 4246 seconds
Mar 5 11:46:20 frankie slurmctld[69336]: cleanup_completing: JobId=7358408 completion process took 4246 seconds

jbh
> [Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK

That's not good indeed.

> if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and
> NFS chokes) then hang/zombies?
>
> The number of things that would have to come together there at the same
> time make the low incident rate for any set of jobs that potentially meet
> these conditions seem reasonable.

To me, this makes perfect sense. All indications point to some I/O problem: D-state processes, gdb unable to attach, slurmd not proceeding, and even a hung 'ls'.

At this point I think that it is not a Slurm issue, and it seems it goes beyond what we (SchedMD) can do to diagnose it. The suggestion is to monitor your filesystem, be it ZFS or NFS. Maybe you can try to force it to happen just by stressing NFS and see what happens.

What do the singularity logs say? Is there any way to easily play with these containers and try to isolate them from Slurm?

> Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
> to complete batch JobId=7358407
> Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried

Have you added nodes to the system recently? If so, you need to restart all your slurmd daemons. If not, can you 'getent passwd' 34470? Are you using sssd?
(In reply to Felip Moll from comment #69)
> > [Thu Feb 28 03:13:19 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:13:22 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:13:32 2019] nfs: server gsfs0 not responding, still trying
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
> > [Thu Feb 28 03:16:41 2019] nfs: server gsfs0 OK
>
> That's not good indeed.
>
> > if (USING_SINGULARITY and BINDING an NFS mount and ACTIVELY READ/WRITE and
> > NFS chokes) then hang/zombies?
> >
> > The number of things that would have to come together there at the same
> > time make the low incident rate for any set of jobs that potentially meet
> > these conditions seem reasonable.
>
> To me, this makes perfect sense. All indications points to some I/O problem:
> D processes, gdb not able to attach, slurmd not proceeding and even a hung 'ls'..
>
> At this point I think that it is not a Slurm issue and it seems it goes
> beyond of what we (SchedMD) can do to diagnose it. The suggestion is to
> monitor your filesystem, be it ZFS or NFS. Maybe you can try to force it to
> happen just stressing NFS and see what happens.

I'm not sure it's that straightforward. When I check these nodes, the filesystems in question all work fine and the output from the jobs shows no indication that there were any I/O issues. The timing of these NFS messages in dmesg is circumstantial and suspicious, but there are no I/O problems in any processes except (potentially) the hung ones. The filesystems used by the jobs associated with these hung slurmstepd processes all work fine while the processes remain hung. None of these recent jobs had any associated zfs/kernel dumps in dmesg. And I hate containers and NFS, so I'd love to blame them or even systemd, but the common thread here seems to be slurmstepd. All these other things work fine across all our systems except in these rare flurries that occur in batch jobs.

> What do singularity logs says? Is there any way to easily play with these
> containers and try to isolate it from slurm?

At least for this recent workflow, the user tests it extensively on login nodes and via interactive srun jobs before submitting it en masse, and has not had this problem while working on the pipeline.

> > Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
> > to complete batch JobId=7358407
> > Mar 5 11:44:46 frankie slurmctld[69336]: error: A non superuser 34470 tried
>
> Have you added nodes to the system recently? If so you need to restart all
> your slurmd daemons.
> If not, 34470, can you 'getent passwd' it? Are you using sssd?

I did add a node recently, but I have these messages logged extending back before adding the node, and after my last node change several months ago. sssd is in use, but is working fine on the node (there were several other jobs running normally, and logins, etc., work fine).
Also, if there is a recipe I can follow for backtraces, core dumps, etc., that would be helpful; I'm happy to collect more info that way.
We have another data point to report on this; these messages caught my attention:

Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137198 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137199 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137200 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137201 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137202 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137203 Nodelist=dper7425-srcf-d15-27

Upon closer inspection these are more hung processes. It looks like at some point slurm does manage to mark things in this state as complete and move on, as on this node I found some older ghosts:

[root@dper7425-srcf-d15-27 ~]# ps aux | grep USERNAME
USERNAME  14113 0.0 0.0 420256 4316 ? D Feb27 0:00 slurmstepd: [7269234.batch]
USERNAME  14163 0.0 0.0 420256 5000 ? D Feb27 0:00 slurmstepd: [7269235.batch]
USERNAME  14230 0.0 0.0 420256 4852 ? D Feb27 0:00 slurmstepd: [7269236.batch]
USERNAME  14269 0.0 0.0 420256 6164 ? D Feb27 0:00 slurmstepd: [7269237.batch]
USERNAME  14319 0.0 0.0 420256 4144 ? D Feb27 0:00 slurmstepd: [7269238.batch]
USERNAME  33348 0.0 0.0 420260 4884 ? D Mar05 0:00 slurmstepd: [7549600.batch]
USERNAME  33385 0.0 0.0 420260 5848 ? D Mar05 0:00 slurmstepd: [7549601.batch]
USERNAME  33412 0.0 0.0 420260 6784 ? D Mar05 0:00 slurmstepd: [7549602.batch]
USERNAME  33443 0.0 0.0 420260 6780 ? D Mar05 0:00 slurmstepd: [7549603.batch]
USERNAME  61561 0.0 0.0 353700 7348 ? D Mar11 0:00 slurmstepd: [8137199.batch]
USERNAME  61582 0.0 0.0 353700 7352 ? D Mar11 0:00 slurmstepd: [8137198.batch]
USERNAME  61608 0.0 0.0 353700 7344 ? D Mar11 0:00 slurmstepd: [8137201.batch]
USERNAME  61633 0.0 0.0 353700 7352 ? D Mar11 0:00 slurmstepd: [8137202.batch]
USERNAME  61652 0.0 0.0 353700 7352 ? D Mar11 0:00 slurmstepd: [8137200.batch]
USERNAME  61678 0.0 0.0 353700 7348 ? D Mar11 0:00 slurmstepd: [8137203.batch]
USERNAME 103421 0.0 0.0 353696 7252 ? D Mar08 0:00 slurmstepd: [8086937.batch]
USERNAME 103434 0.0 0.0 353696 7308 ? D Mar08 0:00 slurmstepd: [8086936.batch]

It turns out these jobs have no accounting history:

[I am root@frankie ~]# sacct -j 7269237
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[I am root@frankie ~]# sacct -j 7269238
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[I am root@frankie ~]# sacct -j 7269239
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[I am root@frankie ~]# sacct -j 8086936
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[I am root@frankie ~]# sacct -j 8086937
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

In previous cases I hadn't thought to look at the accounting history for the zombies, but looking back now at one of the oldest reported examples:

[I am root@frankie ~]# sacct -j 3092804
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[I am root@frankie ~]#

Looking the node and processes over, there are no NFS errors in dmesg around the times associated with these jobs, and the children of the slurmstepd processes in question are all gone.

As per the usual behavior, I can't interrupt them to get a backtrace:

[root@dper7425-srcf-d15-27 ~]# gdb -p 103421 `which slurmstepd`
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /scg/slurm/installed/slurm-18.08.6/sbin/slurmstepd...done.
Attaching to program: /scg/slurm/current/sbin/slurmstepd, process 103421
^C^C^C^C^C^C^C^C

And strace is a dead end:

[root@dper7425-srcf-d15-27 ~]# strace -ff -p 103421
strace: Process 103421 attached

kill -11 PID for these does not produce a core dump.

If we can't figure out who/what to blame for why slurmd/slurmstepd is abandoning ship in these cases, can we at least agree that they need to be accounted for?
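When gdb and strace both hang because the target is in uninterruptible sleep, the kernel side can still be inspected through procfs. This is a minimal sketch of that idea using only standard /proc fields (nothing Slurm-specific; the stuck slurmstepd pid would replace the demo pid):

```shell
#!/bin/bash
# Inspect a process that cannot be attached to from userspace:
# show its scheduler state, the kernel function it is blocked in (wchan),
# and (as root, if available) its kernel stack.

proc_state() {
    # third field of /proc/PID/stat: R, S, D (uninterruptible), Z (zombie), ...
    awk '{print $3}' "/proc/$1/stat"
}

inspect() {
    local pid="$1"
    echo "pid ${pid} state: $(proc_state "$pid")"
    # wchan names the kernel function the task is sleeping in, or 0 if running
    [ -r "/proc/${pid}/wchan" ] && echo "wchan: $(cat "/proc/${pid}/wchan")" || true
    # /proc/PID/stack needs root and a kernel built with stack tracing
    [ -r "/proc/${pid}/stack" ] && cat "/proc/${pid}/stack" || true
}

inspect $$   # demo on the current shell; use the stuck slurmstepd pid instead
```

For a D-state slurmstepd, the wchan/stack output is often the only hint of which kernel path (NFS, cgroup, etc.) the process is wedged in.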
John,

I am working on your bug again today and tomorrow, and will have the rest of the team look at it as well. I will come back when I have relevant info.
I'm starting to think that this is a race between ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup, and not really a problem on your end. I have other similar bugs with the same symptoms, but until today I hadn't seen zombies at any other customer; now one reports zombies as well. In any case it seems to happen only on a specific set of machines; I cannot reproduce it, while one person reproduces it consistently.

I will come back to you.
Hi Felip,

Thanks for looking into it. I went back and counted these (based on the log messages and lack of accounting) and we've had this occur ~4000 times out of our 8.25 million jobs during that period. Since I've only noticed maybe 100 of those over that time, I assume that most were cleaned up by an automatic process I use to drain/reboot nodes with job-complete errors, or just wound up as unnoticed processes in D state. If there's something I can test or collect on my end, I'm happy to oblige, but without a solid reproducer it'll just be more statistical sampling.

jbh
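A rough version of the count described above can be scripted by scanning the controller log for the telltale messages and cross-checking sacct. This is a hedged sketch: the inline sample log stands in for /var/log/messages or the real slurmctld log, and the message text is taken from the lines quoted earlier in this bug:

```shell
#!/bin/bash
# Count distinct JobIds that hit the "Resending TERMINATE_JOB" path.
# The heredoc below is illustrative sample data; point the pipeline at
# your actual slurmctld log instead.
log=$(cat <<'EOF'
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137198 Nodelist=dper7425-srcf-d15-27
Mar 13 08:28:09 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137199 Nodelist=dper7425-srcf-d15-27
Mar 13 08:29:10 frankie slurmctld[58154]: Resending TERMINATE_JOB request JobId=8137198 Nodelist=dper7425-srcf-d15-27
EOF
)

# Extract the JobId= values and deduplicate repeated resends of the same job.
ids=$(printf '%s\n' "$log" | grep -o 'JobId=[0-9]*' | cut -d= -f2 | sort -u)
echo "distinct suspect jobs: $(printf '%s\n' "$ids" | wc -l)"

# For each suspect id, an empty sacct listing suggests no accounting record:
#   sacct -n -j "$id" | wc -l    # 0 lines -> job never accounted
```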
John,

Can you show me the output from a node of:

]$ mount |grep cgroup
(In reply to Felip Moll from comment #76)
> John,
>
> Can you show me the output from a node of:
>
> ]$ mount |grep cgroup

From a random node:

[root@dper7425-srcf-d15-09 ~]# mount |grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Another round of possibly affected jobs, at least to the point of a collection of slurmstepd processes which have been abandoned. This time, however, there is a backtrace.

Caveat: I did add a new node and partition to Slurm this morning. I don't know yet if these were already off the rails before then, or if I just noticed them after.

jbh

(gdb) thread apply all bt full

Thread 3 (Thread 0x2b7be2beb700 (LWP 119262)):
#0  0x00002b7bdd01b20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7bdbe25d77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b7be40008d0) at eio.c:364
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x14158c0) at eio.c:328
        retval = 0
        pollfds = 0x2b7be40008d0
        map = 0x2b7be4000900
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x000000000041d98a in _msg_thr_internal (job_arg=0x13dda30) at req.c:273
        job = 0x13dda30
#4  0x00002b7bdcd13dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#5  0x00002b7bdd025ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2b7be3b1e700 (LWP 119296)):
#0  0x00002b7bdd01b20d in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7be3206942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478
        args = 0x13caa90
        ret = <optimized out>
        res = 1
        fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}}
        __func__ = "_oom_event_monitor"
#2  0x00002b7bdcd13dd5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b7bdd025ead in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2b7bdbad4080 (LWP 119202)):
#0  0x00002b7bdcfece2d in nanosleep () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x00002b7bdcfeccc4 in sleep () from /usr/lib64/libc.so.6
No symbol table info available.
#2  0x000000000040da68 in _one_step_complete_msg (job=job@entry=0x13dda30, first=0, last=0) at mgr.c:886
        req = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = -1, data = 0x7ffc02ba4ac0, data_size = 0, flags = 0, msg_index = 0, msg_type = 5016, protocol_version = 65534, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        msg = {job_id = 8370873, job_step_id = 4294967295, range_first = 0, range_last = 0, step_rc = 0, jobacct = 0x140f9f0}
        rc = -1
        retcode = <optimized out>
        i = 184
        port = 0
        ip_buf = "0.0.0.0\000\000\000\000\000\000\000\000"
        acct_sent = true
#3  0x000000000040faeb in stepd_send_step_complete_msgs (job=job@entry=0x13dda30) at mgr.c:953
        start = 0
        size = 0
        first = -1
        last = -1
        sent_own_comp_msg = false
        __func__ = "stepd_send_step_complete_msgs"
#4  0x00000000004109ab in _spawn_job_container (job=0x13dda30) at mgr.c:1223
        jobacct_id = {taskid = 0, nodeid = 0, job = 0x13dda30}
        status = 15
        pid = 119298
        jobacct = <optimized out>
        rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 1306}, {ru_maxrss = 1800, __ru_maxrss_word = 1800}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 267, __ru_minflt_word = 267}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}
        rc = 0
        x11_pipe = {0, 0}
#5  job_manager (job=job@entry=0x13dda30) at mgr.c:1292
        rc = 0
        io_initialized = false
        ckpt_type = 0x13c59d0 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#6  0x000000000040d452 in main (argc=1, argv=0x7ffc02ba4e78) at slurmstepd.c:174
        cli = 0x13d4b70
        self = 0x13d4b40
        msg = 0x13dd340
        rc = 0
        launch_params = 0x0
        __func__ = "main"

(gdb) where
#0  0x00002b7bdcfece2d in nanosleep () from /usr/lib64/libc.so.6
#1  0x00002b7bdcfeccc4 in sleep () from /usr/lib64/libc.so.6
#2  0x000000000040da68 in _one_step_complete_msg (job=job@entry=0x13dda30, first=0, last=0) at mgr.c:886
#3  0x000000000040faeb in stepd_send_step_complete_msgs (job=job@entry=0x13dda30) at mgr.c:953
#4  0x00000000004109ab in _spawn_job_container (job=0x13dda30) at mgr.c:1223
#5  job_manager (job=job@entry=0x13dda30) at mgr.c:1292
#6  0x000000000040d452 in main (argc=1, argv=0x7ffc02ba4e78) at slurmstepd.c:174

From dmesg I see:

[Wed Mar 20 09:22:24 2019] slurmstepd[119463]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce1328040 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[119334]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce132a1b0 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[120270]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d47390 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
[Wed Mar 20 09:22:24 2019] slurmstepd[120200]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d49500 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
[Wed Mar 20 09:22:25 2019] slurmstepd[122299]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf6819d20 error 4 in libslurmfull.so[2acf01078000+1b3000]
[Wed Mar 20 09:22:25 2019] slurmstepd[121978]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf681be90 error 4 in libslurmfull.so[2acf01078000+1b3000]
Thanks, I am investigating the race condition currently.

One thing you could temporarily test is to change proctrack to linuxproc in order to see if the error is gone. This has implications, but it would give us a good hint toward concluding that it really is a race condition between these plugins.

> [Wed Mar 20 09:22:24 2019] slurmstepd[119463]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce1328040 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[119334]: segfault at 0 ip 00002ba7bc926b2b sp 00007ffce132a1b0 error 4 in libslurmfull.so[2ba7bc8b6000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[120270]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d47390 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
> [Wed Mar 20 09:22:24 2019] slurmstepd[120200]: segfault at 0 ip 00002b0d19130b2b sp 00007ffc48d49500 error 4 in libslurmfull.so[2b0d190c0000+1b3000]
> [Wed Mar 20 09:22:25 2019] slurmstepd[122299]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf6819d20 error 4 in libslurmfull.so[2acf01078000+1b3000]
> [Wed Mar 20 09:22:25 2019] slurmstepd[121978]: segfault at 0 ip 00002acf010e8b2b sp 00007ffcf681be90 error 4 in libslurmfull.so[2acf01078000+1b3000]

This seems to be a different issue. We should try to find the code where it segfaulted.

In the libslurmfull.so[2acf01078000+1b3000] part, 2acf01078000 is where the library was loaded. Subtract this from the instruction pointer (ip, 00002acf010e8b2b) and you'll get the offset into the .so of the offending instruction (it gives me 70b2b).

Then you can use objdump -DCgl libslurmfull.so and search for the instruction at that offset. We should easily be able to figure out which function it is from the asm labels. We can also try addr2line -j.text -e libslurmfull.so 70b2b, but because of compiler optimization it may not be correct.

https://stackoverflow.com/questions/2179403/how-do-you-read-a-segfault-kernel-log-message

I wouldn't worry too much about this segfault for now, though.
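The offset arithmetic described above can be scripted. A minimal sketch, using the addresses from the dmesg lines quoted earlier (the library path in the trailing comments is an example, not a real location):

```shell
#!/bin/bash
# Compute the offset of a faulting instruction inside a shared library,
# given the "ip" and load-address fields from a kernel segfault message:
#   segfault at 0 ip 00002acf010e8b2b ... in libslurmfull.so[2acf01078000+1b3000]
ip=0x2acf010e8b2b        # instruction pointer from dmesg
base=0x2acf01078000      # library load address (start of the [base+size] range)
offset=$(printf '%x' $(( ip - base )))
echo "offset: 0x${offset}"

# With the offset in hand, map it back to source (path is an example):
#   objdump -DCgl /path/to/libslurmfull.so | less     # then search for the offset
#   addr2line -j.text -e /path/to/libslurmfull.so ${offset}
```

For the dmesg lines above this yields 70b2b, matching the offset Felip computed by hand.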
For the sake of completeness (and because I have multiple jobs on multiple nodes that had this segfault in the same way/location):

From the objdump:

/scg/slurm/sources/slurm-18.08.6/src/common/env.c:824
   70b11:	48 8b 05 30 32 34 00 	mov    0x343230(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
   70b18:	4c 8b 80 60 04 00 00 	mov    0x460(%rax),%r8
   70b1f:	4d 85 c0             	test   %r8,%r8
   70b22:	75 0a                	jne    70b2e <setup_env+0x127e>
/scg/slurm/sources/slurm-18.08.6/src/common/env.c:827
   70b24:	48 8b 80 e0 00 00 00 	mov    0xe0(%rax),%rax
   70b2b:	4c 8b 00             	mov    (%rax),%r8
/scg/slurm/sources/slurm-18.08.6/src/common/env.c:828
   70b2e:	48 8b 05 13 32 34 00 	mov    0x343213(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
   70b35:	44 8b 88 90 04 00 00 	mov    0x490(%rax),%r9d
   70b3c:	48 8b 88 c8 00 00 00 	mov    0xc8(%rax),%rcx
   70b43:	48 8d 7b 70          	lea    0x70(%rbx),%rdi
   70b47:	c7 04 24 00 21 00 00 	movl   $0x2100,(%rsp)
   70b4e:	48 8d 15 2d 62 0f 00 	lea    0xf622d(%rip),%rdx        # 166d82 <__func__.7868+0x4c2>
   70b55:	48 8d 35 32 62 0f 00 	lea    0xf6232(%rip),%rsi        # 166d8e <__func__.7868+0x4ce>
   70b5c:	b8 00 00 00 00       	mov    $0x0,%eax
   70b61:	e8 1a 38 fc ff       	callq  34380 <setenvf@plt>

And,

[root@dper730xd-srcf-d16-35 ~]# addr2line -j.text -e /scg/slurm/current/lib/slurm/libslurmfull.so 70b2b
/scg/slurm/sources/slurm-18.08.6/src/common/plugin.c:162
(In reply to Felip Moll from comment #79)
> Thanks, I am investigating the race condition currently.
>
> One thing you could temporary test is to change proctrack to linuxproc in
> order to see if the error is gone. This has implications though, but would
> give us a good hint to really conclude it is a race condition with these
> plugins.

If I do this live with running jobs, will they be affected?

jbh
(In reply to John Hanks from comment #80)
> For the sake of completeness (and because I have multiple jobs on multiple
> nodes that had this segfault in the same way/location):
>
> From the objdump:
>
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:824
>    70b11:	48 8b 05 30 32 34 00 	mov    0x343230(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
>    70b18:	4c 8b 80 60 04 00 00 	mov    0x460(%rax),%r8
>    70b1f:	4d 85 c0             	test   %r8,%r8
>    70b22:	75 0a                	jne    70b2e <setup_env+0x127e>
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:827
>    70b24:	48 8b 80 e0 00 00 00 	mov    0xe0(%rax),%rax
>    70b2b:	4c 8b 00             	mov    (%rax),%r8
> /scg/slurm/sources/slurm-18.08.6/src/common/env.c:828
>    70b2e:	48 8b 05 13 32 34 00 	mov    0x343213(%rip),%rax        # 3b3d48 <slurmctld_conf@@Base-0xc078>
>    70b35:	44 8b 88 90 04 00 00 	mov    0x490(%rax),%r9d
>    70b3c:	48 8b 88 c8 00 00 00 	mov    0xc8(%rax),%rcx
>    70b43:	48 8d 7b 70          	lea    0x70(%rbx),%rdi
>    70b47:	c7 04 24 00 21 00 00 	movl   $0x2100,(%rsp)
>    70b4e:	48 8d 15 2d 62 0f 00 	lea    0xf622d(%rip),%rdx        # 166d82 <__func__.7868+0x4c2>
>    70b55:	48 8d 35 32 62 0f 00 	lea    0xf6232(%rip),%rsi        # 166d8e <__func__.7868+0x4ce>
>    70b5c:	b8 00 00 00 00       	mov    $0x0,%eax
>    70b61:	e8 1a 38 fc ff       	callq  34380 <setenvf@plt>
>
> And,
>
> [root@dper730xd-srcf-d16-35 ~]# addr2line -j.text -e /scg/slurm/current/lib/slurm/libslurmfull.so 70b2b
> /scg/slurm/sources/slurm-18.08.6/src/common/plugin.c:162

Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?

The plugin.c:162 is trying to load a .so plugin, while the objdump shows code trying to set up the environment variable SLURM_WORKING_CLUSTER from slurmctld_conf.control_addr[0]. In any case, I would suggest opening a new bug for this segfault.

As regards changing ProcTrack from cgroup to linuxproc, I did some tests and your jobs shouldn't be affected. I am suggesting this as a measure to diagnose the race condition between the cgroup plugins, and to add your bug to the list of bugs with the same symptoms. You need to restart all slurmds, but the stepds will continue working.

Let me know if you're willing to make this change; otherwise I will continue investigating the race and provide you with a solution when I have one.
> Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?

We are now at 18.08.6. No changes to slurm.conf recently except adding nodes and partitions.

> In any case I would suggest opening a new bug for this segfault.

I'll fork a new bug the next time I am cleaning these up on a node.

> Let me know if you're willing to do this change, otherwise I will continue
> investigating the race and when I have some solution provide you with it.

Change made; as of this afternoon we are running with:

ProctrackType=proctrack/linuxproc

If anything good or bad happens, you'll be the first to know :)

jbh
(In reply to John Hanks from comment #83)
> > Have you upgraded to 18.08.6? Have you changed your slurm.conf recently?
>
> We are now at 18.08.6. No changes to slurm.conf recently except adding nodes
> and partitions.
>
> > In any case I would suggest opening a new bug for this segfault.
>
> I'll fork a new bug the next time I am cleaning these up on a node.
>
> > Let me know if you're willing to do this change, otherwise I will continue
> > investigating the race and when I have some solution provide you with it.
>
> Change made, as of this afternoon we are running with
>
> ProctrackType=proctrack/linuxproc
>
> If anything good or bad happens, you'll be the first to know :)
>
> jbh

John,

I identified a problem when mixing jobacct_gather/cgroup and proctrack/cgroup. It turns out that jobacct_gather/cgroup moves the pids of the extern step's tasks into something like /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_27/step_extern/task_0/cgroup.procs, while proctrack/cgroup expects them to be in ...job_27/step_extern/cgroup.procs.

This means that when the job finishes, the processes inside step_extern (i.e. the sleep processes) are not killed, because the call to xcgroup_get_pids made by proctrack/cgroup returns no pids. Since there are no pids to kill, attempting to send a signal is useless; the processes do not die, the cgroup is not removed, and the job doesn't finish (CG state). After a timeout, the job is flushed and the node set to drain.

The only piece left in the puzzle is that this happens for me only when the cgroup mountpoints are shared, like:

cgroup on /sys/fs/cgroup/freezer,memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,freezer)

and that does not seem to be your case.

In any case, switching jobacct_gather/cgroup to linux, OR proctrack/cgroup to linuxproc, OR disabling the Contain (and X11) flag should work around the problem. Of these three options I would go for jobacct_gather/linux. I know I asked you to change proctrack, but that is a test I want to do on your system since your cgroups are not co-mounted, and I want to rule out a couple of situations.

Let me know when you have new information.
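The mismatch described above can be checked directly on an affected node. This is a hedged sketch, not Slurm code: the slurm_*/uid_*/job_* path components are placeholders for a real stuck job, and it simply compares the two candidate locations for the extern step's pids:

```shell
#!/bin/bash
# For a job stuck in CG, see whether the extern step's pids are listed in
# step_extern/cgroup.procs (where proctrack/cgroup looks) or only in
# step_extern/task_*/cgroup.procs (where jobacct_gather/cgroup moved them).
# The path below is a placeholder; substitute the real job's cgroup.
job_cg="/sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_27/step_extern"

count_pids() {
    # print the number of pids listed in a cgroup.procs file (0 if unreadable)
    local f="$1"
    [ -r "$f" ] && wc -l < "$f" || echo 0
}

direct=$(count_pids "${job_cg}/cgroup.procs")
nested=0
for t in "${job_cg}"/task_*/cgroup.procs; do
    [ -e "$t" ] || continue
    nested=$(( nested + $(count_pids "$t") ))
done
echo "step_extern direct pids: ${direct}, nested task_* pids: ${nested}"
# A direct count of 0 with a nonzero nested count would match the failure
# mode described above: proctrack/cgroup finds nothing to signal.
```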
(In reply to Felip Moll from comment #84)
> In any case, switching from jobacct_gather/cgroup to linux , OR
> proctrack/cgroup to linuxproc, OR disabling the flag Contain(and X11) should
> workaround the problem. From these 3 options I would go for the
> jobacct_gather/linux one. I know I asked you to change proctrack, but is a
> test I want to do with your system since your cgroup is not shared, and I
> want to discard a couple of situations.
>
> Let me know when you have new information.

As of this morning we are now running with:

ProctrackType=proctrack/cgroup
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity,task/cgroup
PrologFlags=Alloc,Contain,X11

Is that the correct collection of settings you want us to test?

Best,

jbh
(In reply to John Hanks from comment #85)
> (In reply to Felip Moll from comment #84)
> > In any case, switching from jobacct_gather/cgroup to linux , OR
> > proctrack/cgroup to linuxproc, OR disabling the flag Contain(and X11) should
> > workaround the problem. From these 3 options I would go for the
> > jobacct_gather/linux one. I know I asked you to change proctrack, but is a
> > test I want to do with your system since your cgroup is not shared, and I
> > want to discard a couple of situations.
> >
> > Let me know when you have new information.
>
> As of this morning we are now running with:
>
> ProctrackType=proctrack/cgroup
> MemLimitEnforce=no
> JobAcctGatherParams=NoOverMemoryKill
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> TaskPlugin=task/affinity,task/cgroup
> PrologFlags=Alloc,Contain,X11
>
> Is that the correct collection of settings you want us to test?
>
> Best,
>
> jbh

That should be correct and should not give you errors anymore.

If it still fails I will ask for more information. Specifically: on a failing node, while the job is still CG, and for each cgroup subsystem in 'freezer,memory,cpu', show the associated tasks and the cgroup tree, i.e.:

root@deb5:/sys/fs/cgroup# find . -type d -name job_27
./cpuset/slurm_gamba1/uid_1000/job_27
./memory/slurm_gamba1/uid_1000/job_27   <---- do for it
./freezer/slurm_gamba1/uid_1000/job_27  <---- do for it
./cpuacct/slurm_gamba1/uid_1000/job_27

> cd /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000
> for i in $(find job_27 -name tasks -exec cat '{}' \;); do ps aux|grep $i|grep -v grep; done
> find job_27 -name cgroup.procs -exec echo '{}' \; -exec cat '{}' \;
> find job_27 -type f

Let me know about your status.
Btw, in comment 83 and comment 85 there's a mismatch.

Comment 83 says you are running with:

ProctrackType=proctrack/linuxproc

Comment 85 says you are running with:

ProctrackType=proctrack/cgroup

In any case I am interested in knowing whether the comment 83 change had any benefit, and whether the issue has happened again since comment 85. Either configuration should have worked around the issue.
(In reply to Felip Moll from comment #87)
> Btw, in comment 83 and comment 85 there's a mismatch.
>
> Comment 83 says you are running with
>
> ProctrackType=proctrack/linuxproc
>
> Comment 85 says you are running with
>
> ProctrackType=proctrack/cgroup
>
> In any case I am interested in knowing if comment 83 change had any benefit,
> or if since comment 85 the issue happened again. Both situations should've
> workaround the issue.

Comment 85 is the current state; the comment 83 state was short-lived, and I went through the whole config to verify and arrive at the current (85) state based on rereading your suggestions.

I'm still draining some nodes, but I have been rebooting nodes since that change and have not yet seen the problem hit on a freshly rebooted node under the current configuration. Since I like a lively parameter space to explore, the ongoing reboots have been for a kernel update :)

jbh
(In reply to John Hanks from comment #88) > (In reply to Felip Moll from comment #87) > > Btw, in comment 83 and comment 85 there's a mismatch. > > > > Comment 83 says you are running with > > > > ProctrackType=proctrack/linuxproc > > > > Comment 85 says you are running with > > > > ProctrackType=proctrack/cgroup > > > > > > In any case I am interested in knowing if comment 83 change had any benefit, > > or if since comment 85 the issue happened again. Both situations should've > > workaround the issue. > > Comment 85 is the current state, the Comment 83 state was short-lived and I > went through the whole config to verify and get to the current (85) state > based on rereading your suggestions. > > I'm still draining some nodes, but have been rebooting nodes since that > change and have not yet seen the problem hit on a freshly rebooted node > under the current configuration. Since I like a lively parameter space to > explore the ongoing reboots have been for a kernel update :) > > jbh Thanks, please keep me posted if something new happens.
> Thanks, please keep me posted if something new happens. I woke up this morning to a large batch of nodes draining with reason="Kill task failed". This in itself is not new, been struggling with this for years on and off but I thought I'd describe it in case it's related to our current problem. A pstree shows the offending jobs stuck with a sleep and several slurmstepd processes. ├─slurmd(25725) ├─slurmstepd(48947)─┬─sleep(48952) │ ├─{slurmstepd}(48948) │ ├─{slurmstepd}(48949) │ ├─{slurmstepd}(48950) │ └─{slurmstepd}(48951) ├─slurmstepd(48954)─┬─slurm_script(48965,krystal2)───python(49615)───java(49616)─┬─{java}(49617) │ │ ├─{java}(49618) [ many java processes snipped here ] │ │ ├─{java}(49652) │ │ └─{java}(49653) │ ├─{slurmstepd}(48955) │ ├─{slurmstepd}(48956) │ ├─{slurmstepd}(48957) │ └─{slurmstepd}(48958) ├─slurmstepd(65982)─┬─sleep(65987) │ ├─{slurmstepd}(65983) │ ├─{slurmstepd}(65984) │ ├─{slurmstepd}(65985) │ └─{slurmstepd}(65986) ├─slurmstepd(65989)─┬─slurm_script(65999,abahman)───sh(66029) │ ├─{slurmstepd}(65990) │ ├─{slurmstepd}(65991) │ ├─{slurmstepd}(65992) │ └─{slurmstepd}(65993) A bt of one of these (48947) shows: (gdb) thread apply all bt full Thread 5 (Thread 0x2b068dafb700 (LWP 48948)): #0 0x00002b068ec3c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b068dcb7303 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:373 err = <optimized out> __func__ = "_watch_tasks" #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2b0694a1e700 (LWP 48949)): #0 0x00002b068ec3cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#1 0x00002b068dcb16b0 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 err = <optimized out> i = 4 now = 1554903404 diff = <optimized out> tvnow = {tv_sec = 1554742482, tv_usec = 891838} abs = {tv_sec = 1554903405, tv_nsec = 891838000} __func__ = "_timer_thread" #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2b0694b1f700 (LWP 48950)): #0 0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b068dd4ad77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b06980008d0) at eio.c:364 n = <optimized out> timeout = -1 #2 eio_handle_mainloop (eio=0x1a6ef80) at eio.c:328 retval = 0 pollfds = 0x2b06980008d0 map = 0x2b0698000900 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #3 0x000000000041d98a in _msg_thr_internal (job_arg=0x1a6eb10) at req.c:273 job = 0x1a6eb10 #4 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #5 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2b0695a52700 (LWP 48951)): #0 0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002b069513a942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478 args = 0x1a6c9e0 ret = <optimized out> res = 0 fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}} __func__ = "_oom_event_monitor" #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2b068d9f9040 (LWP 48947)): #0 0x00002b068ef11aea in wait4 () from /usr/lib64/libc.so.6 No symbol table info available. 
#1 0x000000000041084f in _spawn_job_container (job=0x1a6eb10) at mgr.c:1183 jobacct_id = {taskid = 0, nodeid = 0, job = 0x1a6eb10} status = 0 pid = 48952 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, { ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}} rc = 0 x11_pipe = {0, 0} #2 job_manager (job=job@entry=0x1a6eb10) at mgr.c:1292 rc = 0 io_initialized = false ckpt_type = 0x1a62660 "checkpoint/none" err_msg = 0x0 __func__ = "job_manager" #3 0x000000000040d452 in main (argc=1, argv=0x7fffbc2431f8) at slurmstepd.c:174 cli = 0x1a6d370 self = 0x1a63010 msg = 0x1a637f0 rc = 0 launch_params = 0x0 __func__ = "main" An strace of this process shows this in a loop: [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322193, {1554903579, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322195, {1554903580, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfaf3fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2b068dfaf3f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 48948] <... 
futex resumed> ) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322197, {1554903581, 891838000}, ffffffff <unfinished ...> [pid 48948] futex(0x2b068dfaf428, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48948] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 [pid 48948] open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs", O_RDONLY) = 9 [pid 48948] lseek(9, 0, SEEK_CUR) = 0 [pid 48948] lseek(9, 0, SEEK_SET) = 0 [pid 48948] read(9, "4", 1) = 1 [pid 48948] read(9, "8", 1) = 1 [pid 48948] read(9, "9", 1) = 1 [pid 48948] read(9, "5", 1) = 1 [pid 48948] read(9, "2", 1) = 1 [pid 48948] read(9, "\n", 1) = 1 [pid 48948] read(9, "", 1) = 0 [pid 48948] lseek(9, 0, SEEK_SET) = 0 [pid 48948] read(9, "48952\n", 6) = 6 [pid 48948] close(9) = 0 [pid 48948] open("/proc/48952/stat", O_RDONLY) = 9 [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 48948] read(9, "48952 (sleep) S 48947 48952 4894"..., 511) = 320 [pid 48948] open("/proc/48952/status", O_RDONLY) = 15 [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000 [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 [pid 48948] close(15) = 0 [pid 48948] munmap(0x2b068dafc000, 4096) = 0 [pid 48948] close(9) = 0 [pid 48948] open("/proc/48952/io", O_RDONLY) = 9 [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 48948] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93 [pid 48948] open("/proc/48952/status", O_RDONLY) = 15 [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000 [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 [pid 48948] close(15) = 0 [pid 48948] munmap(0x2b068dafc000, 4096) = 0 [pid 48948] close(9) = 
0 [pid 48948] open("/sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq", O_RDONLY) = 9 [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 48948] read(9, "2401000", 7) = 7 [pid 48948] close(9) = 0 [pid 48948] futex(0x2b068dfaf3fc, FUTEX_WAIT_PRIVATE, 10735, NULL <unfinished ...> [pid 48949] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322199, {1554903582, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322201, {1554903583, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 48949] futex(0x2b068dfad124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322203, {1554903584, 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 The job in question shows as running: (scg)[I am root@frankie ~]# sacct -j 8592288 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 8592288 GERM batch ccurtis2 1 RUNNING 0:0 8592288.ext+ extern ccurtis2 1 RUNNING 0:0 although clearly all user parts of it have exited. scancel works to kill this job. I don't have much time to dig into this right now but will add more detail when I do. jbh
(In reply to John Hanks from comment #90) > > > Thanks, please keep me posted if something new happens. > > > I woke up this morning to a large batch of nodes draining with reason="Kill > task failed". This in itself is not new, been struggling with this for years > on and off but I thought I'd describe it in case it's related to our current > problem. > > A pstree shows the offending jobs stuck with a sleep and several slurmstepd > processes. > > > > ├─slurmd(25725) > ├─slurmstepd(48947)─┬─sleep(48952) > │ ├─{slurmstepd}(48948) > │ ├─{slurmstepd}(48949) > │ ├─{slurmstepd}(48950) > │ └─{slurmstepd}(48951) > > ├─slurmstepd(48954)─┬─slurm_script(48965, > krystal2)───python(49615)───java(49616)─┬─{java}(49617) > │ │ > ├─{java}(49618) > [ many java processes snipped here ] > │ │ > ├─{java}(49652) > │ │ > └─{java}(49653) > │ ├─{slurmstepd}(48955) > │ ├─{slurmstepd}(48956) > │ ├─{slurmstepd}(48957) > │ └─{slurmstepd}(48958) > ├─slurmstepd(65982)─┬─sleep(65987) > │ ├─{slurmstepd}(65983) > │ ├─{slurmstepd}(65984) > │ ├─{slurmstepd}(65985) > │ └─{slurmstepd}(65986) > ├─slurmstepd(65989)─┬─slurm_script(65999,abahman)───sh(66029) > │ ├─{slurmstepd}(65990) > │ ├─{slurmstepd}(65991) > │ ├─{slurmstepd}(65992) > │ └─{slurmstepd}(65993) > > > A bt of one of these (48947) shows: > > (gdb) thread apply all bt full > > Thread 5 (Thread 0x2b068dafb700 (LWP 48948)): > #0 0x00002b068ec3c965 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002b068dcb7303 in _watch_tasks (arg=<optimized out>) at > slurm_jobacct_gather.c:373 > err = <optimized out> > __func__ = "_watch_tasks" > #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 > No symbol table info available. 
> > Thread 4 (Thread 0x2b0694a1e700 (LWP 48949)): > #0 0x00002b068ec3cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002b068dcb16b0 in _timer_thread (args=<optimized out>) at > slurm_acct_gather_profile.c:205 > err = <optimized out> > i = 4 > now = 1554903404 > diff = <optimized out> > tvnow = {tv_sec = 1554742482, tv_usec = 891838} > abs = {tv_sec = 1554903405, tv_nsec = 891838000} > __func__ = "_timer_thread" > #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 > No symbol table info available. > > Thread 3 (Thread 0x2b0694b1f700 (LWP 48950)): > #0 0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6 > No symbol table info available. > #1 0x00002b068dd4ad77 in _poll_internal (shutdown_time=<optimized out>, > nfds=2, pfds=0x2b06980008d0) at eio.c:364 > n = <optimized out> > timeout = -1 > #2 eio_handle_mainloop (eio=0x1a6ef80) at eio.c:328 > retval = 0 > pollfds = 0x2b06980008d0 > map = 0x2b0698000900 > maxnfds = 1 > nfds = 2 > n = <optimized out> > shutdown_time = <optimized out> > __func__ = "eio_handle_mainloop" > #3 0x000000000041d98a in _msg_thr_internal (job_arg=0x1a6eb10) at req.c:273 > job = 0x1a6eb10 > #4 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 > No symbol table info available. > #5 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 > No symbol table info available. > > Thread 2 (Thread 0x2b0695a52700 (LWP 48951)): > #0 0x00002b068ef4020d in poll () from /usr/lib64/libc.so.6 > No symbol table info available. 
> #1 0x00002b069513a942 in _oom_event_monitor (x=<optimized out>) at > task_cgroup_memory.c:478 > args = 0x1a6c9e0 > ret = <optimized out> > res = 0 > fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, > revents = 0}} > __func__ = "_oom_event_monitor" > #2 0x00002b068ec38dd5 in start_thread () from /usr/lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002b068ef4aead in clone () from /usr/lib64/libc.so.6 > No symbol table info available. > > Thread 1 (Thread 0x2b068d9f9040 (LWP 48947)): > #0 0x00002b068ef11aea in wait4 () from /usr/lib64/libc.so.6 > No symbol table info available. > #1 0x000000000041084f in _spawn_job_container (job=0x1a6eb10) at mgr.c:1183 > jobacct_id = {taskid = 0, nodeid = 0, job = 0x1a6eb10} > status = 0 > pid = 48952 > jobacct = 0x0 > rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = > 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, { > ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, > __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, > __ru_minflt_word = 0}, {ru_majflt = 0, __ru_majflt_word = 0}, > {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, > __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, > {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, > __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = > 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, > __ru_nivcsw_word = 0}} > rc = 0 > x11_pipe = {0, 0} > #2 job_manager (job=job@entry=0x1a6eb10) at mgr.c:1292 > rc = 0 > io_initialized = false > ckpt_type = 0x1a62660 "checkpoint/none" > err_msg = 0x0 > __func__ = "job_manager" > #3 0x000000000040d452 in main (argc=1, argv=0x7fffbc2431f8) at > slurmstepd.c:174 > cli = 0x1a6d370 > self = 0x1a63010 > msg = 0x1a637f0 > rc = 0 > launch_params = 0x0 > __func__ = "main" > > > An strace of this process shows this in a loop: > > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] 
futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322193, {1554903579, > 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322195, {1554903580, > 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] futex(0x2b068dfaf3fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, > 0x2b068dfaf3f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > [pid 48948] <... futex resumed> ) = 0 > [pid 48949] futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322197, {1554903581, > 891838000}, ffffffff <unfinished ...> > [pid 48948] futex(0x2b068dfaf428, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48948] > stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup. > procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 > [pid 48948] > open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup. 
> procs", O_RDONLY) = 9 > [pid 48948] lseek(9, 0, SEEK_CUR) = 0 > [pid 48948] lseek(9, 0, SEEK_SET) = 0 > [pid 48948] read(9, "4", 1) = 1 > [pid 48948] read(9, "8", 1) = 1 > [pid 48948] read(9, "9", 1) = 1 > [pid 48948] read(9, "5", 1) = 1 > [pid 48948] read(9, "2", 1) = 1 > [pid 48948] read(9, "\n", 1) = 1 > [pid 48948] read(9, "", 1) = 0 > [pid 48948] lseek(9, 0, SEEK_SET) = 0 > [pid 48948] read(9, "48952\n", 6) = 6 > [pid 48948] close(9) = 0 > [pid 48948] open("/proc/48952/stat", O_RDONLY) = 9 > [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 > [pid 48948] read(9, "48952 (sleep) S 48947 48952 4894"..., 511) = 320 > [pid 48948] open("/proc/48952/status", O_RDONLY) = 15 > [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000 > [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 > [pid 48948] close(15) = 0 > [pid 48948] munmap(0x2b068dafc000, 4096) = 0 > [pid 48948] close(9) = 0 > [pid 48948] open("/proc/48952/io", O_RDONLY) = 9 > [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 > [pid 48948] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93 > [pid 48948] open("/proc/48952/status", O_RDONLY) = 15 > [pid 48948] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > [pid 48948] mmap(NULL, 4096, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b068dafc000 > [pid 48948] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 > [pid 48948] close(15) = 0 > [pid 48948] munmap(0x2b068dafc000, 4096) = 0 > [pid 48948] close(9) = 0 > [pid 48948] open("/sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq", > O_RDONLY) = 9 > [pid 48948] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 > [pid 48948] read(9, "2401000", 7) = 7 > [pid 48948] close(9) = 0 > [pid 48948] futex(0x2b068dfaf3fc, FUTEX_WAIT_PRIVATE, 10735, NULL > <unfinished ...> > [pid 48949] <... 
futex resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322199, {1554903582, > 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322201, {1554903583, > 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 48949] futex(0x2b068dfad124, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 322203, {1554903584, > 891838000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) > [pid 48949] futex(0x2b068dfad160, FUTEX_WAKE_PRIVATE, 1) = 0 > > The job in question shows as running: > > > (scg)[I am root@frankie ~]# sacct -j 8592288 > JobID JobName Partition Account AllocCPUS State ExitCode > ------------ ---------- ---------- ---------- ---------- ---------- -------- > 8592288 GERM batch ccurtis2 1 RUNNING 0:0 > 8592288.ext+ extern ccurtis2 1 RUNNING 0:0 > > although clearly all user parts of it have exited. > > scancel works to kill this job. > > I don't have much time to dig into this right now but will add more detail > when I do. > > jbh Thanks John, Can you do: ls -R /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/ Also: cat /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/step_extern/cgroup.procs and if 'find /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288/ -name cgroup.procs' then cat each of them. Then for each pid, correlate it with the pids on the system (ps aux|grep xxx). Let's see if the race is the same than the one I already identified.
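The correlation step requested above (cat each cgroup.procs, then match each pid against running processes) can be wrapped in a short helper. This is a sketch under the assumption that the job cgroup path follows the layout seen in this thread; it is not Slurm code:

```shell
#!/bin/sh
# correlate_pids: for every pid listed in any cgroup.procs file under the
# given job cgroup directory, show the matching process via ps, or flag
# the pid as stale if no such process exists anymore. Stale entries are
# pids left behind in the cgroup after their process already exited.
correlate_pids() {
    find "$1" -name cgroup.procs -exec cat '{}' \; | sort -un |
    while read -r pid; do
        ps -p "$pid" -o pid=,comm= 2>/dev/null || echo "$pid: no such process"
    done
}
```

For example: correlate_pids /sys/fs/cgroup/freezer/slurm/uid_281293/job_8592288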
That node got harvested by my node cleaner, but there are others. Here's one:

[root@sgisummit-frcf-111-10 ~]# ls -R /sys/fs/cgroup/freezer/slurm/uid_324759
/sys/fs/cgroup/freezer/slurm/uid_324759:
cgroup.clone_children  freezer.parent_freezing  job_8611335
cgroup.event_control   freezer.self_freezing    notify_on_release
cgroup.procs           freezer.state            tasks

/sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335:
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    step_batch
cgroup.procs           freezer.state            tasks

/sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/step_batch:
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    tasks
cgroup.procs           freezer.state

There is no step_extern in the cgroups for this one, and the other cgroup.procs files appear to be empty. FWIW, the other job on this node, which otherwise still appears to be normal, also has empty cgroup.procs files.

[root@sgisummit-frcf-111-10 ~]# cd /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/
[root@sgisummit-frcf-111-10 job_8611335]# ls
cgroup.clone_children  freezer.parent_freezing  notify_on_release
cgroup.event_control   freezer.self_freezing    step_batch
cgroup.procs           freezer.state            tasks
[root@sgisummit-frcf-111-10 job_8611335]# cat cgroup.procs
[root@sgisummit-frcf-111-10 job_8611335]# cat step_batch/cgroup.procs
[root@sgisummit-frcf-111-10 job_8611335]#

Since this is a different node/job, strace and backtrace below for the sake of completeness.
[pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310797, {1554909120, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310799, {1554909121, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169307] futex(0x2ba79a9e93fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2ba79a9e93f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 169306] <... futex resumed> ) = 0 [pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310801, {1554909122, 234054000}, ffffffff <unfinished ...> [pid 169306] futex(0x2ba79a9e9428, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169306] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 [pid 169306] open("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", O_RDONLY) = 9 [pid 169306] lseek(9, 0, SEEK_CUR) = 0 [pid 169306] lseek(9, 0, SEEK_SET) = 0 [pid 169306] read(9, "1", 1) = 1 [pid 169306] read(9, "6", 1) = 1 [pid 169306] read(9, "9", 1) = 1 [pid 169306] read(9, "3", 1) = 1 [pid 169306] read(9, "1", 1) = 1 [pid 169306] read(9, "0", 1) = 1 [pid 169306] read(9, "\n", 1) = 1 [pid 169306] read(9, "", 1) = 0 [pid 169306] lseek(9, 0, SEEK_SET) = 0 [pid 169306] read(9, "169310\n", 7) = 7 [pid 169306] close(9) = 0 [pid 169306] open("/proc/169310/stat", O_RDONLY) = 9 [pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 169306] read(9, "169310 (sleep) S 169305 169310 1"..., 511) = 326 [pid 169306] open("/proc/169310/status", O_RDONLY) = 15 [pid 169306] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 169306] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba79a536000 [pid 169306] read(15, 
"Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 [pid 169306] close(15) = 0 [pid 169306] munmap(0x2ba79a536000, 4096) = 0 [pid 169306] close(9) = 0 [pid 169306] open("/proc/169310/io", O_RDONLY) = 9 [pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 169306] read(9, "rchar: 2012\nwchar: 0\nsyscr: 6\nsy"..., 255) = 93 [pid 169306] open("/proc/169310/status", O_RDONLY) = 15 [pid 169306] fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 169306] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba79a536000 [pid 169306] read(15, "Name:\tsleep\nUmask:\t0022\nState:\tS"..., 1024) = 1024 [pid 169306] close(15) = 0 [pid 169306] munmap(0x2ba79a536000, 4096) = 0 [pid 169306] close(9) = 0 [pid 169306] open("/sys/devices/system/cpu/cpu12/cpufreq/cpuinfo_cur_freq", O_RDONLY) = 9 [pid 169306] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 [pid 169306] read(9, "3199951", 7) = 7 [pid 169306] close(9) = 0 [pid 169306] futex(0x2ba79a9e93fc, FUTEX_WAIT_PRIVATE, 10363, NULL <unfinished ...> [pid 169307] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310803, {1554909123, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 169307] futex(0x2ba79a9e7160, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 169307] futex(0x2ba79a9e7124, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 310805, {1554909124, 234054000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) (gdb) thread apply all bt full Thread 5 (Thread 0x2ba79a535700 (LWP 169306)): #0 0x00002ba79b676965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002ba79a6f1303 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:373 err = <optimized out> __func__ = "_watch_tasks" #2 0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#3 0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2ba7a1458700 (LWP 169307)): #0 0x00002ba79b676d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002ba79a6eb6b0 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 err = <optimized out> i = 4 now = 1554909165 diff = <optimized out> tvnow = {tv_sec = 1554753721, tv_usec = 234054} abs = {tv_sec = 1554909166, tv_nsec = 234054000} __func__ = "_timer_thread" #2 0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2ba7a1559700 (LWP 169308)): #0 0x00002ba79b97a20d in poll () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x00002ba79a784d77 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2ba7a40008d0) at eio.c:364 n = <optimized out> timeout = -1 #2 eio_handle_mainloop (eio=0x1760ad0) at eio.c:328 retval = 0 pollfds = 0x2ba7a40008d0 map = 0x2ba7a4000900 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #3 0x000000000041d98a in _msg_thr_internal (job_arg=0x1759730) at req.c:273 job = 0x1759730 #4 0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #5 0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2ba7a248c700 (LWP 169309)): #0 0x00002ba79b97a20d in poll () from /usr/lib64/libc.so.6 No symbol table info available. 
#1 0x00002ba7a1b74942 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:478 args = 0x175e610 ret = <optimized out> res = 0 fds = {{fd = 12, events = 3, revents = 0}, {fd = 13, events = 8193, revents = 0}} __func__ = "_oom_event_monitor" #2 0x00002ba79b672dd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #3 0x00002ba79b984ead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2ba79a433040 (LWP 169305)): #0 0x00002ba79b94baea in wait4 () from /usr/lib64/libc.so.6 No symbol table info available. #1 0x000000000041084f in _spawn_job_container (job=0x1759730) at mgr.c:1183 jobacct_id = {taskid = 0, nodeid = 0, job = 0x1759730} status = 0 pid = 169310 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 0, __ru_minflt_word = 0}, { ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}} rc = 0 x11_pipe = {0, 0} #2 job_manager (job=job@entry=0x1759730) at mgr.c:1292 rc = 0 io_initialized = false ckpt_type = 0x17556b0 "checkpoint/none" err_msg = 0x0 __func__ = "job_manager" #3 0x000000000040d452 in main (argc=1, argv=0x7ffce73c4188) at slurmstepd.c:174 cli = 0x175b3a0 self = 0x175b3d0 msg = 0x1759a60 rc = 0 launch_params = 0x0 __func__ = "main" (gdb)
(In reply to John Hanks from comment #92) > That node got harvested by my node cleaner, but there are others. Here's one: > > [root@sgisummit-frcf-111-10 ~]# ls -R /sys/fs/cgroup/freezer/slurm/uid_324759 > /sys/fs/cgroup/freezer/slurm/uid_324759: > cgroup.clone_children freezer.parent_freezing job_8611335 > cgroup.event_control freezer.self_freezing notify_on_release > cgroup.procs freezer.state tasks > > /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335: > cgroup.clone_children freezer.parent_freezing notify_on_release > cgroup.event_control freezer.self_freezing step_batch > cgroup.procs freezer.state tasks > > /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335/step_batch: > cgroup.clone_children freezer.parent_freezing notify_on_release > cgroup.event_control freezer.self_freezing tasks > cgroup.procs freezer.state > > There is no step_batch in the cgroups for this one, but other cgroup.procs > appear to be empty. FWIW, the other job on this node which otherwise still > appears to be normal also has empty cgroup.procs files. The issue is that there is no /sys/fs/cgroup/freezer/*/step_extern/cgroup.procs... Note that you're pasting the backtrace of pid 169307, which doesn't seem to correspond to the ls -R you did: [pid 169306] stat("/sys/fs/cgroup/freezer/slurm/uid_281293/job_8593185/step_extern/cgroup.procs", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 The backtrace shows the pid trying to read the step_extern cgroup.procs, which doesn't exist, so the race is very similar to the one I already found. The step_extern directory should be there if you have PrologFlags=Contain (or X11).
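A node can be scanned for job cgroups that are missing the step_extern subdirectory (expected whenever PrologFlags=Contain or X11 is in effect) with a small sketch like the one below. The helper name is made up, and the base path follows the ls -R layout shown above; adjust it for your cgroup mount.

```shell
# Hypothetical helper: report job cgroups under a base directory that lack
# a step_extern subdirectory. On a live node the base would typically be
# /sys/fs/cgroup/freezer/slurm (layout as in the ls -R output above).
check_step_extern() {
    base="$1"
    for job in "$base"/uid_*/job_*; do
        [ -d "$job" ] || continue            # skip if the glob matched nothing
        if [ ! -d "$job/step_extern" ]; then
            echo "missing step_extern: $job"
        fi
    done
}

# Example invocation (commented out; requires a real cgroup mount):
# check_step_extern /sys/fs/cgroup/freezer/slurm
```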
Sorry, bad tracking of jobids and pids on my part, too many open terminals. Let me try that again. The job that caused the problem is gone now (as far as processes), it was job 8611335 (scg)[I am root@frankie ~]# scontrol show jobid=8611335 slurm_load_jobs error: Invalid job id specified (scg)[I am root@frankie ~]# sacct -j 8611335 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- (scg)[I am root@frankie ~]# It did leave behind: [root@sgisummit-frcf-111-10 job_8611335]# pwd /sys/fs/cgroup/freezer/slurm/uid_324759/job_8611335 [root@sgisummit-frcf-111-10 job_8611335]# tree . ├── cgroup.clone_children ├── cgroup.event_control ├── cgroup.procs ├── freezer.parent_freezing ├── freezer.self_freezing ├── freezer.state ├── notify_on_release ├── step_batch │ ├── cgroup.clone_children │ ├── cgroup.event_control │ ├── cgroup.procs │ ├── freezer.parent_freezing │ ├── freezer.self_freezing │ ├── freezer.state │ ├── notify_on_release │ └── tasks └── tasks [root@sgisummit-frcf-111-10 job_8611335]# cat cgroup.procs [root@sgisummit-frcf-111-10 job_8611335]# cat step_batch/cgroup.procs [root@sgisummit-frcf-111-10 job_8611335]# The log messages I have for it are: (scg)[I am root@frankie ~]# grep 8611335 /srv/logs/messages-20190409 Apr 9 18:30:12 frankie slurmctld[171811]: _slurm_rpc_submit_batch_job: JobId=8611335 InitPrio=270 usec=3141 Apr 9 19:05:41 frankie slurmctld[171811]: backfill: Started JobId=8611335 in batch on sgisummit-frcf-111-10 Apr 9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task_p_slurmd_batch_request: 8611335 Apr 9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task/affinity: job 8611335 CPU input mask for node: 0x002000 Apr 9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: task/affinity: job 8611335 CPU final HW mask for node: 0x002000 Apr 9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: _run_prolog: prolog with lock for job 8611335 ran for 0 seconds Apr 9 19:05:43 
sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB Apr 9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB Apr 9 19:05:43 sgisummit-frcf-111-10 slurmd[25808]: Launching batch job 8611335 for UID 324759 Apr 9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71331]: task/cgroup: /slurm/uid_324759/job_8611335: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB Apr 9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71331]: task/cgroup: /slurm/uid_324759/job_8611335/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB Apr 9 20:05:42 frankie slurmctld[171811]: Time limit exhausted for JobId=8611335 Apr 9 20:05:42 sgisummit-frcf-111-10 slurmstepd[71331]: error: *** JOB 8611335 ON sgisummit-frcf-111-10 CANCELLED AT 2019-04-09T20:05:42 DUE TO TIME LIMIT *** Apr 9 20:08:50 frankie slurmctld[171811]: Resending TERMINATE_JOB request JobId=8611335 Nodelist=sgisummit-frcf-111-10 Apr 9 20:09:43 sgisummit-frcf-111-10 slurmstepd[71331]: error: *** JOB 8611335 STEPD TERMINATED ON sgisummit-frcf-111-10 AT 2019-04-09T20:09:42 DUE TO JOB NOT ENDING WITH SIGNALS *** Apr 9 20:09:43 frankie slurmctld[171811]: error: slurmd error running JobId=8611335 on node(s)=sgisummit-frcf-111-10: Kill task failed Apr 9 20:09:52 frankie slurmctld[171811]: cleanup_completing: JobId=8611335 completion process took 250 seconds jbh
Quick note about the scope of this: on April 9 we had 50 jobs die this way, and none of them have records in the accounting database.
Found one still in progress. Job has reached the end of its time limit: (scg)[I am root@frankie ~]# squeue -j 8610041 JOBID PARTITION NAME USER ACCOUNT ST TIME_LEFT TIME_LIMIT NODES CPUS MIN_MEMORY TRES_PER NODELIST(REASON) 8610041 batch vcf2fasta.sh kwalter jandr CG INVALID 10:00 1 1 4G N/A sgisummit-frcf-111-30 On the node we have: [root@sgisummit-frcf-111-30 job_8610041]# pwd /sys/fs/cgroup/freezer/slurm/uid_324759/job_8610041 [root@sgisummit-frcf-111-30 job_8610041]# cat cgroup.procs 17138 [root@sgisummit-frcf-111-30 job_8610041]# cat step_batch/cgroup.procs 18556 There are remnants of many of this user's jobs in /sys/fs/cgroup: [root@sgisummit-frcf-111-30 uid_324759]# pwd /sys/fs/cgroup/freezer/slurm/uid_324759 [root@sgisummit-frcf-111-30 uid_324759]# ls cgroup.clone_children freezer.self_freezing job_8610086 job_8610112 tasks cgroup.event_control freezer.state job_8610109 job_8610177 cgroup.procs job_8610041 job_8610110 job_8610178 freezer.parent_freezing job_8610064 job_8610111 notify_on_release Looking at a sample of those, no job record: (scg)[I am root@frankie ~]# sacct -j 8610041,8610064,8610086,8610177 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- (scg)[I am root@frankie ~]# Process tree looks like: ├─slurmd(25689)─┬─{slurmd}(22892) │ ├─{slurmd}(234651) │ ├─{slurmd}(96458) │ ├─{slurmd}(218953) │ ├─{slurmd}(89826) │ ├─{slurmd}(53393) │ ├─{slurmd}(846) │ └─{slurmd}(182218) ├─slurmstepd(17138)─┬─slurm_script(18556,kwalter)───sh(21597) │ ├─{slurmstepd}(17139) │ ├─{slurmstepd}(17140) │ ├─{slurmstepd}(17141) │ ├─{slurmstepd}(17142) │ └─{slurmstepd}(24210) and attaching to pid 25689 results in a non-responsive gdb.
strace shows: [root@sgisummit-frcf-111-30 ~]# strace -ff -p 25689 strace: Process 25689 attached with 9 threads [pid 182218] read(24, <unfinished ...> [pid 846] read(22, <unfinished ...> [pid 53393] read(20, <unfinished ...> [pid 89826] read(18, <unfinished ...> [pid 218953] read(16, <unfinished ...> [pid 96458] read(14, <unfinished ...> [pid 234651] read(11, <unfinished ...> [pid 22892] read(7, <unfinished ...> [pid 25689] accept(5, Those all seem to be reading from a socket: [root@sgisummit-frcf-111-30 ~]# ls -l /proc/182218/fd/24 lrwx------ 1 root root 64 Apr 10 09:15 /proc/182218/fd/24 -> socket:[7847063] [root@sgisummit-frcf-111-30 ~]# ls -l /proc/846/fd/22 lrwx------ 1 root root 64 Apr 10 09:15 /proc/846/fd/22 -> socket:[7781149] [root@sgisummit-frcf-111-30 ~]# ls -l /proc/53393/fd/20 lrwx------ 1 root root 64 Apr 10 09:15 /proc/53393/fd/20 -> socket:[7819303] [root@sgisummit-frcf-111-30 ~]# ls -l /proc/89826/fd/18 lrwx------ 1 root root 64 Apr 10 09:16 /proc/89826/fd/18 -> socket:[7800699] which is connected to the slurmd? [root@sgisummit-frcf-111-30 ~]# netstat -anp | grep 7847063 unix 3 [ ] STREAM CONNECTED 7847063 25689/slurmd [root@sgisummit-frcf-111-30 ~]# netstat -anp | grep 7781149 unix 3 [ ] STREAM CONNECTED 7781149 25689/slurmd Note, if I leave strace on there long enough I can see nhc run (health check) but then it goes back to the above state. I managed to kill slurmd (kill -11 PID) and now it hangs on startup: [root@sgisummit-frcf-111-30 ~]# slurmd -D -v -v -v -v -v -v -v -v slurmd: debug: Log file re-opened slurmd: debug4: found jobid = 8610041, stepid = 4294967294 slurmd: debug3: Trying to load plugin /scg/slurm/installed/slurm-18.08.6/lib/slurm/auth_munge.so slurmd: debug: Munge authentication plugin loaded slurmd: debug3: Success. 
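The manual ls -l /proc/.../fd plus netstat steps above can be folded into one sketch that walks every thread of a pid and prints its socket fds, ready to match against netstat/ss by inode. The helper name is made up.

```shell
# Sketch: list the socket fds of every thread of a process, so blocked
# read()s (as in the strace output above) can be matched to socket inodes
# and then to the peer process via netstat -anp or ss -xp.
socket_fds() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        for fd in "$task"/fd/*; do
            target=$(readlink "$fd" 2>/dev/null) || continue
            case "$target" in
                socket:*) echo "tid=$tid fd=${fd##*/} $target" ;;
            esac
        done
    done
}

# e.g. socket_fds 25689, then look up each socket:[inode] with: ss -xp
```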
Starting slurmd under gdb and letting it run for 30 seconds produces this bt: (gdb) thread apply all bt full Thread 1 (Thread 0x2aaaaaadb640 (LWP 148207)): #0 0x00002aaaac3566e0 in __read_nocancel () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002aaaab244b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x66fe68) at stepd_api.c:282 remaining = <optimized out> ptr = 0x7fffffffd908 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x65ccf0 len = 16384 __func__ = "stepd_connect" #2 0x000000000040b9b5 in _update_logging () at slurmd.c:2025 fd = <optimized out> steps = 0x638ec0 i = 0x66b940 stepd = 0x66fe50 o = <optimized out> cf = <optimized out> #3 0x000000000040d130 in _read_config () at slurmd.c:921 path_pubkey = 0x0 cf = 0x633f40 <slurmctld_conf> cc = <optimized out> cr_flag = true gang_flag = false tok = <optimized out> save_ptr = 0x0 over_memory_kill = false __func__ = "_read_config" #4 0x000000000040eb57 in _slurmd_init () at slurmd.c:1543 rlim = {rlim_cur = 18374686479671623680, rlim_max = 71776119061217280} stat_buf = {st_dev = 7596848497658783603, st_ino = 3543601059117542772, st_nlink = 18446462805749870897, st_mode = 4294967295, st_uid = 4294967295, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}} cpu_cnt = <optimized out> #5 main (argc=10, argv=0x7fffffffdcd8) at slurmd.c:261 i = <optimized out> pidfd = <optimized out> blocked_signals = {13, 0} cc = <optimized out> oom_value = <optimized out> slurmd_uid = 0 curr_uid = 0 time_stamp = '\000' <repeats 128 times>, "\002", '\000' <repeats 111 times>... 
__func__ = "main" The hung job is still there: [root@sgisummit-frcf-111-30 ~]# strace -ff -p 17138 strace: Process 17138 attached with 6 threads [pid 17142] restart_syscall(<... resuming interrupted poll ...> <unfinished ...> [pid 17141] futex(0x2b515b002520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 17140] futex(0x2b515b002520, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 17139] futex(0x2b515b0073fc, FUTEX_WAIT_PRIVATE, 61, NULL <unfinished ...> [pid 17138] wait4(-1, But I can't interrupt to get a bt or convince it to dump core. jbh
slurmd *did* dump a core file at some point, I think maybe when I originally killed it with -11 but possibly during one of the startup attempts after that (timing is pretty close). It shows: (gdb) thread apply all bt full Thread 9 (Thread 0x2b93a0b3c700 (LWP 182218)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93fc01a638) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a0b3b958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93fc001450 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b93fc000cb0) at slurmd.c:770 fd = <optimized out> steps = 0x14fc370 stepd = 0x2b93fc01a620 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d700 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b93fc000cb0 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93fc0008d0) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x151a8c0 msg = 0x2b93fc0008d0 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info 
available. #7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 8 (Thread 0x2b93a2438700 (LWP 846)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b9400004be8) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a2437958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b9400017d20 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b9400004d50) at slurmd.c:770 fd = <optimized out> steps = 0x14fcb40 stepd = 0x2b9400004bd0 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d6e0 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b9400004d50 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b9400000970) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x1564520 msg = 0x2b9400000970 __func__ = "_service_connection" ---Type <return> to continue, or q <return> to quit--- rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 7 (Thread 0x2b93a2d41700 (LWP 89826)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b939c01cd98) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a2d40958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b939c008400 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b939c01f040) at slurmd.c:770 fd = <optimized out> steps = 0x14fcfa0 stepd = 0x2b939c01cd80 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d620 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b939c01f040 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b939c01df00) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x150d5c0 msg = 0x2b939c01df00 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. 
Thread 6 (Thread 0x2b93a1f33700 (LWP 234651)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93e0019f08) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a1f32958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93e0003ce0 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b93e0000a80) at slurmd.c:770 fd = <optimized out> steps = 0x14fc140 stepd = 0x2b93e0019ef0 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 1 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d640 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b93e0000a80 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93e0000970) at req.c:388 rc = 0 ---Type <return> to continue, or q <return> to quit--- __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x1561720 msg = 0x2b93e0000970 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. 
Thread 5 (Thread 0x2b93a3549700 (LWP 22892)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=jobid@entry=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93f001bc28) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a3548a58 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93f0001420 len = 16384 __func__ = "stepd_connect" #2 0x0000000000416d12 in _job_still_running (job_id=8610041) at req.c:4661 fd = <optimized out> retval = false steps = 0x14fc870 i = 0x152d660 s = 0x2b93f001bc10 #3 _pause_for_job_completion (job_id=8610041, nodes=<optimized out>, max_time=max_time@entry=0) at req.c:5752 sec = 241 pause = 10 rc = <optimized out> count = 4 #4 0x0000000000418bff in _rpc_terminate_job (msg=msg@entry=0x2b93f0007280) at req.c:5554 have_spank = <optimized out> rc = 0 req = 0x2b93f0005c20 uid = 383 nsteps = <optimized out> delay = <optimized out> job_env = {jobid = 1, step_id = 0, node_list = 0x0, partition = 0x17f <Address 0x17f out of bounds>, resv_id = 0x416c5d <_kill_all_active_steps+213> "E\205\344u&E\205\377\270u\220B", spank_job_env = 0xfffe0000, spank_job_env_size = 0, uid = 15, user_name = 0xf00000000 <Address 0xf00000000 out of bounds>} __func__ = "_rpc_terminate_job" #5 0x0000000000418f3c in _rpc_timelimit (msg=msg@entry=0x2b93f0007280) at req.c:3914 uid = 383 req = 0x2b93f0005c20 nsteps = <optimized out> rc = <optimized out> #6 0x00000000004196ec in slurmd_req (msg=msg@entry=0x2b93f0007280) at req.c:337 rc = <optimized out> __func__ = "slurmd_req" #7 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x1547ff0 msg = 0x2b93f0007280 __func__ = "_service_connection" rc = 0 #8 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table 
info available. #9 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2b93a2e42700 (LWP 53393)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93bc01a288) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a2e41958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93bc001fa0 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b93bc01b0e0) at slurmd.c:770 fd = <optimized out> steps = 0x14fc230 stepd = 0x2b93bc01a270 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, ---Type <return> to continue, or q <return> to quit--- machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d6c0 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b93bc01b0e0 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93bc01af50) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x154bdf0 msg = 0x2b93bc01af50 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. 
#7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2b93a2034700 (LWP 218953)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93f8003528) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a2033958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93f8016ef0 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b93f80008d0) at slurmd.c:770 fd = <optimized out> steps = 0x14fc6e0 stepd = 0x2b93f8003510 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d6a0 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b93f80008d0 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93f8000e90) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x151a5e0 msg = 0x2b93f8000e90 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. 
Thread 2 (Thread 0x2b93a3347700 (LWP 96458)): #0 0x00002b9395b326fd in read () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x00002b9394a20b36 in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=8610041, stepid=4294967294, protocol_version=protocol_version@entry=0x2b93ac0188e8) at stepd_api.c:282 remaining = <optimized out> ptr = 0x2b93a3346958 "" rc = <optimized out> req = 0 fd = <optimized out> rc = 0 auth_cred = <optimized out> auth_info = 0x0 local_nodename = 0x0 buffer = 0x2b93ac015e90 len = 16384 __func__ = "stepd_connect" #2 0x000000000040df81 in _fill_registration_msg (msg=0x2b93ac01bef0) at slurmd.c:770 ---Type <return> to continue, or q <return> to quit--- fd = <optimized out> steps = 0x14fc690 stepd = 0x2b93ac0188d0 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "sgisummit-frcf-111-30.scg.stanford.edu", '\000' <repeats 26 times>, release = "3.10.0-957.10.1.el7.x86_64", '\000' <repeats 38 times>, version = "#1 SMP Mon Mar 18 15:06:45 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>} n = 0 os = <optimized out> slurmd_start_time = 1554152209 i = 0x152d680 arch = <optimized out> gres_info = <optimized out> first_msg = false #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:629 ret_val = 0 msg = 0x2b93ac01bef0 __func__ = "send_registration_msg" #4 0x000000000041a6dc in slurmd_req (msg=msg@entry=0x2b93ac01c020) at req.c:388 rc = 0 __func__ = "slurmd_req" #5 0x000000000040c973 in _service_connection (arg=<optimized out>) at slurmd.c:549 con = 0x150a390 msg = 0x2b93ac01c020 __func__ = "_service_connection" rc = 0 #6 0x00002b9395b2bdd5 in start_thread () from /usr/lib64/libpthread.so.0 No symbol table info available. #7 0x00002b9395e3dead in clone () from /usr/lib64/libc.so.6 No symbol table info available. 
Thread 1 (Thread 0x2b93942b5600 (LWP 25689)): #0 0x00002b9395b2fd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 No symbol table info available. #1 0x000000000040ca6b in _wait_for_all_threads (secs=secs@entry=120) at slurmd.c:501 ts = {tv_sec = 1554913503, tv_nsec = 0} rc = <optimized out> __func__ = "_wait_for_all_threads" #2 0x000000000040fb16 in main (argc=1, argv=0x7ffe9b39f5f8) at slurmd.c:389 i = <optimized out> pidfd = 6 blocked_signals = {13, 0} cc = <optimized out> oom_value = <optimized out> slurmd_uid = <optimized out> curr_uid = <optimized out> time_stamp = "Mon, 01 Apr 2019 13:56:47 -0700", '\000' <repeats 129 times>, "\002", '\000' <repeats 79 times>... __func__ = "main" (gdb)
In all the fun I forgot to check for our original symptom for this bug, we have zombies: [root@sgisummit-frcf-111-30 slurmd]# ps aux | grep Z USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND kwalter 21597 0.0 0.0 0 0 ? Z Apr09 0:00 [sh] <defunct> kwalter 25592 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25593 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25595 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25596 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25597 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25599 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25600 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> kwalter 25601 0.0 0.0 0 0 ? Zl Apr09 0:01 [javac] <defunct> jbh
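A grep on "Z" also matches unrelated lines, so a slightly tighter sketch that pairs each zombie with its parent pid (making the unreaping slurmstepd visible) might look like this; the helper name is made up and the field names are standard procps:

```shell
# Sketch: list zombie processes with their parent pid, so the parent
# (e.g. a stuck slurmstepd) that is failing to reap them is visible.
list_zombies() {
    ps -eo pid,ppid,stat,user,comm | awk 'NR > 1 && $3 ~ /^Z/'
}

# Then inspect each parent, e.g.: ps -p <ppid> -o pid,comm,stat,wchan
```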
Again, I cannot see evidence that the step_extern subdirectory exists. In all your examples it should be there, but it isn't: Apr 9 19:05:43 sgisummit-frcf-111-10 slurmstepd[71324]: task/cgroup: /slurm/uid_324759/job_8611335/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=15360MB Thanks for the info, I'll dig more. I'm sorry this is taking so long; the issue is not trivial, and there are many different situations (at other customer sites too) that are still confusing.
I've checked the remaining nodes with problem jobs and confirmed that there is no step_extern in /sys/fs/cgroup/slurm/.... for any of them. Will keep an eye out for any more changes in behavior of these.
John, After analyzing your backtraces line by line, and given that after you restarted slurmd most of its threads were hung in stepd_connect(), I went through the code in slurmd.c and the calls to this function, but again without finding the cause. stepd_connect() is called in just two places: _update_logging() and _fill_registration_msg(). Your backtraces confirm that the slurmd threads are stuck in these two functions, which suggests the problem is in slurmstepd, since all slurmd is doing is connecting to the stepd socket. The call looks like: fd = stepd_connect(stepd->directory, stepd->nodename, stepd->jobid, stepd->stepid, &stepd->protocol_version); As already stated many times in this bug, you cannot even attach to slurmstepd: .......... Reading symbols from /scg/slurm/installed/slurm-18.08.6/sbin/slurmstepd...done. Attaching to program: /scg/slurm/current/sbin/slurmstepd, process 103421 ^C^C^C^C^C^C^C^C And strace is a dead end: [root@dper7425-srcf-d15-27 ~]# strace -ff -p 103421 strace: Process 103421 attached ........... I suspect slurmd is in the same situation: it is trying to connect to an unresponsive peer on that socket. stepd_connect() in the end just does a connect() syscall. At this point I think we need to set aside slurmd and the other components and focus on slurmstepd and why it is so thoroughly hung. We should be able to attach, as we already discussed earlier in this bug. What can make a program unattachable, aside from I/O problems? I also think that over the course of this bug you've hit some other issues and we've mixed in unrelated situations, but the original problem is the one presented in your last comments. One test I'd like you to do is to attach with gdb to a hung slurmstepd and then try to send SIGCONT (and do a next in gdb), or, if it doesn't respond at all, SIGABRT or SIGKILL. We need to see what slurmstepd is doing. 
You can even try to attach to the slurmstepd threads one by one and extract a bt for each of them, until you find one that is not attachable. If slurmstepd doesn't respond to any signal other than SIGKILL, then when slurmd tries to SIGTERM it, it indeed won't terminate, and errors get logged, as was seen: --- Aug 9 20:12:36 frankie slurmctld[121446]: Resending TERMINATE_JOB request JobId=3092791 Nodelist=sgiuv300-srcf-d10-01 Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041358] INFO: task slurmstepd:377005 blocked for more than 120 seconds. Aug 9 20:13:16 sgiuv300-srcf-d10-01 kernel: [ 9512.041361] slurmstepd D ffff96cac7dd0000 0 377005 1 0x00000080 Aug 9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd[377480]: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT 2018-08-09T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS *** --- Your user/program/environment/filesystem is somehow able to trigger these conditions...
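The per-thread suggestion above can be scripted. This sketch (the function name is mine) tries each tid with a bounded attach so one stuck thread doesn't wedge the whole loop; it assumes gdb and coreutils timeout(1) are installed on the node.

```shell
# Sketch: attempt a backtrace of every thread of a (possibly hung) pid,
# bounding each gdb attach with a timeout. A tid whose attach times out
# is likely the unattachable one (e.g. stuck in D state on I/O).
bt_all_threads() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        echo "=== tid $tid ==="
        timeout 10 gdb -batch -p "$tid" -ex bt 2>&1 \
            || echo "(attach failed or timed out)"
    done
}

# e.g. bt_all_threads 103421
```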
Hi John, Have you been able to run the test from my previous comment? I really do think we should be able to attach to these processes if they are not in a D state waiting for I/O, at least with strace. Otherwise I really don't know how to debug this, and I am again inclined to suspect some kind of system issue. Just to recall: is this error happening on systems with the same node image?
Hi Felip,

We've identified an issue with our Isilon storage which causes access to individual files on an NFS mount to hang indefinitely while access to all other files on the same mount keeps working. I don't have a direct link from this to the Slurm problem yet, and I haven't been able to catch a node with this problem recently to do any further testing. I am talking to Isilon about it this week, but again, from their side it occurs rarely and I have no reliable way to reproduce it, so I expect it to take time to sort out. Hopefully I will have something to report soon.

Regardless of the outcome, I still think that Slurm dropping job accounting records for any reason is a Slurm bug that should be addressed.

Best,
jbh
An issue with the storage like the one you're describing would make a lot of sense given the behavior we're observing. Maybe the 'soft' NFS mount option could help a bit with this, at the cost of possibly corrupting data. I will wait for more information from your side, especially if you are able to correlate the issues. I want to point out that this happened on multiple systems, not only on the UV300, which reinforces the theory.

> Regardless of the outcome
> though I still think that Slurm dropping job accounting records for any
> reason is a Slurm bug that should be addressed.

If you're seeing this in 18.08.6, it is probably caused by a regression introduced in that specific version. The issue was introduced in commit ade9101e95925b01005b465c5451e243578c2fd8 in 18.08.6 and was fixed later in 18.08.7, commit 3361bf611c61de3bb90f8cadbacf58b4d1dc8707. See Bug 6697.

The workaround is to specify a start time with '-S' even if you are specifying -j. Tell me if this is the case. If it is not, I'll ask for a direct query in the database to see whether the jobs are really accounted.
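For reference, the workaround invocation looks roughly like the following. The job ID and start date are only illustrative; the command is kept as a string and echoed here so the sketch runs even without a Slurm cluster at hand. On a real cluster you would run the sacct line directly:

```shell
#!/bin/sh
# On 18.08.6, 'sacct -j <jobid>' alone can return nothing due to the
# regression; adding an explicit start time with -S works around it.
# Job ID and date below are placeholders.
sacct_cmd="sacct -S 2019-05-01 -j 3092791 --format=JobID,State,Start,End"
echo "$sacct_cmd"
```

On 18.08.7 and later the explicit -S is no longer needed for -j queries.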
The sacct problem is resolved; it works fine now. I missed that it was a different bug/problem. I will let you know how our storage investigation goes.
(In reply to John Hanks from comment #105) > The sacct problem is resolved, it works fine now. I missed that it was a > different bug/problem. Great, did you upgrade? > Will let you know how our storage investigations go. Please, keep me updated.
(In reply to Felip Moll from comment #106) > (In reply to John Hanks from comment #105) > > The sacct problem is resolved, it works fine now. I missed that it was a > > different bug/problem. > > Great, did you upgrade? > Yes, updating the ticket to show 18.08.7.
Hi John, How is everything going? Did you fix your storage? Is the problem gone afterwards?
(In reply to Felip Moll from comment #108) > Hi John, > > How is everything going? Did you fix your storage? Is the problem gone > afterwards? Since changing my mount options for NFS from "hard,intr" to "soft,intr" we've had a few nodes enter "Kill task failed" state, but I haven't found any hanging files on them. We've also done some kernel updates along the way. There's not much reason to leave this open until I catch the problem happening again and can provide some debugging feedback, maybe we'll get lucky and the 'soft' change will be a sufficient workaround. I'll mark this resolved, cannotreproduce and will re-open if that status changes.
(In reply to John Hanks from comment #109)
> (In reply to Felip Moll from comment #108)
> > Hi John,
> >
> > How is everything going? Did you fix your storage? Is the problem gone
> > afterwards?
>
> Since changing my mount options for NFS from "hard,intr" to "soft,intr"
> we've had a few nodes enter "Kill task failed" state, but I haven't found
> any hanging files on them. We've also done some kernel updates along the
> way.
>
> There's not much reason to leave this open until I catch the problem
> happening again and can provide some debugging feedback, maybe we'll get
> lucky and the 'soft' change will be a sufficient workaround. I'll mark this
> resolved, cannotreproduce and will re-open if that status changes.

Understood, reopen as needed. If you still want me to check on these nodes with the kill task failed state, attach the logs here, but I think you definitely have some issue with this NFS.

Note that the intr/nointr option is deprecated and ignored after kernel 2.6.25. See 'man nfs'.
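For the record, a 'soft' NFSv4 mount along these lines might look like the fstab entry below. This is only a sketch: the server name, export path, and mountpoint are made up, and the timeo/retrans values are common choices to tune, not recommendations for this cluster. 'intr' is omitted since it is ignored on modern kernels:

```
# /etc/fstab -- hypothetical soft NFSv4.0 mount (names and values illustrative)
isilon.example.com:/ifs/scratch  /srv/scratch  nfs4  soft,timeo=150,retrans=3,vers=4.0  0 0
```

With 'soft', an unresponsive server eventually returns an I/O error to the application instead of leaving it in D state forever, at the data-integrity cost discussed above.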
Just adding a note here: we had this occur again this week with `soft` in place, and Slurm was able to properly kill and clean up the jobs that were hung on a specific file with the updated mount options. All signs point to an NFS issue with our Isilon OneFS storage, which is mounted NFS4 with vers=4.0. (Just throwing a lot of keywords in here in case anyone else has the same problem.) I still cannot force this to reproduce, and it remains difficult to debug due to the nature of the hanging jobs and the rarity of its occurrence.
Thank you, John, for this valuable information. Please don't hesitate to add anything you think is necessary here.

Regards,
Felip