Ticket 9160 - slurmstepd stuck in CG state
Summary: slurmstepd stuck in CG state
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 20.02.3
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-04 02:32 MDT by dl_support-hpc
Modified: 2020-07-01 12:03 MDT
CC List: 2 users

See Also:
Site: WEHI


Attachments
slurmd log on milton-med-001 (3.93 KB, text/plain), 2020-06-04 02:58 MDT, dl_support-hpc
compressed slurmctld log (14.39 MB, application/x-gzip), 2020-06-04 17:10 MDT, dl_support-hpc

Description dl_support-hpc 2020-06-04 02:32:17 MDT
Hi Support,

Recently I upgraded our system from 20.02.1 to 20.02.3. We are running our compute nodes on CentOS 7.7.

It was observed that on job cancellation the step processes are not removed on the compute node; they are only removed if I manually kill the slurmstepd parent process:

[root@milton-med-001 yum.repos.d]# systemctl status slurmd.service 
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-06-04 17:28:23 AEST; 10min ago
  Process: 6536 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
    Tasks: 11
   Memory: 15.5M
   CGroup: /system.slice/slurmd.service
           ├─ 8389 /usr/sbin/slurmd
           ├─26573 /usr/sbin/slurmd
           ├─26574 slurmstepd: [35656.extern]
           └─26579 sleep 100000000

Jun 04 17:28:22 milton-med-001.hpc.wehi.edu.au systemd[1]: Starting Slurm node daemon...
Jun 04 17:28:23 milton-med-001.hpc.wehi.edu.au systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jun 04 17:28:23 milton-med-001.hpc.wehi.edu.au systemd[1]: Started Slurm node daemon.
[root@milton-med-001 yum.repos.d]# kill -9 26573
[root@milton-med-001 yum.repos.d]# systemctl status slurmd.service 
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-06-04 17:28:23 AEST; 11min ago
  Process: 6536 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
    Tasks: 1
   Memory: 14.6M
   CGroup: /system.slice/slurmd.service
           └─8389 /usr/sbin/slurmd

and the job stays in the CG state:

-bash-4.2$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
             35656   regular rstudio- ghangas. CG       0:57      1 milton-med-001 

After manually killing the process the job is removed from the squeue output.

If I uninstall the 20.02.3 RPMs on the compute node and install the 20.02.1 ones, the problem does not manifest.

Is this similar to https://bugs.schedmd.com/show_bug.cgi?id=4733  ?

Thanks and Regards,
Laszlo
Comment 1 Dominik Bartkiewicz 2020-06-04 02:48:22 MDT
Hi

Could you send us slurmctld.log and slurmd.log from the affected node?

Dominik
Comment 2 dl_support-hpc 2020-06-04 02:58:54 MDT
Created attachment 14508 [details]
slurmd log on milton-med-001
Comment 3 dl_support-hpc 2020-06-04 03:00:26 MDT
Hi Dominik,

Attached is the compute node's slurmd log. The relevant slurmctld log entries for this job are below; the full file was too big to upload.

[root@milton-ctld1 slurm]# grep 35656 slurmctld.log 
[2020-06-04T17:36:03.799] _slurm_rpc_submit_batch_job: JobId=35656 InitPrio=1 usec=435
[2020-06-04T17:36:04.364] backfill: Started JobId=35656 in regular on milton-med-001
[2020-06-04T17:36:04.364] prolog_running_decr: Configuration for JobId=35656 is complete
[2020-06-04T17:37:01.834] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=35656 uid 4015
[2020-06-04T17:38:21.869] Resending TERMINATE_JOB request JobId=35656 Nodelist=milton-med-001
[2020-06-04T17:39:35.894] cleanup_completing: JobId=35656 completion process took 154 seconds
[2020-06-04T17:39:36.156] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=35656, this should never happen
[2020-06-04T17:39:36.156] slurmd error running JobId=35656 from node(s)=milton-med-001: Job credential revoked

Thanks and Regards,
Laszlo
Comment 4 Marshall Garey 2020-06-04 08:29:33 MDT
Can you please compress the slurmctld file and attach it?
Comment 5 dl_support-hpc 2020-06-04 17:10:11 MDT
Created attachment 14538 [details]
compressed slurmctld log
Comment 6 dl_support-hpc 2020-06-04 17:11:01 MDT
Hi Garey,

As requested.

Regards,
Laszlo
Comment 7 dl_support-hpc 2020-06-04 18:01:09 MDT
Hi Garey,

We also see the same behaviour with a different OS and Slurm version on the compute node; the controller and slurmdbd are on version 20.02.3.

[root@milton-med-001 ~]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[root@milton-med-001 ~]# rpm -qa | grep slurm
slurm-20.02.1-1.el7.centos.x86_64
slurm-libpmi-20.02.1-1.el7.centos.x86_64
slurm-pam_slurm-20.02.1-1.el7.centos.x86_64
slurm-perlapi-20.02.1-1.el7.centos.x86_64
slurm-slurmd-20.02.1-1.el7.centos.x86_64
[root@milton-med-001 ~]# squeue -u ghangas.r
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
             36090   regular rstudio- ghangas. CG       1:23      1 milton-med-001 
[root@milton-med-001 ~]# systemctl status slurmd.service 
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-06-05 09:50:53 AEST; 6min ago
  Process: 36545 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 36576 (slurmd)
    Tasks: 11
   Memory: 9.8M
   CGroup: /system.slice/slurmd.service
           ├─36576 /usr/sbin/slurmd
           ├─39470 /usr/sbin/slurmd
           ├─39471 slurmstepd: [36090.extern]
           └─39477 sleep 100000000

Jun 05 09:50:52 milton-med-001.hpc.wehi.edu.au systemd[1]: Starting Slurm node daemon...
Jun 05 09:50:53 milton-med-001.hpc.wehi.edu.au systemd[1]: Started Slurm node daemon.
Comment 10 Marshall Garey 2020-06-05 15:44:32 MDT
I haven't been able to reproduce it. I've gotten a little bit of information from the logs you uploaded, but I need more to really understand what's going on. I can say for certain this is not the same problem as bug 4733.

One thing I thought was interesting was that NHC failed for that node twice before the job took a long time to complete. Do you know why NHC failed here?


[2020-06-04T16:47:15.627] update_node: node milton-med-001 reason set to: NHC: check_ps_service:  Service atd (process atd) owned by root not running; start in progress
[2020-06-04T16:47:15.627] update_node: node milton-med-001 state set to DRAINED*
[2020-06-04T16:47:15.927] Node milton-med-001 now responding
[2020-06-04T16:47:15.927] validate_node_specs: Node milton-med-001 unexpectedly rebooted boot_time=1591253212 last response=1591253074


[2020-06-04T17:28:25.688] update_node: node milton-med-001 reason set to: NHC: check_ps_service:  Service atd (process atd) owned by root not running; start in progress
[2020-06-04T17:28:25.688] update_node: node milton-med-001 state set to DRAINED
[2020-06-04T17:28:25.969] validate_node_specs: Node milton-med-001 unexpectedly rebooted boot_time=1591255692 last response=1591255654


I also saw in the slurmctld log that the job wasn't stuck in "completing" indefinitely - it took 154 seconds, which is still a long time but not forever. Is that correct - that the job does eventually end? Does every single job in 20.02.3 take a long time to complete, or just some? And just to clarify, is that only when jobs are cancelled, or also when they end normally?

One common reason for jobs to take a long time to complete is if they're stuck on I/O, which is why we have the UnkillableStepTimeout and UnkillableStepProgram parameters in slurm.conf.
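
For example, a slurm.conf excerpt along these lines (the timeout value and script path here are illustrative only, adjust for your site):

# seconds to wait for a step's processes to die before marking them unkillable (default 60)
UnkillableStepTimeout=180
# optional program to run when that happens, e.g. to capture diagnostics
UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh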


You said you downgraded the slurmd and slurmstepd daemons back to 20.02.1. So are jobs still taking a long time to complete? I just want to make sure I understand what kind of state your cluster is in, since you marked it as severity 2. If the cluster is stable, would it be okay to mark this as sev-3? (Severity levels are described here: https://www.schedmd.com/support.php)


Can you run another test for me?
- Pick a node, start version 20.02.3 slurmd on that node
- Set SlurmctldDebug=debug2
- Set SlurmdDebug=debug5
- Restart slurmctld and slurmd on the node you want to run the test job on
- Run a test job specifically selecting that node, where we see the job taking a long time to complete (when cancelled)
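
In case it helps, a rough sketch of those steps as commands (node name, wrapped command, and job id are placeholders):

# slurm.conf, then restart the daemons:
SlurmctldDebug=debug2
SlurmdDebug=debug5

systemctl restart slurmctld   # on the controller
systemctl restart slurmd      # on the test node

# submit a job pinned to that node, then cancel it once it is running:
sbatch -w milton-med-001 --wrap='sleep 600'
scancel <jobid>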

When your test job is done, feel free to reset the debugging back to normal, the slurmd version back to 20.02.1, and upload the logs.

I know this was a lot of questions and requests; thanks for your patience and help in debugging this.
Comment 11 dl_support-hpc 2020-06-08 23:49:28 MDT
Hi Garey,

I spent most of the day trying to understand what is going on. I also had a chat with Chris Samuel about this; it looks like we are hitting the same bug as https://bugs.schedmd.com/show_bug.cgi?id=9195

After editing /etc/slurm/plugins/conf.d/tmpdir.conf and commenting out the line

#required        /etc/slurm/plugins/tmpdir/private-tmpdir.so base=/stornext/HPCScratch/milton/jobs mount=/var/tmp mount=/tmp

and restarting slurmd, I can run jobs as before. The plugin was compiled for this Slurm version.

Any advice on how this can be fixed? We have a deadline at the end of the month, or I will have to find a different version that works with the plugin.

If you need any logs please let me know.

Appreciate your help on this.

Thanks,
Laszlo
Comment 12 Marshall Garey 2020-06-09 08:54:44 MDT
Nice find. Is that a SPANK plugin? Can you upload the plugin code? Can you also upload your entire plugstack.conf, slurm.conf, and tmpdir.conf? Is your tmpdir.conf included by plugstack.conf, or did you change the name by setting PlugStackConfig in slurm.conf?
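
For reference, the two usual arrangements look roughly like this (the paths and glob pattern are examples, not your actual config):

# plugstack.conf pulling in per-plugin fragments such as tmpdir.conf:
include /etc/slurm/plugins/conf.d/*.conf

# slurm.conf pointing explicitly at the SPANK config file:
PlugStackConfig=/etc/slurm/plugstack.conf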

I will consult with my coworkers on 9195; it looks like we've got some leads there. Right now I suspect this is a regression based on some work done for configless mode with SPANK.
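
For context, a SPANK plugin like private-tmpdir is just a shared object built against Slurm's spank.h. A minimal skeleton, purely for illustration (not the actual private-tmpdir code), looks something like this:

/* example_tmpdir.c - generic SPANK plugin skeleton, illustration only
 * build: gcc -shared -fPIC -o example_tmpdir.so example_tmpdir.c */
#include <slurm/spank.h>

SPANK_PLUGIN(example_tmpdir, 1);

/* called when the plugin stack is loaded; av[] holds the arguments
 * given on the plugin's line in plugstack.conf (e.g. base=... mount=...) */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    for (int i = 0; i < ac; i++)
        slurm_info("example_tmpdir: option %d: %s", i, av[i]);
    return ESPANK_SUCCESS;
}

/* called in the slurmstepd context for each task; a private-tmpdir
 * style plugin would set up its per-job /tmp around this point */
int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
    return ESPANK_SUCCESS;
}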
Comment 13 Chris Samuel (NERSC) 2020-06-09 10:10:01 MDT
Hey Marshall,

Jason updated our bug on this with this info that might be useful:

----------------------8< snip snip 8<----------------------

Hi Chris, this looks like it may be a duplicate of bug #9081, comment #5.

As Dominik noted there:

>I think I found the source of this regression.
>You need to directly set PlugStackConfig in slurm.conf.
>Let me know if this helps.

We are still looking into fixing this in bug #9081. Would you try the above workaround and let us know if this changes the behavior for you so that we can confirm the issue?

----------------------8< snip snip 8<----------------------

Can't be sure it's related, but given we're both suffering from plugin issues on 20.02, it's worth a shot I guess.

All the best,
Chris
Comment 14 Marshall Garey 2020-06-09 10:26:00 MDT
Thanks Chris for talking with Laszlo and figuring that out. I agree that it's the same issue.

The short of it is that it was a regression from commit abd4db776c7 in 20.02.2, so for now you're required to set PlugStackConfig in slurm.conf until we fix that regression. I'm marking this as a duplicate of bug 9081.

Laszlo, please re-open this bug if setting PlugStackConfig in slurm.conf doesn't fix the issue.

*** This ticket has been marked as a duplicate of ticket 9081 ***
Comment 15 dl_support-hpc 2020-06-09 17:22:59 MDT
Hi Garey,

Thank you for looking into this, but I think the problem still persists. I have had the following

[root@milton-ctld1 ~]# grep stack /etc/slurm/slurm.conf 
PlugStackConfig=/etc/slurm/plugstack.conf


in my config all the time.

Please let me know which files you need.

Regards,
Laszlo
Comment 16 Marshall Garey 2020-06-10 08:09:08 MDT
Everything from comment 12:

Can you upload slurm.conf, plugstack.conf, tmpdir.conf, and the plugin code?

Can you also run the experiment and answer the other questions from comment 10?
Comment 17 Marshall Garey 2020-06-22 08:35:26 MDT
Just following up - Are you able to get me the requested information?
Comment 18 Marshall Garey 2020-07-01 12:03:29 MDT
I'm closing this as timedout. Please re-open it when you're able to upload the requested information.