Bug 3535 - Jobs stuck in "launch failed requeued held"
Summary: Jobs stuck in "launch failed requeued held"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.0
Hardware: Linux
Severity: 2 - High Impact
Assignee: Danny Auble
 
Reported: 2017-03-03 16:38 MST by IT
Modified: 2017-03-10 13:19 MST

Site: Altius Institute


Attachments
requested files (6.05 MB, application/zip)
2017-03-03 16:59 MST, IT
show-notes.txt (15.81 KB, text/plain)
2017-03-03 18:06 MST, IT

Description IT 2017-03-03 16:38:04 MST
We have hundreds of jobs stuck in a "launch failed requeued held" state and we are not able to "release" them. So, two questions: what does this status mean, and how do we move them back to a runnable state?
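
For context, this is roughly how we are listing the stuck jobs and attempting a release (a sketch, not verbatim; the squeue format string is just illustrative):

# show pending jobs with their reason (the stuck ones show "launch failed requeued held")
squeue -t PENDING -o "%.12i %.10u %.40r"

# attempt to release one of them
scontrol release <jobid>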
Comment 1 Danny Auble 2017-03-03 16:46:17 MST
Hey Bill, this usually means the launch failed on the compute nodes for one reason or another; instead of relaunching the job on the node and draining the queue, Slurm holds the job for further review by the user or the admin.

What I would do is look at the slurmd log from one of the nodes where the job ran and see why the job failed.

A common mistake is that the spool dir isn't owned by user root (or the slurmd isn't run by user root).  But I am guessing this is a new thing and most other jobs have run with no issue.
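
If it helps, a quick way to check both of those (a sketch; /var/spool/slurmd is only the usual default, use whatever SlurmdSpoolDir is set to in your config):

# find the configured spool directory
scontrol show config | grep -i SlurmdSpoolDir

# confirm ownership of that directory (substitute the path reported above)
ls -ld /var/spool/slurmd

# confirm which user the slurmd is running as
ps -o user= -C slurmd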

If you could look at the log and see if you can find something let me know.

If you can't easily see anything please attach the slurmd and slurmctld logs and I can look and see if I can find something.

If you can also attach your slurm.conf file that would be helpful as well.
Comment 2 IT 2017-03-03 16:59:30 MST
Created attachment 4152 [details]
requested files
Comment 3 IT 2017-03-03 17:00:21 MST
Will review the slurm.log myself in a minute... thanks
Comment 4 Danny Auble 2017-03-03 17:37:03 MST
Bill, this appears to be your problem...

[2017-03-03T15:45:12.068] error: _unpack_batch_job_launch_msg: protocol_version 0 not supported
[2017-03-03T15:45:12.068] error: Malformed RPC of type REQUEST_BATCH_JOB_LAUNCH(4005) received
[2017-03-03T15:45:12.068] fatal: slurmstepd: we didn't unpack the request correctly

It looks like this started shortly after you restarted the slurmd today

[2017-03-03T14:13:03.206] slurmd version 16.05.0 started
[2017-03-03T14:13:03.206] Warning: revoke on job 1339621 has no expiration
[2017-03-03T14:13:03.208] slurmd started on Fri, 03 Mar 2017 14:13:03 -0800
[2017-03-03T14:13:03.208] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=515579 TmpDisk=1525438 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2017-03-03T14:13:05.212] error: If munged is up, restart with --num-threads=10
[2017-03-03T14:13:05.212] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2017-03-03T14:13:05.212] error: authentication: Socket communication error
[2017-03-03T14:13:05.212] error: Unable to register: Protocol authentication error

It appears munge wasn't up, and that caused the registration message to fail.  It appears to come up later and all is good (authentication-wise, anyway).

My guess is that for some reason the job was sent over with a bad protocol_version, and the slurmd doesn't know how to unpack the message.

This seems fairly familiar, and I am sure this is fixed in a later version of 16.05, but I haven't found the exact commit yet.

One thing you might try is restarting the slurmd on the node causing the issue and see if that fixes things.

You should be able to see the version the slurmctld thinks each slurmd is running with

scontrol show node

Let me know if that helps things out.  If you could send me that output that would be great as well.
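
For example, one quick way to pull just the node names and reported versions out of that output (a sketch; the grep pattern is an assumption about the exact field layout):

scontrol show node | grep -E "NodeName=|Version="

Any node reporting an unexpected Version= (or none at all) is a good candidate for a slurmd restart.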

You should also try to sync your slurm.conf files across your nodes.  It appears from your slurmctld log that many, if not all, are out of sync.
Comment 5 IT 2017-03-03 17:41:30 MST
I think the issue was triggered (~14:40) by a messed-up slurm.conf that has since been fixed but not reloaded. I am working through the nodes by hand (just to be careful), restarting munge and slurmd, and it looks like there is improvement. I am not sure it is 100%, but it seems better. Will update you again soon.


Comment 6 IT 2017-03-03 18:06:13 MST
Created attachment 4154 [details]
show-notes.txt

I have manually/carefully restarted munge and slurmd on all the compute nodes and things are looking better. I need to get the users to push it a little more to verify... Here is the output you asked for. I think it was (as you suggested) bad timing on pushing updates out and restarts...

Comment 7 IT 2017-03-03 18:24:36 MST
Danny,
Thank you. It looks like things are working again. I will monitor it over the weekend. Could we keep this ticket open for a couple of days, just in case?
I do plan to upgrade very soon, but we are coming off a data center move and are still trying to stabilize things and clean up the backlog...
Thanks again.

Comment 8 Danny Auble 2017-03-03 18:30:03 MST
No problem Bill. Glad it was an easy one. I'll ping you next week if you haven't already closed the bug. Have a good weekend.
Comment 9 IT 2017-03-04 10:04:43 MST
Danny, 

As I am looking back through the logs to understand the root cause of yesterday's problem and learn from my mistakes, it looks to me like the issue was how I tried to implement configuration changes in the slurm.conf file. The changes being made were associated with bringing up a GPU node and adding all the needed plugins/options for gres/gpu.

I pushed the slurm.conf file and then did a SIGHUP to 'load' it (on each node). It seems that was not sufficient, and I needed to either issue an 'scontrol reconfigure' centrally or restart the slurmd daemon on each node. It seems that I got an incomplete reload of the daemons, which eventually caused them to go defunct, nodes going to a 'completing' state, and jobs going to a 'launch failed requeued held' state.

From what you see in the logs and your experience with SLURM, does this sound like the likely chain of events? It seems like I need to adjust my procedures to not use SIGHUP but rather use scontrol reconfigure, correct?
Comment 10 Danny Auble 2017-03-09 10:15:14 MST
Hey Bill, yes, the chain of events you describe seems like what happened.  In my experience you will need to restart the slurmd if you ever:

- Change a plugin type or add one
- Change anything on a node (resources or GRES)

If you change anything on a partition or a *Parameter, most likely a SIGHUP or 'scontrol reconfigure' will suffice.  We are working on a document that will hopefully help with this.
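
In practice that path is just the following (a sketch; it assumes slurm.conf has already been copied to every node, and the grep is only a spot check):

# after syncing slurm.conf everywhere, from the controller:
scontrol reconfigure

# spot-check that the running config picked up the change
scontrol show config | grep -i SchedulerType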

We are also looking into implementing an 'scontrol restart' function that will just restart everything for you.  I am not sure when that will be, though.  When in doubt, restarting is usually the best option, as it is rather quick and shouldn't cause much headache.

You can always scan the slurmctld and slurmd logs for errors that might lead you to believe something is amiss.  The most glaring one is this:

error: Node hpcA12 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

That is a very clear note that the node isn't running the same slurm.conf file as the slurmctld.
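
A quick way to scan for that (a sketch; the log path is an assumption, use whatever SlurmctldLogFile points at in your slurm.conf):

grep -i "different slurm.conf" /var/log/slurm/slurmctld.log

# broader sweep for anything else amiss
grep -E "error:|fatal:" /var/log/slurm/slurmctld.log | tail -n 50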
Comment 11 Danny Auble 2017-03-09 12:16:56 MST
On another note, I have some suggestions for your slurm.conf.

I would suggest you change

-SchedulerType=sched/builtin
+SchedulerType=sched/backfill

unless you are only looking for FIFO behavior.

-TaskPlugin=task/none
+TaskPlugin=affinity,cgroup

And make sure TaskAffinity is turned off or not set in your cgroup.conf.  This will give your applications affinity against the CPUs on the nodes.  If you want to bind memory and such, please consult https://slurm.schedmd.com/cgroup.conf.html
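
Purely as an illustration, a cgroup.conf consistent with that would look something like this (a sketch; which Constrain* options you want is site-specific, the key point is that TaskAffinity stays off when TaskPlugin already provides affinity):

# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no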

-SlurmctldDebug=6 (i.e. debug2)
+SlurmctldDebug=info or SlurmctldDebug=debug (3 or 5 if you prefer numbers)

6 is rather high and can sometimes hide the errors under all those debug2 messages.  You can always turn it up if you need more debug later, but I will warn that most debug messages beyond the "debug" level aren't usually helpful.  We have a bunch of DebugFlags that can be turned on for different subsystems, which should allow for easier debugging than increasing the general debug level.

You shouldn't need AccountingStoragePass in your slurm.conf.  It is only needed in your slurmdbd.conf.

I would also warn that if you are using the database for limits, associations, or QOS, you aren't currently enforcing any of those.

AccountingStorageEnforce=associations,limits,qos

That setting is needed to make sure people run where they should.  If you don't care, then this is set correctly.  I would also add the "safe" flag to that list to avoid starting jobs that won't finish before they breach an accounting time limit.  Again, if you don't use limits and don't ever plan to, then this isn't needed.

You can also compress your node names if you desire, but that is mostly aesthetic, e.g.:

NodeName=hpcA[01-16] NodeAddr=10.174.2.[43-58]
NodeName=hpcB[01-07] NodeAddr=10.174.2.[59-60,21,29,28,27,26,25]

Other than that, things look fairly good.
You don't need a NodeName entry for any node you don't plan on putting in a partition; only compute nodes need that.  So I would expect sched0, sched1, and workbench could be removed.
Comment 12 Danny Auble 2017-03-10 12:57:19 MST
Bill, do you need anything else on this, or can we close it?  If you would like more guidance on configuration, please open a new bug.

Thanks!
Comment 13 IT 2017-03-10 13:02:29 MST
Thanks. I appreciate the help and the extras you passed along. Please feel free to close this.

Comment 14 Danny Auble 2017-03-10 13:19:12 MST
Excellent, glad I could help.