Bug 3941 - Nodes being drained due to "Kill task failed"
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.6
Hardware: Linux
Importance: 3 - Medium Impact
Assignee: Director of Support
Duplicates: 3958
 
Reported: 2017-06-29 13:04 MDT by David Matthews
Modified: 2018-08-15 12:31 MDT
CC List: 7 users

Site: Met Office
Version Fixed: 17.02.7


Attachments
slurm.conf (1.66 KB, text/plain), 2017-07-06 10:44 MDT, David Matthews
slurmd.log (101.16 KB, text/x-log), 2017-07-06 10:51 MDT, David Matthews
Remove state check from UnkillableProgram (632 bytes, patch), 2017-07-18 10:21 MDT, Tim Shaw

Description David Matthews 2017-06-29 13:04:22 MDT
We've just upgraded to 17.02.5 (from 15.08.7).
We're now having a problem where nodes are being drained with reason "batch job complete failure".

Example slurmctld.log entries:
[2017-06-29T12:28:40.004+01:00] error: slurmd error running JobId=3302451 on node(s)=expspicesrv003: Kill task failed
[2017-06-29T12:28:40.004+01:00] drain_nodes: node expspicesrv003 state set to DRAIN
[2017-06-29T12:28:40.004+01:00] job_complete: JobID=3302451 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2017-06-29T12:28:40.004+01:00] job_complete: JobID=3302451 State=0x8005 NodeCnt=1 done
Corresponding slurmd.log entries:
[2017-06-29T12:28:40.000+01:00] [3302451] error: *** JOB 3302451 STEPD TERMINATED ON expspicesrv003 AT 2017-06-29T12:28:40 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-06-29T12:28:40.000+01:00] [3302451] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 256
[2017-06-29T12:28:40.003+01:00] [3302451] done with job

What does this error mean and why are the nodes being drained as a result?

I can see this error is new in 17.02 but I'm unclear what its purpose is or what triggers it:
https://github.com/SchedMD/slurm/commit/f18390e81766a46b50ffa08a8cf1b7946ecdbf90
Comment 3 Tim Shaw 2017-06-29 16:32:46 MDT
From the log entries you provided, it looks like the slurmd had a problem with a job step that didn't end correctly, and the slurmd wasn't able to kill it after the timeout was reached.  Because of this, the node is drained to give admins the opportunity to manually intervene and determine if there is a problem that needs to be addressed, rather than continue running jobs on a potentially problematic node.  That's the reasoning behind the draining of the node in this situation.

When you're ready, you can bring the node back online for jobs with:

scontrol update nodename=expspicesrv003 state=resume

Please let me know if you have further questions.

Tim
Comment 4 David Matthews 2017-06-30 07:21:09 MDT
What could cause this? I can't see any evidence of issues on the nodes. 
Given that this is a new feature in Slurm, what would the behaviour have been in 15.08?

We had 4 nodes drained yesterday with this problem so I certainly need to get this fixed.
Comment 5 David Matthews 2017-07-04 06:54:20 MDT
We are continuing to have problems.
At the moment I have no idea how to prevent this.

We did not have this problem with 15.08. So (as already asked), could you please explain how 15.08 would have behaved in this situation?

Are you saying that slurm will now always drain a node if the job step can't be killed? Is this related to the "UnkillableStepTimeout" setting?

Previously we would see occasional errors such as:
proctrack_p_wait: Unable to destroy container 44825 in cgroup plugin, giving up after 128 sec
I think this happens when processes get stuck in uninterruptible sleep due to transient I/O issues on one of our network file systems. I haven't seen this error message since the upgrade (although that message still exists in the slurm code) but is that because slurm is now giving up much earlier? In the most recent case I checked I can see that the cgroup of the job that was terminated has not been cleaned up.
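Processes stuck in uninterruptible sleep show up with a STAT beginning with "D" in ps output. A generic sketch (not from this thread) for spotting them on a suspect node:

```shell
# Filter "PID STAT WCHAN COMM" lines, keeping processes in
# uninterruptible sleep (STAT starting with "D").
list_dstate() {
    awk '$2 ~ /^D/'
}

# Typical use on a suspect node:
#   ps -eo pid,stat,wchan:32,comm | list_dstate
```

The WCHAN column then hints at which kernel wait (often an NFS or other I/O path) the process is blocked in.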
Comment 6 Tim Shaw 2017-07-05 14:00:23 MDT
*** Bug 3958 has been marked as a duplicate of this bug. ***
Comment 7 Tim Shaw 2017-07-05 14:24:37 MDT
David,

I believe in 15.08 you would get the error messages and it would spend time trying to clean up, but eventually it would just move on.  So 17.02 is exposing the problem more by draining the node.

A few suggestions:  First, set UnkillableStepTimeout to 120.  I noticed in the code that the step timeout defaults to 60 seconds but the proctrack cgroup timeout is 120 seconds, so let's make them the same to get them in sync.

Second, set your SlurmdDebug=debug2.  This will provide a lot more details as to the problem trying to kill the process and the specific pid.  That will hopefully shed more light on the situation.  This will also produce a much larger, verbose log file so you'll want to set it back to your previous log level as soon as you catch the problem in the log file.
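Taken together, the two suggested settings could look like this in slurm.conf (the values are from this comment; the comments are illustrative):

```
# slurm.conf fragment (values from comment 7)
UnkillableStepTimeout=120   # match proctrack/cgroup's 120-second timeout
SlurmdDebug=debug2          # temporary; revert after capturing a failure
```

followed by "scontrol reconfigure" to push the change out.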

Lastly, could you attach your slurm.conf to the bug just so we can refer to it if needed.

Thanks.

Tim
Comment 8 David Matthews 2017-07-06 10:44:43 MDT
Created attachment 4887 [details]
slurm.conf
Comment 9 David Matthews 2017-07-06 10:51:30 MDT
Created attachment 4888 [details]
slurmd.log
Comment 10 David Matthews 2017-07-06 10:59:20 MDT
I've just had the first failure after making the changes you suggested. I've attached the slurmd.log contents from the time of job submission until the failure. I've also attached our slurm.conf.

The job in question was cancelled by the user. Relevant slurmctld.log entries:
[2017-07-06T16:08:14.204+01:00] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 3528793 uid 10895
[2017-07-06T16:09:49.787+01:00] Resending TERMINATE_JOB request JobId=3528793 Nodelist=expspicesrv018
[2017-07-06T16:10:15.003+01:00] error: slurmd error running JobId=3528793 on node(s)=expspicesrv018: Kill task failed
[2017-07-06T16:10:15.285+01:00] cleanup_completing: job 3528793 completion process took 121 seconds
Comment 11 David Matthews 2017-07-06 11:06:29 MDT
(In reply to Tim Shaw from comment #7)
> I believe in 15.08 you would get the error messages and it would spend the
> time trying to clean up but eventually it would just move on.
What error messages do you mean? The "STEPD TERMINATED DUE TO JOB NOT ENDING WITH SIGNALS" message did not exist in 15.08.

> A few suggestions:  First, set UnkillableStepTimeout to 120.  I noticed in
> the code that the step timeout defaults to 60 seconds but the proctrack
> cgroup timeout is 120 seconds, so let's make them the same to get them in
> sync.
Note that I'm still not getting any message from proctrack_p_wait. However, that message mentioned 128s, not 120, so perhaps we didn't increase UnkillableStepTimeout enough?
Comment 12 Tim Shaw 2017-07-06 13:23:23 MDT
Looking at your log, you'll see the entry:

killing process 29865 (inherited_task) with signal 9

many times.  This is the stuck process that the slurmd isn't able to kill.  And the cgroup can't be destroyed because the process is still running inside of it:

_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_10895/job_3528793/step_batch: Device or resource busy

Why the process is getting stuck is hard to say at this point.  You'll want to contact the user that submitted the job and find out details about the job and submission.  Could it be stuck waiting on I/O?  Finding out more about the nature of the job may reveal details as to why it's getting stuck.
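For inspecting a stuck pid like the 29865 in the log, something along these lines shows what the kernel is waiting on (a generic sketch; reading /proc/<pid>/stack usually needs root):

```shell
# Show what a (possibly stuck) process is blocked on in the kernel.
inspect_pid() {
    local pid=$1
    echo "wchan: $(cat /proc/$pid/wchan 2>/dev/null || echo unavailable)"
    echo "state: $(awk '/^State:/ {print $2, $3}' /proc/$pid/status 2>/dev/null)"
    cat /proc/$pid/stack 2>/dev/null || echo "stack: unavailable (needs root)"
}
```

A state of "D (disk sleep)" together with an NFS-related wchan would point at the transient I/O issues David describes.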

>What error messages do you mean? The "STEPD TERMINATED DUE TO JOB NOT ENDING WITH SIGNALS" message did not exist in 15.08.

I'm not referring to any specific error message, although the message you provided:

proctrack_p_wait: Unable to destroy container 44825 in cgroup plugin, giving up after 128 sec

seems to stem from the same problem.  And rather than doing anything about it, older versions of slurmd would just ignore it and move on.

>Note that I'm still not getting any message from proctrack_p_wait. However, that message mentioned 128s not 120 so perhaps we didn't increase UnkillableStepTimeout enough?

Possibly.  I don't think it would make a difference.

Regards.

Tim
Comment 13 David Matthews 2017-07-07 06:51:47 MDT
OK. I'll do some further investigation and report back.
Comment 14 David Matthews 2017-07-10 07:22:18 MDT
I'm trying to use UnkillableStepProgram to gather information when the problem occurs. However, as far as I can see this is not working. Presumably this should still work and I should be able to see evidence of it running (with SlurmdDebug=debug2)?
Comment 15 Tim Shaw 2017-07-11 09:09:51 MDT
Yes, UnkillableStepProgram should be called if the process is determined to be unkillable and there should be entries noting it in the log file.  I'm currently working on trying to reproduce this issue myself.  I'll let you know what I find.
Comment 16 Tim Shaw 2017-07-11 11:29:24 MDT
David, I wasn't able to reproduce a problem with running the UnkillableStepProgram.  Once my UnkillableStepTimeout expired, my UnkillableStepProgram ran and there was this entry in the log:

debug:  step_terminate_monitor: unkillable after 60 sec, calling: /home/tshaw/test/3941/unkillable_program.sh

So I don't see an issue with it.  Look to see if you see that log entry in your slurmd log as well.  

Also, when you see a drained node and log in to investigate, do you find the unkillable process still running or has it exited on its own by then?  You can find the pid in the log on this line:

debug2: killing process 29865 (inherited_task) with signal 9

If the process does eventually run to completion on its own, you may just need to increase the UnkillableStepTimeout value.  That option is the amount of time, in seconds, that Slurm will wait before deciding a process is unkillable.  That way, if the problem is I/O related and the process becomes responsive later, it will be killed without the node draining.  I would try to determine what that amount of time is and then set UnkillableStepTimeout to something higher than that time.  Or, if that's too hard to catch, try increasing the value incrementally until the drainings no longer occur.

Hope that helps.
Comment 17 David Matthews 2017-07-13 11:39:39 MDT
Tim, I don't see that debug line in my logs.

I'm confused by the code.
It looks like the UnkillableStepProgram won't be run if
job->state != SLURMSTEPD_STEP_RUNNING
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/step_terminate_monitor.c#L178

But job->state appears to be set to SLURMSTEPD_STEP_ENDING before step_terminate_monitor is ever started?
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/mgr.c
So how will UnkillableStepProgram ever run?
Apologies if I'm misunderstanding the code.

How did you manage to test this?
Can you suggest any way to replicate the problem?

We're fairly certain all our problems are I/O related so I'm assuming we have processes stuck in uninterruptible sleep.
Comment 18 Tim Shaw 2017-07-14 11:00:00 MDT
Before I respond to your last questions, I wanted to follow up on my last response.  Have you had a chance to increase your UnkillableStepTimeout?  Have you seen any improvement for doing so?  Assuming your stuck processes do eventually free up, this will prevent any further drainings from occurring.  Let me know how that goes.

(In reply to David Matthews from comment #17)
> Tim, I don't see that debug line in my logs.
> 
> I'm confused by the code.
> It looks like the UnkillableStepProgram won't be run if
> job->state != SLURMSTEPD_STEP_RUNNING
> https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/
> step_terminate_monitor.c#L178
> 
> But job->state appears to be set to SLURMSTEPD_STEP_ENDING before
> step_terminate_monitor is ever started?
> https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/mgr.c
> So how will UnkillableStepProgram ever run?
> Apologies if I'm misunderstanding the code.

I'm assuming you're referring to the logic in mgr.c:1332.  The code there gets called at the beginning of the step and probably hasn't hit the problem yet, so the state has not been set to SLURMSTEPD_STEP_ENDING at that point.  If it had, you would also see these entries in the log file:

Before call to spank_fini()
After call to spank_fini()

> How did you manage to test this?
> Can you suggest any way to replicate the problem?

Because I couldn't create a process that could be unkillable, I tested this by creating a job script that would trap and ignore SIGINT & SIGTERM signals and then sleep for a really long time.  Then, for testing purposes only, I temporarily changed the stepd code to send a SIGTERM rather than a SIGKILL (proctrack_cgroup.c:539) and recompiled.  Then I submitted the job with a really short walltime so that the job was sure to run past it and the slurmstepd would be sure to try to kill the long-running job:

sbatch -t 00:05 job.sh

That simulated a stuck job because my script was ignoring the SIGTERM signals sent to it by the slurmd.  I'm not sure how helpful this will be for you but that is how I tested it.
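Minus the one-line stepd change (which has to be made in the source), Tim's recipe amounts to a job script along these lines (a sketch, not the exact script he used):

```shell
#!/bin/bash
# job.sh: ignore the "polite" signals so the step outlives its walltime.
# (Reproducing the drain also requires Tim's temporary change in
# proctrack_cgroup.c to send SIGTERM instead of SIGKILL, since a plain
# SIGKILL cannot be trapped or ignored.)
trap '' INT TERM
sleep 86400
```

Submitted with a walltime the sleep is sure to exceed, e.g. `sbatch -t 00:05 job.sh`.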
 
> We're fairly certain all our problems are I/O related so I'm assuming we
> have processes stuck in interruptible sleep.

I'm curious to know if the stuck processes ever free up.  If they don't ever free up, at least the processes are there to inspect and do whatever investigation is needed (like an strace).

Tim
Comment 19 Ole.H.Nielsen@fysik.dtu.dk 2017-07-16 02:50:13 MDT
+1 for this problem on 17.02.6.  We've had a handful of nodes drained for this reason since upgrading to 17.02.6.

We have Cgroups enabled.  No NFS problems (that I know of).  No stray user processes left behind (hours later).  Nothing in the syslog around the time of the error, no OOM-killer messages.

This user has previously run jobs exceeding the memory allocation, so I guess the job got killed for that reason.

I've increased UnkillableStepTimeout to 150 now and run "scontrol reconfig".  Is there any need to restart slurmd on the nodes as well?

The slurmd.log for this job leading up to the draining of the node at 22:30:08:

[2017-07-15T20:46:11.980] task_p_slurmd_batch_request: 121567
[2017-07-15T20:46:11.980] task/affinity: job 121567 CPU input mask for node: 0xFFFFFF
[2017-07-15T20:46:11.980] task/affinity: job 121567 CPU final HW mask for node: 0xFFFFFF
[2017-07-15T20:46:11.981] _run_prolog: run job script took usec=4
[2017-07-15T20:46:11.981] _run_prolog: prolog with lock for job 121567 ran for 0 seconds
[2017-07-15T20:46:11.981] Launching batch job 121567 for UID 221341
[2017-07-15T20:46:12.016] [121567] task/cgroup: /slurm/uid_221341/job_121567: alloc=192000MB mem.limit=192000MB memsw.limit=192000MB
[2017-07-15T20:46:12.016] [121567] task/cgroup: /slurm/uid_221341/job_121567/step_batch: alloc=192000MB mem.limit=192000MB memsw.limit=192000MB
[2017-07-15T20:46:12.046] [121567] task_p_pre_launch: Using sched_affinity for tasks
[2017-07-15T22:28:09.672] [121567] error: *** JOB 121567 ON x073 CANCELLED AT 2017-07-15T22:28:09 ***
[2017-07-15T22:28:10.692] [121567] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-07-15T22:28:10.695] [121567] done with job
[2017-07-15T22:28:32.088] task_p_slurmd_batch_request: 123236
[2017-07-15T22:28:32.088] task/affinity: job 123236 CPU input mask for node: 0xFFFFFF
[2017-07-15T22:28:32.088] task/affinity: job 123236 CPU final HW mask for node: 0xFFFFFF
[2017-07-15T22:28:32.089] _run_prolog: run job script took usec=4
[2017-07-15T22:28:32.089] _run_prolog: prolog with lock for job 123236 ran for 0 seconds
[2017-07-15T22:28:32.089] Launching batch job 123236 for UID 24803
[2017-07-15T22:28:32.126] [123236] task/cgroup: /slurm/uid_24803/job_123236: alloc=256000MB mem.limit=256000MB memsw.limit=256000MB
[2017-07-15T22:28:32.126] [123236] task/cgroup: /slurm/uid_24803/job_123236/step_batch: alloc=256000MB mem.limit=256000MB memsw.limit=256000MB
[2017-07-15T22:28:32.150] [123236] task_p_pre_launch: Using sched_affinity for tasks
[2017-07-15T22:29:09.185] [123236] Defering sending signal, processes in job are currently core dumping
[2017-07-15T22:30:08.033] [123236] error: *** JOB 123236 STEPD TERMINATED ON x073 AT 2017-07-15T22:30:07 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-07-15T22:30:08.040] [123236] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 10752
[2017-07-15T22:30:08.047] [123236] done with job
Comment 20 Ole.H.Nielsen@fysik.dtu.dk 2017-07-16 03:17:36 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #19)
> +1 for this problem on 17.02.6.  We've had a handful of nodes drained for
> this reason since upgrading to 17.02.6.
> 
> We have Cgroups enabled.  No NFS problems (that I know of).  No stray user
> processes left behind (hours later).  Nothing in the syslog around the time
> of the error, no OOM-killer messages.
> 
> This user has previously run jobs exceeding the memory allocation, so I
> guess the job got killed for that reason.

A likely explanation for our issue is large timeouts when terminating the job: 

The user's job is producing huge core dump files of the order of 9 GB.  These large files could take minutes to write onto a busy NFS server over the compute node's 1 Gbit Ethernet link.

It would make sense to me to kill the user process even during the core dumping stage, if possible.
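One site-level mitigation for this particular failure mode (not suggested anywhere in the thread; whether it is acceptable is a local policy decision) is to cap core file size in the job environment, so a crashing task cannot spend minutes writing 9 GB over NFS:

```shell
# In a job script or a site prolog/taskprolog: disable core files for
# this shell and its children. Hypothetical mitigation, not from the
# thread; debugging workflows that need cores will lose them.
ulimit -c 0
```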
Comment 21 Ole.H.Nielsen@fysik.dtu.dk 2017-07-17 05:36:08 MDT
We're seeing more of these nodes drain events today.  But even though I've increased the timeout:

# scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 150 sec

and run "scontrol reconfig", the slurmd on a compute node still times out after 60 seconds; see this slurmd.log:

[2017-07-17T13:21:26.764] [123429] error: *** JOB 123429 ON g034 CANCELLED AT 2017-07-17T13:19:27 DUE TO TIME LIMIT ***
[2017-07-17T13:22:30.284] [123429] error: *** JOB 123429 STEPD TERMINATED ON g034 AT 2017-07-17T13:22:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-07-17T13:22:30.284] [123429] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2017-07-17T13:22:30.286] [123429] done with job

Question: Must we restart the slurmd daemons in order to reconfigure the UnkillableStepTimeout?

Note added: In order to get an alert for nodes that get drained, I've created a Slurm trigger as described in https://wiki.fysik.dtu.dk/niflheim/SLURM#slurm-trigger-information (changing the state from "down" to "drained").
Comment 22 David Matthews 2017-07-17 09:20:44 MDT
Tim, thanks for the info on how you tested this.

I still don't understand why I can't get the UnkillableStepProgram to run - I'll try to add some extra diagnostics and let you know what I find.

I've tried increasing UnkillableStepTimeout but that hasn't really helped. When we have I/O issues the problem can last for an extended period. The processes do eventually free up, I assume (I haven't actually managed to get any firm evidence of what's happening because I can't get UnkillableStepProgram to run).

I think there are several points to consider here:

1) Is the new behaviour of Slurm sensible? Previously you had a choice whether to drain the node or not (via the UnkillableStepProgram); now it happens regardless. However, I can see that draining the node is probably not a bad thing to do in most cases (although I would prefer a more specific reason than "batch job complete failure").

2) Assuming the new behaviour is retained, I think it needs documenting. For example, the documentation for UnkillableStepTimeout
only mentions that UnkillableStepProgram will be executed.
https://slurm.schedmd.com/slurm.conf.html
It ought to mention that, if this timeout is reached, the node will also be drained with reason "batch job complete failure".

3) I still need to work out why I can't run UnkillableStepProgram.

Hope that makes sense.
Comment 23 Tim Shaw 2017-07-17 13:04:39 MDT
David,

Check to make sure that the UnkillableStepProgram is on the compute node's local file system (or accessible on an NFS share) and that you've reconfigured or restarted the slurmd process.  Also, do you see any error or reference to "unkillable" in the slurmd.log?  If not, run "scontrol show config" on the compute node:

# scontrol show config | grep Unkill
UnkillableStepProgram   = /home/tshaw/test/3941/unkillable_program.sh
UnkillableStepTimeout   = 60 sec

Perhaps it just hasn't picked up the slurm.conf change.

Regarding your points:
 
> 1) Is the new behaviour of slurm sensible? Previously you had a choice
> whether to drain the node or not (via the UnkillableStepProgram) - now it
> happens regardless. However, I can see that draining the node is probably
> not a bad thing to do in most cases (although I would prefer a more specific
> reason than "batch job complete failure").
> 

Understood.  This change was implemented to protect against scheduling any jobs on a node where the state is suspect, but we're open to recommendations from customers on how to make the problem more obvious.  I'll see what's involved with doing that.

> 2) Assuming the new behaviour is retained, I think it need documenting. For
> example, the documentation for UnkillableStepTimeout
> only mentions that UnkillableStepProgram will be executed.
> https://slurm.schedmd.com/slurm.conf.html
> It ought to mention that, if this timeout is reached, the node will also be
> drained with reason "batch job complete failure".

Agreed.  I put together a doc patch for this as well.

Tim
Comment 25 David Matthews 2017-07-18 09:41:59 MDT
By adding some extra debug statements I've now confirmed the problem is this line:
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/step_terminate_monitor.c#L178
When call_external_program is called job->state has the value "3" (SLURMSTEPD_STEP_ENDING?) so it just returns.
So, the question is, why is job->state wrong or why is the external program not called in this job state?
Comment 29 Tim Shaw 2017-07-18 10:21:55 MDT
Created attachment 4939 [details]
Remove state check from UnkillableProgram

David,

Here is a patch that removes that state check so that your UnkillableProgram can run.  Can you test it and let me know how it works for you?

Also, because you've confirmed that your job is in an ending state, it makes me think that the I/O problem is in the epilog script, rather than the job.  You may want to inspect it to verify.

Tim
Comment 30 David Matthews 2017-07-21 10:09:11 MDT
(In reply to Tim Shaw from comment #29)
> Here is a patch that removes that state check so that your UnkillableProgram
> can run.  Can you test it and let me know how it works for you?

Thanks - I've implemented this but am still waiting for a failure to confirm it works (I'm sure it will).

> Also, because you've confirmed that your job is in an ending state, it makes
> me think that the I/O problem is in the epilog script, rather than the job. 
> You may want to inspect it to verify.

If you look at the slurmd.log output I provided you'll see the entry "debug:  [job 3528793] attempting to run epilog". This occurs after the "STEPD TERMINATED ... DUE TO JOB NOT ENDING WITH SIGNALS" entry. So the problem can't be in the epilog, can it?
Comment 38 Tim Shaw 2017-07-26 08:46:59 MDT
> If you look at the slurmd.log output I provided you'll see the entry "debug:
> [job 3528793] attempting to run epilog". This occur after the "STEPD
> TERMINATED ... DUE TO JOB NOT ENDING WITH SIGNALS" entry. So the problem
> can't be in the epilog can it?

Yeah, you're probably right there.

Just let me know when you've seen the next failure and how it went.
Comment 39 David Matthews 2017-07-26 09:47:12 MDT
(In reply to Tim Shaw from comment #38)
> Just let me know when you've seen the next failure and how it went.
The UnkillableStepProgram is now running fine - thanks.
Will your patch make it into the next 17.02 release?
Comment 40 Tim Shaw 2017-07-26 15:00:53 MDT
David,

This patch has been checked into commit 9f48e07c2ab8c8 and will be available in the next 17.02.7 release.

Regards.

Tim
Comment 41 David Matthews 2017-07-26 15:07:35 MDT
Thanks Tim.
Has the doc patch gone in as well?
Comment 43 Kilian Cavalotti 2017-07-26 15:12:14 MDT
Hi Tim, 

(In reply to Tim Shaw from comment #40)
> This patch has been checked into commit 9f48e07c2ab8c8 and will be available
> in the next 17.02.7 release.

Have you set a release date for 17.02.7 yet?

Thanks!
-- 
Kilian
Comment 45 Tim Shaw 2017-07-27 10:00:42 MDT
David,

I got a doc change committed in 04b431b43b which will also be out in 17.02.7.

Tim
Comment 46 Tim Shaw 2017-07-27 10:01:52 MDT
Kilian,

17.02.7 doesn't have a firm date yet but should be out in the next couple of weeks.

Tim
Comment 47 Christopher Samuel 2018-04-10 02:40:44 MDT
Just a comment that we're seeing this issue on 17.11.5 on RHEL7.

We're not using the UnkillableStepProgram but I have just increased UnkillableStepTimeout to 130s from its default of 60s.

If this solves it then it's probably best to increase the default value in the code.

cheers,
Chris
Comment 48 Andrea 2018-04-25 06:58:35 MDT
Hi There,

We are currently running 17.02.10 and the issue is not fixed.
We are doing some troubleshooting, and what we've noticed is that, even though slurm tries to kill them, the jobs are still there.

I wonder why it fails, since we are able to just use the kill command and the jobs get killed.

I can confirm that the status of the processes is R when this happens (so no zombies or D processes):

[root@syrah01 ~]# sacct -j 345914
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
345914       jobarstest  cpunormal     geomag          4    TIMEOUT      0:0
345914.batch      batch                geomag          4     FAILED    127:0

[root@cnode033 ~]# ps aufx | grep slurm -A 3
root       5069  0.0  0.0 3469752 7984 ?        S    Apr18   0:04 /cm/shared/apps/slurm/17.02.10/sbin/slurmd
gopinath 117227  0.0  0.0 113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345914/slurm_script
gopinath 117229  398  0.0 304240 10268 ?        Rl   Apr23 11550:53  \_ ./main imexrk_Nr32_Nm64_dt1.e-8.txt
gopinath 117228  0.0  0.0 113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345785/slurm_script
gopinath 117230  715  0.0 582100 10600 ?        Rl   Apr23 20749:55  \_ ./main rk4_Nr32_Nm64_dt1.e-8.txt

[root@cnode033 ~]# kill 117227

[root@cnode033 ~]# ps aufx | grep slurm -A 3
root       5069  0.0  0.0 3469752 7984 ?        S    Apr18   0:04 /cm/shared/apps/slurm/17.02.10/sbin/slurmd
gopinath 117228  0.0  0.0 113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345785/slurm_script
gopinath 117230  715  0.0 582100 10600 ?        Rl   Apr23 20757:57  \_ ./main rk4_Nr32_Nm64_dt1.e-8.txt
gopinath 117229  398  0.0 304240 10268 ?        Rl   Apr23 11555:22 ./main imexrk_Nr32_Nm64_dt1.e-8.txt

[root@cnode033 ~]# pstree -aps  117229
systemd,1 --system
  └─main,117229 imexrk_Nr32_Nm64_dt1.e-8.txt
      ├─{main},117231
      ├─{main},117232
      ├─{main},117233
      └─{main},117234

[root@cnode033 ~]# kill 117229
[root@cnode033 ~]# ps aux | grep 117229
root       9036  0.0  0.0 112656   964 pts/2    S+   14:55   0:00 grep --color=auto 117229

And the job just disappears.

We are working on a script to be executed when the job is not killable, which will perform a simple kill, but it would be good to know the root cause.
What kind of signals is Slurm sending to those applications?
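A minimal UnkillableStepProgram along the lines Andrea describes might look like the sketch below. The SLURM_JOB_ID/SLURM_STEP_ID environment variables and the log path are assumptions, not confirmed in this thread; check the slurm.conf documentation for your version:

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram: record what is stuck so an admin
# can decide what to do. SLURM_JOB_ID/SLURM_STEP_ID are assumed to be
# exported by slurmstepd for this hook; verify against your docs.
LOG="${UNKILLABLE_LOG:-/tmp/slurm_unkillable.log}"
{
    echo "$(date) job=${SLURM_JOB_ID:-unknown} step=${SLURM_STEP_ID:-unknown}"
    # Snapshot uninterruptible-sleep and zombie processes on the node.
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^[DZ]/'
} >> "$LOG"
```

Escalating to an automatic SIGKILL from this hook is possible too, but logging first preserves the evidence needed to find the root cause.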

Another thing we noticed: we killed the parent (117227) of the process, but the child kept running. (That shouldn't depend on Slurm, I suppose.)

Regards,
Andrea
Comment 49 Tim Wickberg 2018-04-25 10:05:10 MDT
Hi Andrea -

Please file a separate bug on this; I need to keep customer versions of these issues (especially resolved ones) separate from new non-customer reports of similar issues.

- Tim