Bug 5068 - Duplicate job id & batch job complete failure
Summary: Duplicate job id & batch job complete failure
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.2
Hardware: Other Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2018-04-15 18:38 MDT by Ahmed Arefin
Modified: 2019-11-14 10:49 MST
CC: 5 users

Site: CSIRO


Attachments
logs - slurmctld and slurmd (569.31 KB, application/x-zip-compressed)
2018-04-15 18:38 MDT, Ahmed Arefin
slurmd (b060) - batch job complete failure (7.15 MB, text/plain)
2018-04-15 22:36 MDT, Ahmed Arefin
logs - slurmctld and slurmd 30-APR-2018 (823.91 KB, application/x-zip-compressed)
2018-04-29 18:17 MDT, Ahmed Arefin
b025 - slurmd duplicate job id (249.45 KB, text/plain)
2018-04-29 18:24 MDT, Ahmed Arefin
Slurm.conf, cgroup.conf and gres (4.00 KB, application/x-zip-compressed)
2018-04-30 19:38 MDT, Ahmed Arefin
Job submit lua (1.79 KB, text/plain)
2018-05-01 22:33 MDT, Ahmed Arefin

Description Ahmed Arefin 2018-04-15 18:38:06 MDT
Created attachment 6635 [details]
logs - slurmctld and slurmd

Hello Team,

Following our recent Slurm upgrade to version 17.11.2, we are experiencing a “duplicate job id” issue on nodes, which also drains the machines.

The duplicate job id issue has not been solved by turning off the ‘job preemption’ parameter in the slurm.conf file. Here is an example log from node b036, which was affected:

[2018-04-15T21:37:58.624] [15165425.batch] task/cgroup: /slurm/uid_581585/job_15165425/step_batch: alloc=24576MB mem.limit=24576MB memsw.limit=unlimited
[2018-04-15T21:38:11.996] error: Job 15165425 already running, do not launch second copy
[2018-04-15T21:38:11.999] [15165425.batch] error: *** JOB 15165425 ON b036 CANCELLED AT 2018-04-15T21:38:11 DUE TO JOB REQUEUE ***
[2018-04-15T21:38:13.087] [15165425.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15

are018@cm01:~> sacct -j 15165425
       JobID    JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------
15165425     FMD6-LIII+ h2gpu,h24+    xxx000          7        1   00:00:00   00:00:00    PENDING                             24Gn   None assigned
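For reference, a quick way to see which nodes are drained and why (a generic sketch using standard Slurm client commands; the node and job id are the ones from the log above):

sinfo -R                                                    # drained/down nodes with the recorded reason
scontrol show node b036 | grep -E 'State|Reason'            # state and drain reason of one node
sacct -j 15165425 --format=JobID,State,ExitCode,NodeList    # how accounting sees the job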
Comment 1 Ahmed Arefin 2018-04-15 22:36:21 MDT
Created attachment 6636 [details]
slurmd (b060) - batch job complete failure
Comment 2 Ahmed Arefin 2018-04-15 22:37:34 MDT
Comment on attachment 6635 [details]
logs - slurmctld and slurmd

Duplicate job id
Comment 3 Ahmed Arefin 2018-04-15 22:38:34 MDT
Further to my notes below, we are also experiencing "batch job complete failure" (and drain) on a couple of nodes. For example, see the slurmd log from node b060.
Comment 4 Felip Moll 2018-04-16 05:58:00 MDT
Hi Ahmed,

I think the limits for the slurmctld and slurmd daemons, or for the entire system, are too low for this volume of ~1500 concurrent jobs and 500 nodes.
Your slurmctld is crashing continuously, which would explain the issues you are seeing.

Please fix this urgently:

[2018-04-16T01:58:00.128] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

Check the limits for the Slurm daemons, the slurm user, and the system as a whole. Set at least:

/proc/sys/fs/file-max: 32.832

Follow these guidelines: https://slurm.schedmd.com/high_throughput.html
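A minimal sketch for raising the system-wide open-file limit mentioned above (the value and the sysctl.d file name are placeholders, not a recommendation for this site):

# current system-wide limit
cat /proc/sys/fs/file-max
# raise it on the running system (placeholder value)
sysctl -w fs.file-max=131072
# persist it across reboots (hypothetical file name)
echo 'fs.file-max = 131072' > /etc/sysctl.d/90-slurm.conf
sysctl --system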
If you use systemd, a starting point would be (note the TasksMax and Limit* settings):

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmctld.pid
TasksMax=infinity
LimitNOFILE=1048576
LimitNPROC=1541404
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target


Also fix this:
[2018-04-16T01:58:09.371] error: chdir(/var/log): Permission denied

And check that node galaxy-bio is set correctly in both slurm.conf and gres.conf (if you have one):
[2018-04-16T01:56:14.622] error: _slurm_rpc_node_registration node=galaxy-bio: Invalid argument

The log also shows many errors caused by the daemon failure:
[2018-04-16T01:58:09.381] error: _shutdown_backup_controller:send/recv: Connection refused



After fixing this, check everything again. Remember that you have to *restart* the daemons for some limits to apply; otherwise you have to change them manually for the running processes (see /proc/<pid>/limits).
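A minimal sketch for checking what limits the running daemons actually have, and for bumping them without a restart (prlimit is part of util-linux; the value is a placeholder):

# limits currently applied to the running slurmctld
grep -E 'open files|processes' /proc/$(pidof slurmctld)/limits
# raise the open-files limit of the running process (placeholder value)
prlimit --pid $(pidof slurmctld) --nofile=1048576
# the clean way remains setting Limit* in the unit file and restarting
systemctl daemon-reload && systemctl restart slurmctld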


Regarding your second comment and the draining nodes: if a job is requeued due to a node failure, it is normal for that node to be set to drain. This can happen in your situation if a socket cannot be opened to communicate with some nodes. I wouldn't worry about that until you fix the problem mentioned above.

It would also be nice if you could fix this, seen in slurmd:
[2018-03-21T10:06:04.778] error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured


Tell me how it goes.
Comment 5 Felip Moll 2018-04-16 09:51:56 MDT
I just realized that you are already working on this with Alex in bug 5064, filed by James Powell.

Let's track the resolution of that issue in the other bug; afterwards, if the duplicate job id keeps showing up, we can diagnose it from scratch again here.

Please keep me posted on the progress.
Comment 6 Felip Moll 2018-04-19 03:02:43 MDT
Hi Ahmed,

I see that the issue from 5064 has been solved.

Are you still experiencing this issue?
Comment 7 Ahmed Arefin 2018-04-19 21:18:30 MDT
Yes, the duplicate job id issue has been resolved following your input (LimitNOFILE=262144 and TasksMax=infinity); however, we are still experiencing the 'batch job complete failure' issue from time to time. Please see the b060 logs.

Any thoughts?
Comment 8 Felip Moll 2018-04-23 04:57:15 MDT
(In reply to Ahmed Arefin from comment #7)
> Yes, the duplicate job id issue has been resolved following your
> input (LimitNOFILE=262144 and TasksMax=infinity),

Good, glad it helped.

> however, we are still
> experiencing the 'batch job complete failure' issue from time to time. Please see
> the b060 logs.
> 
> Any thoughts?

Hm, I checked the logs again and I identified these situations:

1st. It is possible that the job reached its memory limit. I see you are using cgroups; are you enforcing memory limits in cgroup.conf (ConstrainRAMSpace=yes)? If so, ensure that slurm.conf has the following set:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

This disables the internal memory-limit enforcement mechanism and the job-acct-gather memory enforcement mechanism, leaving only one mechanism, the cgroup one, enabled for memory-limit enforcement. Having all three mechanisms enabled at once can cause issues. (A quick way to verify these settings is sketched at the end of this comment.)


[2018-04-13T13:28:12.972] [15132188.batch] task/cgroup: /slurm/uid_296431/job_15132188/step_batch: alloc=81920MB mem.limit=81920MB memsw.limit=unlimited
[2018-04-13T13:38:43.538] [15116233.batch] error: Step 15116233.4294967294 hit memory limit at least once during execution. This may or may not result in some failure.
[2018-04-13T13:38:43.539] [15116233.batch] error: Job 15116233 hit memory limit at least once during execution. This may or may not result in some failure.


2nd. Some jobs are cancelled due to their time limit; check that the ones where you see batch failures are not these:
[2018-04-15T16:26:44.193] [15162432.batch] error: *** JOB 15162432 ON b060 CANCELLED AT 2018-04-15T16:26:44 DUE TO TIME LIMIT ***

3rd. I am not sure if this is related, but it would be good if you could fix it:
[2018-03-14T12:01:38.473] [14459833.0] error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured

4th. This was probably caused by the already fixed problem:
[2018-03-14T12:01:38.715] [14459833.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-03-14T12:01:38.716] [14459833.0] done with job



If none of this makes sense to you, I will need new and complete slurmctld logs, slurmd logs, and 'scontrol show job' output for a failing job. Is it reproducible, or does it happen sporadically?
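A sketch of what to collect and check, tying back to point 1 above (the job id and the cgroup.conf path are placeholders; adjust to your installation):

# only the cgroup mechanism should be enforcing memory limits
scontrol show config | grep -E 'MemLimitEnforce|JobAcctGatherParams'
grep -i ConstrainRAMSpace /etc/slurm/cgroup.conf
# details of a failing job while the controller still knows about it
scontrol show job <jobid>
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,NodeList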
Comment 9 Ahmed Arefin 2018-04-23 18:40:46 MDT
The following lines were added to the slurm.conf file.


# SchedMD suggested changes Apr18

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

I have also ‘resumed’ the nodes that were facing the ‘batch job complete failure’; we will now wait and see if the error comes back.
Comment 10 Ahmed Arefin 2018-04-29 18:17:55 MDT
Created attachment 6723 [details]
logs - slurmctld and slurmd 30-APR-2018

logs - slurmctld and slurmd 30-APR-2018
"Batch job complete failure"
slurmd log from b027.
Comment 11 Ahmed Arefin 2018-04-29 18:19:16 MDT
Still not resolved

batch job complete f root      2018-04-29T01:23:26 b[027,038,043]
batch job complete f root      2018-04-29T01:23:55 b[053,055,089]
batch job complete f slurm     2018-04-30T00:12:45 b078


Logs added: slurmctld and slurmd, retrieved on 30-APR-2018.
Error: "Batch job complete failure"
The slurmd log is from host b027.
Comment 12 Ahmed Arefin 2018-04-29 18:24:34 MDT
Created attachment 6724 [details]
b025 - slurmd duplicate job id

b025 - slurmd duplicate job id
Comment 13 Felip Moll 2018-04-30 06:09:27 MDT

[2018-04-28T01:07:46.075] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job15651535/slurm_script

[2018-04-28T01:09:41.763] error: Job 15651535 already running, do not launch second copy


1. Is the slurmd spool dir set to a local filesystem?
2. Please, send me the latest slurm.conf, cgroup.conf and gres.conf
3. What do you have in job_submit.lua?
4. GresPlugins seems to be inconsistently configured. I.e. messages like:

[2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin configured to unpack data type ap-southeast-2 from node galaxy-bio

5. Is cm02, your backup controller, reachable? Up? Configured?
6. Is b[101-108] in your DNS or hosts.conf? I see messages like:

error: _find_node_record(751): lookup failure for b101

7. Is it normal that your jobs last for just 3 seconds?
8. What happened to b025 from 04-27-2018@22:03 to 04-28-2018@00:02? Was it frozen or rebooted?

[2018-04-27T22:03:28.050] [15648753.batch] done with job
[2018-04-28T00:02:40.400] Message aggregation disabled

After that I see some restarts of the slurmd daemon.

9. In b027 I see a restart with an error. What is this about?

[2018-04-30T00:02:01.605] Message aggregation disabled
[2018-04-30T00:02:01.605] CPU frequency setting not configured for this node
[2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
[2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

10. I cannot correlate the slurmd log on b027 with the slurmctld log, since it starts at 2018-04-30 and the event happened on 2018-04-29.

11. Regarding what I think is your failed job, I see:

[2018-04-29T01:22:26.137] [15657370.batch] error: *** JOB 15657370 ON b027 CANCELLED AT 2018-04-29T01:22:26 DUE TO TIME LIMIT ***
[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2018-04-29T01:23:27.001] [15657370.batch] done with job

This seems to be related to bug 3941 and may indeed be the cause of the nodes being drained.


Please clarify the previous points for me; I will investigate point 11 further.
Comment 14 Felip Moll 2018-04-30 06:17:03 MDT
> Please clarify the previous points for me; I will investigate point 11 further.

12. One more thing: would it be possible to get the system log (/var/log/messages) of b027, starting at 2018-04-28 and ending at 2018-04-30?
Comment 15 Ahmed Arefin 2018-04-30 19:38:12 MDT
Created attachment 6730 [details]
Slurm.conf, cgroup.conf and gres

Slurm.conf, cgroup.conf and gres
Comment 16 Ahmed Arefin 2018-04-30 19:38:52 MDT
Created attachment 6731 [details]
b027 messages

b027 messages
Comment 17 Ahmed Arefin 2018-05-01 00:18:39 MDT
1. Is the slurmd spool dir set to a local filesystem?
Yes.
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool

2. Please, send me the latest slurm.conf, cgroup.conf and gres.conf
Attached.

3. What do you have in job_submit.lua?
Where is this file (location)? 

4. GresPlugins seems to be inconsistently configured. I.e. messages like:

[2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin configured to unpack data type ap-southeast-2 from node galaxy-bio

Is that a problem? 

5. Is cm02, your backup controller, reachable? Up? Configured?
Yes.

6. Is b[101-108] in your DNS or hosts.conf? I see messages like:

error: _find_node_record(751): lookup failure for b101

They were taken away from Slurm for a Windows deployment. Do we need to do something on the BCM to let Slurm know about them?


7. Is it normal that your jobs last for just 3 seconds?
?

8. What happened to b025 from 04-27-2018@22:03 to 04-28-2018@00:02? Was it frozen or rebooted?

[2018-04-27T22:03:28.050] [15648753.batch] done with job
[2018-04-28T00:02:40.400] Message aggregation disabled

After that I see some restarts of the slurmd daemon.

It wasn't frozen or rebooted.

9. In b027 I see a restart with an error. What is this about?

[2018-04-30T00:02:01.605] Message aggregation disabled
[2018-04-30T00:02:01.605] CPU frequency setting not configured for this node
[2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
[2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

10. I cannot correlate the slurmd log on b027 with the slurmctld log, since it starts at 2018-04-30 and the event happened on 2018-04-29.

11. Regarding what I think is your failed job, I see:

[2018-04-29T01:22:26.137] [15657370.batch] error: *** JOB 15657370 ON b027 CANCELLED AT 2018-04-29T01:22:26 DUE TO TIME LIMIT ***
[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2018-04-29T01:23:27.001] [15657370.batch] done with job

This seems to be related to bug 3941 and may indeed be the cause of the nodes being drained.


Please clarify the previous points for me; I will investigate point 11 further.

- Yes please.


12. One more thing, would it be possible to get the system log (/var/log/messages) of b027 starting at 2018-04-28 and ending at 2018-04-30 ?

Attached.
Comment 18 Felip Moll 2018-05-01 06:13:17 MDT
(In reply to Ahmed Arefin from comment #17)

> 3. What do you have in job_submit.lua?
> Where is this file (location)? 

It is referenced in your slurmctld log file:

[2018-04-30T00:02:33.490] job_submit.lua: uid=334466, name='sbatch_production_script', alloc_node='b033': set partition=h2gpu,h24gpu,gpu


job_submit.lua should be in the same directory as the slurm.conf file, and it modifies your jobs at submission time.

See 'man slurm.conf' and grep for JobSubmitPlugins.
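A quick way to locate it, assuming the common layout where the script sits next to slurm.conf (paths may differ under Bright Cluster Manager):

# which slurm.conf is in use, and whether the lua job submit plugin is enabled
scontrol show config | grep -iE 'SLURM_CONF|JobSubmitPlugins'
# the script is normally in the directory reported for SLURM_CONF
ls -l /etc/slurm/job_submit.lua     # adjust the path to match SLURM_CONF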


> 4. GresPlugins seems to be inconsistently configured. I.e. messages like:
>
> [2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin
> configured to unpack data type ap-southeast-2 from node galaxy-bio
> 
> Is that a problem? 

Well, yes, indeed it is... this is the same as point 9.

This message indicates missing information about a GRES plugin called... "ap-southeast-2"... for node galaxy-bio.

As far as I know, "ap-southeast-2" is an Amazon AWS availability zone name, not a GRES plugin... what???
I guess somebody in your organization modified slurm.conf with some copy-and-paste and messed something up.

		if (i >= gres_context_cnt) {
			error("gres_plugin_node_state_unpack: no plugin "
			      "configured to unpack data type %u from node %s",
			      plugin_id, node_name);
			/* A likely sign that GresPlugins has changed.
			 * Not a fatal error, skip over the data. */
			continue;
		}


The error messages confirm that:

> [2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
> [2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

You have to do some deeper checks on your config and daemon status; I cannot help with such inconsistencies.


More on gres...your gres.conf looks like:

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=mic Count=0
Name=one Count=1
Name=memdir Count=64

Remove 'Name=mic Count=0'; it makes no sense, and removing it will get rid of errors in the log files.
Also change GresTypes in your slurm.conf from 'GresTypes=gpu,memdir,mic,one' to 'GresTypes=gpu,memdir,one'. A sketch of the resulting files follows.
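A sketch of the resulting files, keeping everything else from your current configuration unchanged:

gres.conf:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=one Count=1
Name=memdir Count=64

slurm.conf:
GresTypes=gpu,memdir,one

As the log message above says, slurmctld has to be restarted for a GresPlugins/GresTypes change to take effect.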

> 5. Is cm02, your backup controller, reachable? Up? Configured?
> Yes.

Are you sure? Can you run all the network tests (ping, port scan, and whatever else is necessary) to ensure that it is *really* reachable?

[2018-04-30T07:53:37.174] error: _shutdown_backup_controller:send/recv: Connection refused

Please demonstrate to me that it is OK.
I also want the output of 'ps aux | grep -i slurm' and 'netstat -anlp | grep -i slurm' on cm02.

Also, the state save location must be the same between cm02 and cm01.
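A minimal sketch of the checks requested above (6817 is only the default SlurmctldPort; use whatever your config reports):

# on cm01: the relevant settings, then basic reachability of the backup
scontrol show config | grep -E 'SlurmctldPort|BackupController|StateSaveLocation'
ping -c 3 cm02
nc -zv cm02 6817                   # port check against the configured SlurmctldPort
# on cm02: is slurmctld running and listening?
ps aux | grep -i [s]lurm
netstat -anlp | grep -i slurm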

> 6. Is b[101-108] in your DNS or hosts.conf? I see messages like:
> 
> error: _find_node_record(751): lookup failure for b101
> 
> Taken away from Slurm for windows Deployment. Do we need to do something on
> the BCM to let slurm know about them?

All nodes defined in slurm.conf must be resolvable or have the address explicitly set.
It seems you first removed the nodes from DNS and then from Slurm, generating these errors. 

This is not the correct order.

Before deleting a node, the usual procedure is to drain the nodes, then remove them from slurm.conf,
restart the slurmctld daemon, issue an 'scontrol reconfig', and finally remove them from DNS or do whatever else you want with those nodes.

The slurmctld daemon has a multitude of bitmaps to track the state of nodes and cores in the system. Removing nodes from a running system would require slurmctld to rebuild all of those bitmaps, which the developers feel is safer to do by restarting the daemon.

You also want to run 'scontrol reconfig' so that all nodes re-read slurm.conf.
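A sketch of that order of operations for the b[101-108] nodes mentioned above (the drain reason is arbitrary):

# 1. drain the nodes so nothing new is scheduled on them
scontrol update NodeName=b[101-108] State=DRAIN Reason="decommission"
# 2. once they are idle, remove them from slurm.conf (and gres.conf if present)
# 3. restart the controller so its node/core bitmaps are rebuilt
systemctl restart slurmctld
# 4. make all remaining slurmd daemons re-read slurm.conf
scontrol reconfig
# 5. only now remove the nodes from DNS / the cluster manager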


> 7. Is it normal that your jobs last for just 3 seconds?
> ?
> 

I see jobs that just start and end in about 3 seconds. If you have very short jobs but hundreds of them, Slurm requires special tuning.

https://slurm.schedmd.com/high_throughput.html

I am expecting some explanation of your general cluster use case, and whether what I observed is expected or not.

You just have to look at the b027 slurm logs to see what I mean.

> 
> 12. One more thing, would it be possible to get the system log
> (/var/log/messages) of b027 starting at 2018-04-28 and ending at 2018-04-30 ?
> 
> Attached.

Good, nothing strange.


---------


Some advice:

- Change your cgroup configuration. Remove this line from cgroup.conf; it is no longer needed in 17.11:
CgroupReleaseAgentDir="/etc/slurm/cgroup"

- A recommendation for your task/affinity setup:
       It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and
       setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity
       plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and
       uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best
       of both pieces.

slurm.conf:
TaskPlugin=task/affinity,task/cgroup

cgroup.conf:
ConstrainCores=yes
TaskAffinity=no
Comment 19 Ahmed Arefin 2018-05-01 22:33:18 MDT
Created attachment 6744 [details]
Job submit lua

Job submit lua
Comment 20 Ahmed Arefin 2018-05-01 22:50:24 MDT
I have uploaded the job_submit.lua; feel free to have a look. We are going to apply the following suggested changes and will let you know the outcome.


- Change your cgroup configuration. Remove this line from cgroup.conf; it is no longer needed in 17.11:
CgroupReleaseAgentDir="/etc/slurm/cgroup"

- A recommendation for your task/affinity setup:
       It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and
       setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity
       plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and
       uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best
       of both pieces.

slurm.conf:
TaskPlugin=task/affinity,task/cgroup

cgroup.conf:
ConstrainCores=yes
TaskAffinity=no
Comment 21 Felip Moll 2018-05-14 06:24:29 MDT
Hi Ahmed,

I just wanted to know if you are still experiencing the problem after the changes I proposed.

There may indeed still be a problem with jobs not ending with signals, but I would like to know the situation in your specific case.

Thanks
Comment 22 Ahmed Arefin 2018-05-14 17:42:15 MDT
Yes, we are still experiencing this issue. We applied the suggested changes, but we are waiting for a cluster-wide drain and reboot to propagate them, which has been delayed until next week due to a bug in the Bright Cluster Manager. More news soon.
Comment 23 Felip Moll 2018-05-24 02:35:39 MDT
(In reply to Ahmed Arefin from comment #22)
> Yes, we are still experiencing this issue. We applied the suggested changes,
> but we are waiting for a cluster-wide drain and reboot to propagate them,
> which has been delayed until next week due to a bug in the Bright
> Cluster Manager. More news soon.

Hi Ahmed, have you finally applied the changes and rebooted to propagate them?

Please, keep me informed, thanks.
Comment 24 Felip Moll 2018-06-06 03:16:10 MDT
Hi Ahmed,

Any info on this matter?
Comment 25 Ahmed Arefin 2018-06-06 20:08:35 MDT
Hello,

We think the issue has been resolved. Please wait a couple of days for us to observe further, then close this case. Thanks for your help.

Note: We have also applied:
cm01:~ # scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 180 sec

* This gives slurmd 3 minutes to clean up after forcing a job to quit, rather than the default 60 seconds, which can be cutting it close on a busy file system (especially when large core files are involved).
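For reference, a sketch of how such a change is typically applied; the slurm.conf keyword is UnkillableStepTimeout, and propagating it with 'scontrol reconfig' is an assumption of common practice rather than something stated in this ticket:

# slurm.conf
UnkillableStepTimeout=180

# push out the change and verify
scontrol reconfig
scontrol show config | grep UnkillableStepTimeout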
Comment 26 Felip Moll 2018-06-07 05:27:04 MDT
(In reply to Ahmed Arefin from comment #25)
> Hello,
> 
> We think the issue has been resolved. Please wait a couple of days
> for us to observe further, then close this case. Thanks for your help.
> 
> Note: We have also applied:
> cm01:~ # scontrol show config | grep UnkillableStepTimeout
> UnkillableStepTimeout   = 180 sec
> 
> * This gives slurmd 3 minutes to clean up after forcing a job to quit,
> rather than the default 60 seconds, which can be cutting it close on a busy file
> system (especially when large core files are involved).

Good, I will wait until next week and if there's no more input here I will close the issue.

Glad it is better now.

Thanks,
Felip
Comment 27 Felip Moll 2018-06-11 05:43:43 MDT
Hi,

I am closing this bug, assuming that after the configuration cleanup the original errors (duplicate job id and batch job complete failure) have disappeared.

Please, if you continue to see a lot of errors like:

[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15

don't hesitate to re-open this bug, or add yourself to bug 5262, which deals specifically with this error.

Regards
Felip