We updated from 14.11.4 to 14.11.6 and discovered that slurmd started crashing after about an hour of running. The system became unusable since the daemon could not stay up for longer than 5 seconds. After analysing the crash dump we found an existing report (bug 1557) which resembled our problem, and once we took the offending node out of circulation the daemon ran nominally. It looks like this bug was re-introduced in 14.11.6.

Thanks,
Gene
Could you send a backtrace?

On May 10, 2015 3:05:07 PM PDT, bugs@schedmd.com wrote:
> http://bugs.schedmd.com/show_bug.cgi?id=1660
>
>           Site: University of Auckland
>         Bug ID: 1660
>        Summary: slurmd crashes after 5 sec of running
>        Product: Slurm
>        Version: 14.11.6
>       Hardware: Linux
>             OS: Linux
>         Status: UNCONFIRMED
>       Severity: 2 - High Impact
>       Priority: ---
>      Component: slurmd
>       Assignee: david@schedmd.com
>       Reporter: g.soudlenkov@auckland.ac.nz
>             CC: brian@schedmd.com, da@schedmd.com, david@schedmd.com,
>                 jette@schedmd.com
Hi, Danny

Here it is:

#0  0x00000036f3832625 in raise () from /lib64/libc.so.6
#1  0x00000036f3833e05 in abort () from /lib64/libc.so.6
#2  0x00000036f382b74e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000036f382b810 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004e5847 in bit_and (b1=0x2f9a750, b2=0x2f9acf0) at bitstring.c:624
#5  0x00000000004b3505 in _step_dealloc_lps (step_ptr=0x2f9aa70) at step_mgr.c:2034
#6  0x00000000004b9145 in post_job_step (step_ptr=0x2f9aa70) at step_mgr.c:4097
#7  0x00000000004aeb22 in _internal_step_complete (job_ptr=0x2f99ed0, step_ptr=0x2f9aa70) at step_mgr.c:226
#8  0x00000000004afae4 in job_step_complete (job_id=13770315, step_id=0, uid=0, requeue=false, job_return_code=15) at step_mgr.c:730
#9  0x0000000000493b3d in _slurm_rpc_step_complete (msg=0x7f990c0008d0) at proc_req.c:2949
#10 0x000000000048beed in slurmctld_req (msg=0x7f990c0008d0, arg=0x7f98bc04c160) at proc_req.c:458
#11 0x00000000004357b3 in _service_connection (arg=0x7f98bc04c160) at controller.c:1070
#12 0x00000036f3c079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00000036f38e88fd in clone () from /lib64/libc.so.6

Gene
Thanks Gene. This appears to be from slurmctld, not slurmd; is that correct? Could you send the log from before the abort? Was there a configuration change around the time it started?
Ah, sorry, my bad. Yes, I meant slurmctld.

Logs: stand by.

Gene
This is a new problem. The allocation sizes for the job and step don't match, or at least the bitmaps representing the allocated cores don't match for job 13770315 and its step #0. Did the configuration change when you restarted? If so, it would be helpful to get the before and after slurm.conf files.
Negative, the configuration did not change between restarts. We try to keep the configuration files as static as possible, with the majority of the dynamic behaviour handled by our filter plugin; it does partition routing, auth, etc.
Created attachment 1871 [details]
Patch to prevent bitsize mis-match abort

I'm not sure how these bitmaps ended up being different sizes, and I would definitely like to review the logs for clues, but this patch should prevent this specific abort. I am concerned about what else might be bad, though.

diff --git a/src/slurmctld/step_mgr.c b/src/slurmctld/step_mgr.c
index 16d7147..4d398f3 100644
--- a/src/slurmctld/step_mgr.c
+++ b/src/slurmctld/step_mgr.c
@@ -2030,10 +2030,19 @@ static void _step_dealloc_lps(struct step_record *step_ptr)
 	xassert(job_resrcs_ptr->core_bitmap_used);
 	if (step_ptr->core_bitmap_job) {
 		/* Mark the job's cores as no longer in use */
-		bit_not(step_ptr->core_bitmap_job);
-		bit_and(job_resrcs_ptr->core_bitmap_used,
-			step_ptr->core_bitmap_job);
-		/* no need for bit_not(step_ptr->core_bitmap_job); */
+		int job_core_size, step_core_size;
+		job_core_size  = bit_size(job_resrcs_ptr->core_bitmap_used);
+		step_core_size = bit_size(step_ptr->core_bitmap_job);
+		if (job_core_size != step_core_size) {
+			error("%s: %u.%u core_bitmap size mismatch (%d != %d)",
+			      __func__, job_ptr->job_id, step_ptr->step_id,
+			      job_core_size, step_core_size);
+		} else {
+			bit_not(step_ptr->core_bitmap_job);
+			bit_and(job_resrcs_ptr->core_bitmap_used,
+				step_ptr->core_bitmap_job);
+			/* no need for bit_not(step_ptr->core_bitmap_job); */
+		}
 		FREE_NULL_BITMAP(step_ptr->core_bitmap_job);
 	}
 #endif
Created attachment 1872 [details] Log file before the crash
Moe, thanks for the patch. If we apply it and recompile, can we update all running daemons without shutting down the cluster or killing the jobs?

Gene
First priority is getting that patch in place and getting your system up and running. After that, I see several odd things right away in the log file to investigate:

=================================
May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Registered job 13770315.0 on wrong node compute-c1-003

Perhaps you could search for "13770315" in an earlier slurmctld log file. I'm wondering where Slurm expects to find the job. It would probably be worthwhile to save the slurmd log(s) from the relevant nodes too.

==============================================
The other thing of note is a bunch of messages of this sort for various nodes:

May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Node compute-a1-017 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Are you expecting different slurm.conf files on the different nodes? That is almost always a bad idea.

========================================================
Then it appears there is some problem in your topology.conf file:

May  8 16:45:44 slurm-001-p slurmctld[9983]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. Do not use route/topology

If it's not clear to you what the issue is, you can send me your topology.conf and slurm.conf files and I can advise.

==================================
This might indicate a really old version of Slurm on node compute-c1-058:

error: _slurm_rpc_node_registration node=compute-c1-058: Invalid argument
(In reply to Gene Soudlenkov from comment #9)
> Moe, thanks for the patch. If we apply and recompile it, can we update all
> daemons running without shutting down the cluster/killing the jobs?
>
> Gene

This patch only affects the slurmctld; you just need the updated code on the head node. No jobs should be lost as a result of this patch. It will just log a bitmap size mismatch, and if job 13770315 has more job steps they would probably get stuck, so that specific job may need to be manually killed.
Job related logs:

<30>May  8 17:01:33 slurm-001-p slurmctld[14903]: Time limit exhausted for JobId=13770315
<30>May  8 16:35:48 slurm-001-p slurmctld[31383]: job_complete: JobID=13770315 State=0x8004 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
<11>May  8 16:35:47 compute-bigmem-001-p slurmstepd[15759]: error: *** JOB 13770315 CANCELLED AT 2015-05-08T16:35:47 *** on compute-bigmem-001
<30>May  8 16:26:20 compute-c1-003-p slurmd[28869]: Warning: revoke on job 13770315 has no expiration
<30>May  8 16:08:12 compute-c1-003-p slurmd[18165]: Warning: revoke on job 13770315 has no expiration
<30>May  8 16:05:45 compute-c1-003-p slurmd[16545]: Warning: revoke on job 13770315 has no expiration
<30>May  8 15:41:51 compute-bigmem-001-p slurmd[9965]: Launching batch job 13770315 for UID 5323
<30>May  8 15:41:51 compute-bigmem-001-p slurmd[9965]: _run_prolog: prolog with lock for job 13770315 ran for 2 seconds
<30>May  8 15:41:48 slurm-001-p slurmctld[9562]: sched: Allocate JobId=13770315 NodeList=compute-bigmem-001 #CPUs=12
<11>May  8 15:41:46 compute-c1-003-p slurmstepd[1945]: error: *** JOB 13770315 CANCELLED AT 2015-05-08T15:41:46 *** on compute-c1-003
<30>May  8 15:41:46 slurm-001-p slurmctld[9562]: job_complete: JobID=13770315 State=0x0 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
<30>May  8 15:41:46 slurm-001-p slurmctld[9562]: job_complete: JobID=13770315 State=0x0 NodeCnt=1 cancelled from interactive user or node failure
<30>May  8 15:37:02 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint vacate 13770315: Operation already in progress
<30>May  8 15:35:44 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint requeue 13770315: Operation already in progress
<30>May  8 15:35:09 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint create 13770315: Operation already in progress
<30>May  8 15:34:07 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint create 13770315: Operation already in progress
<30>May  8 15:33:31 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint requeue for 13770315 usec=11812
<30>May  8 15:28:24 compute-c1-003-p slurmd[4285]: _run_prolog: prolog with lock for job 13770315 ran for 0 seconds
<30>May  8 15:28:24 compute-c1-003-p slurmd[4285]: Launching batch job 13770315 for UID 5323
<30>May  8 15:28:24 slurm-001-p slurmctld[9562]: sched: Allocate JobId=13770315 NodeList=compute-c1-003 #CPUs=12
<85>May  8 15:15:39 compute-gpu-c1-001-p sudo: jbla572 : TTY=pts/18 ; PWD=/home/jbla572/test ; USER=root ; COMMAND=/usr/bin/scontrol update jobid=13770315 priority=60000
<30>May  8 15:15:38 slurm-001-p slurmctld[9562]: _slurm_rpc_update_job complete JobId=13770315 uid=0 usec=43095
<30>May  8 15:15:38 slurm-001-p slurmctld[9562]: sched: update_job: setting priority to 60000 for job_id 13770315
<30>May  8 15:15:34 slurm-001-p slurmctld[9562]: _slurm_rpc_submit_batch_job JobId=13770315 usec=10996
I'll try to piece this all together on Monday morning, but it looks like the job started on one node (compute-c1-003), then got requeued and restarted on another node (compute-bigmem-001). If you can capture the slurmd logs on those two compute nodes with respect to job 13770315, that may help. Given this additional information, I'm pretty sure that my patch does the right thing.
Is your system running now with the patch?
The patch has been installed and the service restarted. It seems to be running OK, but we will keep an eye on it.

Thanks a lot, guys

Gene
This is what appears to have happened:

15:28:24  Job 13770315 started on compute-c1-003
15:33:31  Job checkpoint requeue requested
15:41:46  Job cancelled from interactive user or node failure. It appears not to have fully terminated, with job step 0 still active on compute-c1-003
15:41:51  Job 13770315 restarted on compute-bigmem-001
16:39:30  Job step 13770315.0 reported on compute-c1-003 from the incomplete termination, so slurmctld tries to clean up its state for the job. The core counts of compute-c1-003 and compute-bigmem-001 differ, so the bitmap AND operation results in an abort due to the difference in sizes.

I checked two things into the code:

1. When a job is requeued, make sure that all step state information has been purged (which you do not have, but probably will not need with the patch that you do have).
2. When terminating a job step, check the bitmap sizes and report an error if they differ rather than aborting (which is the patch you already have).

Both changes will be in version 14.11.7 when released, and the commit is here:
https://github.com/SchedMD/slurm/commit/beecc7b08e93689c0f26f87d4ad7f00e43a69612
I am closing this bug based upon the patch cited in comment #16. Please re-open if necessary. I also took the liberty of opening a new bug to pursue the configuration problems identified in comment #10. See http://bugs.schedmd.com/show_bug.cgi?id=1662