Ticket 2573 - Contiguous Crashes slurmctld
Summary: Contiguous Crashes slurmctld
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.8
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-03-21 01:29 MDT by Paul Edmon
Modified: 2016-04-06 09:36 MDT

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.11
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Backtrace of core dump (76.09 KB, text/x-log)
2016-03-28 00:52 MDT, Paul Edmon
Details
slurm.conf (41.59 KB, text/plain)
2016-03-30 04:11 MDT, Paul Edmon
Details
job_submit.lua (10.48 KB, text/plain)
2016-04-01 01:43 MDT, Paul Edmon
Details

Description Paul Edmon 2016-03-21 01:29:30 MDT
We had our master go down over the weekend due to a job requesting:

#SBATCH --contiguous
#SBATCH -N 1
#SBATCH -n 16

When we dropped --contiguous the problem went away and the job scheduled.  When the scheduler crashed it produced the following errors:


Mar 19 15:28:23 holy-slurm01 slurmctld[20549]: error: cons_res: _compute_c_b_task_dist invalid allocation for job 58493007
Mar 19 15:28:23 holy-slurm01 slurmctld[20549]: error: cons_res: cr_dist: Error in _compute_c_b_task_dist
Mar 19 15:28:23 holy-slurm01 slurmctld[20549]: error: Select plugin failed to set job resources, nodes
Mar 19 15:28:23 holy-slurm01 slurmctld[20549]: error: _add_job_to_res: job 58493007 has no job_resrcs info
Mar 19 15:28:23 holy-slurm01 slurmctld[20549]: error: select_g_select_nodeinfo_set(58493007): No error
Mar 19 15:28:23 holy-slurm01 kernel: slurmctld_sched[20549]: segfault at 50 ip 0000000000491276 sp 00007fff4e5b5a20 error 4 in slurmctld[400000+28e000]

I also have a core dump from the scheduler at that time.

Let me know if you need more info.  The scheduler is running fine again, this seems to be an edge case issue.  Thanks.
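For reference, a minimal submission script that exercises the same request would look roughly like the sketch below (the partition name and the command are assumptions; any command should do, since the failure happens in slurmctld at scheduling time):

#!/bin/bash
#SBATCH -p general        # assumed partition name, adjust per site
#SBATCH --contiguous
#SBATCH -N 1
#SBATCH -n 16
srun hostname             # placeholder workload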
Comment 1 Tim Wickberg 2016-03-21 02:20:42 MDT
Can you give us a backtrace from the core dump?

(gdb) bt
(gdb) thread apply all bt
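A non-interactive sketch for capturing both backtraces to a file for attachment (the binary and core-file paths below are assumptions and will vary per install):

gdb -batch -ex 'set pagination off' \
    -ex 'bt' -ex 'thread apply all bt' \
    /usr/sbin/slurmctld /var/spool/slurmctld/core.20549 \
    > slurmctld-backtrace.txt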
Comment 2 Paul Edmon 2016-03-23 05:56:53 MDT
So the scheduler hung up again. Here is the job:

[root@holy-slurm01 ~]# scontrol -dd show job 58831633
JobId=58831633 JobName=Femig1
    UserId=sooran(5699492) GroupId=li_lab_seas(403252)
    Priority=0 Nice=0 Account=li_lab_seas QOS=normal
    JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    DerivedExitCode=0:0
    RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
    SubmitTime=2016-03-23T14:35:01 EligibleTime=2016-03-23T14:35:01
    StartTime=Unknown EndTime=Unknown
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=general AllocNode:Sid=rclogin06:3698
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=(null)
    NumNodes=1-1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=64,mem=12800,node=1
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=64 MinMemoryCPU=200M MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    Shared=OK Contiguous=1 Licenses=(null) Network=(null)
    Command=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/qvasp
    WorkDir=/n/regal/li_lab_seas/SK/vasp_t/test2/test3
    StdErr=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.e58831633
    StdIn=/dev/null
    StdOut=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.o58831633
    BatchScript=
#!/bin/bash

#SBATCH -J Femig1
#SBATCH -p general            # Queue name
#SBATCH -N 1                  # Total number of nodes requested (64 cores/node)
#SBATCH -n 64                 # Total number of mpi tasks requested
#SBATCH -t 48:00:00           # Run time (hh:mm:ss)
#SBATCH --mem-per-cpu=200     # Memory pool for all cores
#SBATCH -o run.o%j            # Output and error file name (%j expands to jobID)
#SBATCH -e run.e%j            # Output and error file name (%j expands to jobID)
#SBATCH --mail-type=END       # Type of email notification- BEGIN, END, FAIL, ALL
#SBATCH --mail-user=sooran@seas.harvard.edu # Email to which notifications will be sent
#SBATCH --contiguous

module load intel/15.0.0-fasrc01 openmpi/1.10.2-fasrc01 intel-mkl/11.0.0.079-fasrc02
mpirun -np $SLURM_NTASKS ~/bin/vasp > vasp.out

    Power= SICP=0

So it definitely has to do with the --contiguous flag and -N 1 at the same time.

-Paul Edmon-

Comment 3 Tim Wickberg 2016-03-25 06:27:55 MDT
Have you had a chance to get a backtrace from the original core dump you mentioned, or the more recent one?

I haven't managed to recreate this just yet here, and the backtrace would help isolate what I should be looking at.

- Tim
Comment 4 Paul Edmon 2016-03-28 00:50:43 MDT
Ah, my mistake. I thought I had. Give me a sec.
Comment 5 Paul Edmon 2016-03-28 00:52:15 MDT
Created attachment 2936 [details]
Backtrace of core dump
Comment 7 Alejandro Sanchez 2016-03-30 03:35:19 MDT
Paul, could you please attach your slurm.conf? Thanks.
Comment 8 Paul Edmon 2016-03-30 04:11:57 MDT
Created attachment 2951 [details]
slurm.conf
Comment 9 Paul Edmon 2016-03-30 04:12:39 MDT
Here it is.  Just FYI we are planning on upgrading to 15.08.9 on Monday.  Let us know if that will be a problem.
Comment 14 Alejandro Sanchez 2016-04-01 01:40:15 MDT
Paul, just a few more questions. Does the crash happen every time someone specifies a job script such as the one in the first comment, or did it only happen once? Also, could you please attach your job_submit plugin? Thank you.
Comment 15 Paul Edmon 2016-04-01 01:43:07 MDT
It seems to be every time someone does:

#SBATCH -N 1
#SBATCH --contiguous

We've had a few failures due to that.  It's reproducible in our environment at least.  I will attach our job_submit plugin in just a moment.

-Paul Edmon-

Comment 16 Paul Edmon 2016-04-01 01:43:29 MDT
Created attachment 2966 [details]
job_submit.lua
Comment 19 Moe Jette 2016-04-06 09:36:00 MDT
Not easy to track down, but you'll find a one-line fix here:

https://github.com/SchedMD/slurm/commit/47a07b546343efa38307d3b4c2fefeb5d8ddbef2

The fix will be in version 15.08.11 when released, likely in May.
Please re-open the ticket if this doesn't fix the problem for you.
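A sketch for checking whether a given build already contains the fix, or for backporting it onto a local 15.08.x tree until 15.08.11 ships (the clone location is an assumption):

git clone https://github.com/SchedMD/slurm.git && cd slurm
# list release tags that already contain the commit
git tag --contains 47a07b546343efa38307d3b4c2fefeb5d8ddbef2
# or backport the one-line change onto a checked-out 15.08.x branch
git cherry-pick 47a07b546343efa38307d3b4c2fefeb5d8ddbef2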