| Summary: | Contiguous Crashes slurmctld | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 15.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 Clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.11 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments:
- Backtrace of core dump
- slurm.conf
- job_submit.lua
Description
Paul Edmon
2016-03-21 01:29:30 MDT
Tim:

Can you give us a backtrace from the core dump?

    (gdb) bt
    (gdb) thread apply all bt

Paul Edmon:

So the scheduler hung up again. Here is the job:

    [root@holy-slurm01 ~]# scontrol -dd show job 58831633
    JobId=58831633 JobName=Femig1
       UserId=sooran(5699492) GroupId=li_lab_seas(403252)
       Priority=0 Nice=0 Account=li_lab_seas QOS=normal
       JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0
       RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
       SubmitTime=2016-03-23T14:35:01 EligibleTime=2016-03-23T14:35:01
       StartTime=Unknown EndTime=Unknown
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=general AllocNode:Sid=rclogin06:3698
       ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
       NumNodes=1-1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=64,mem=12800,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=64 MinMemoryCPU=200M MinTmpDiskNode=0
       Features=(null) Gres=(null) Reservation=(null)
       Shared=OK Contiguous=1 Licenses=(null) Network=(null)
       Command=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/qvasp
       WorkDir=/n/regal/li_lab_seas/SK/vasp_t/test2/test3
       StdErr=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.e58831633
       StdIn=/dev/null
       StdOut=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.o58831633
       Power= SICP=0
       BatchScript=
    #!/bin/bash
    #SBATCH -J Femig1
    #SBATCH -p general          # Queue name
    #SBATCH -N 1                # Total number of nodes requested (64 cores/node)
    #SBATCH -n 64               # Total number of mpi tasks requested
    #SBATCH -t 48:00:00         # Run time (hh:mm:ss)
    #SBATCH --mem-per-cpu=200   # Memory pool for all cores
    #SBATCH -o run.o%j          # Output and error file name (%j expands to jobID)
    #SBATCH -e run.e%j          # Output and error file name (%j expands to jobID)
    #SBATCH --mail-type=END     # Type of email notification: BEGIN, END, FAIL, ALL
    #SBATCH --mail-user=sooran@seas.harvard.edu  # Email to which notifications will be sent
    #SBATCH --contiguous

    module load intel/15.0.0-fasrc01 openmpi/1.10.2-fasrc01 intel-mkl/11.0.0.079-fasrc02
    mpirun -np $SLURM_NTASKS ~/bin/vasp > vasp.out

So it definitely has to do with the --contiguous flag and -N 1 at the same time.

-Paul Edmon-

Tim:

Have you had a chance to get a backtrace from the original core dump you mentioned, or the more recent one? I haven't managed to recreate this just yet here, and the backtrace would help isolate what I should be looking at.

- Tim

Paul Edmon:

Ah, my mistake. I thought I had. Give me a sec.

Created attachment 2936 [details]
Backtrace of core dump
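For reference, a backtrace like the one attached is typically captured along the lines of the sketch below. The binary and core paths are assumptions for illustration (the actual core location depends on the system's core_pattern), so adjust them before use:

    # Capture the crashing thread's backtrace and all threads' backtraces
    # from a slurmctld core dump, non-interactively.
    # /usr/sbin/slurmctld and the core file path are placeholders.
    gdb -batch \
        -ex 'bt' \
        -ex 'thread apply all bt' \
        /usr/sbin/slurmctld /var/spool/slurmctld/core.12345 \
        > slurmctld-backtrace.txt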
Alejandro Sanchez:

Paul, could you please attach your slurm.conf? Thanks.

Paul Edmon:

Created attachment 2951 [details]
slurm.conf
Here it is. Just FYI, we are planning on upgrading to 15.08.9 on Monday. Let us know if that will be a problem.

Alejandro Sanchez:

Paul, just a few more questions. Does the crash happen every time someone specifies a job script such as the one in the first comment, or did it happen just once? Also, could you please attach your job_submit plugin? Thank you.

Paul Edmon:

It seems to be every time someone does:

    #SBATCH -N 1
    #SBATCH --contiguous

We've had a few failures due to that. It's reproducible in our environment at least. I will attach our job submit plugin in just a moment.

-Paul Edmon-

Created attachment 2966 [details]
job_submit.lua
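Based on Paul's description, a minimal reproducer might look like the sketch below. The job name, partition, time limit, and payload are placeholders; per the report, the only directives that matter are -N 1 combined with --contiguous:

    #!/bin/bash
    #SBATCH -J contig-repro   # placeholder job name
    #SBATCH -p general        # placeholder partition
    #SBATCH -N 1              # single node...
    #SBATCH --contiguous      # ...plus --contiguous: the combination reported to crash slurmctld
    #SBATCH -t 00:05:00       # placeholder time limit

    hostname                  # trivial payload; the crash occurs at scheduling time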
Not easy to track down, but you'll find a one-line fix here:

https://github.com/SchedMD/slurm/commit/47a07b546343efa38307d3b4c2fefeb5d8ddbef2

The fix will be in version 15.08.11 when released, likely in May. Please re-open the ticket if this doesn't fix the problem for you.
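After upgrading, one way to confirm the controller is running a release that contains the fix (15.08.11 or later) is to query its reported version:

    # Ask the controller for its configuration and pick out the version line.
    scontrol show config | grep -i 'SLURM_VERSION'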