| Summary: | Contiguous Crashes slurmctld | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 15.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 Clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.11 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments:
- Backtrace of core dump
- slurm.conf
- job_submit.lua
Description
Paul Edmon
2016-03-21 01:29:30 MDT
Tim:

Can you give us a backtrace from the core dump?

    (gdb) bt
    (gdb) thread apply all bt

Paul Edmon:

So the scheduler hung up again. Here is the job:

    [root@holy-slurm01 ~]# scontrol -dd show job 58831633
    JobId=58831633 JobName=Femig1
       UserId=sooran(5699492) GroupId=li_lab_seas(403252)
       Priority=0 Nice=0 Account=li_lab_seas QOS=normal
       JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0
       RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
       SubmitTime=2016-03-23T14:35:01 EligibleTime=2016-03-23T14:35:01
       StartTime=Unknown EndTime=Unknown
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=general AllocNode:Sid=rclogin06:3698
       ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
       NumNodes=1-1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=64,mem=12800,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=64 MinMemoryCPU=200M MinTmpDiskNode=0
       Features=(null) Gres=(null) Reservation=(null)
       Shared=OK Contiguous=1 Licenses=(null) Network=(null)
       Command=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/qvasp
       WorkDir=/n/regal/li_lab_seas/SK/vasp_t/test2/test3
       StdErr=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.e58831633
       StdIn=/dev/null
       StdOut=/n/regal/li_lab_seas/SK/vasp_t/test2/test3/run.o58831633
       Power= SICP=0
       BatchScript=
    #!/bin/bash
    #SBATCH -J Femig1
    #SBATCH -p general          # Queue name
    #SBATCH -N 1                # Total number of nodes requested (64 cores/node)
    #SBATCH -n 64               # Total number of mpi tasks requested
    #SBATCH -t 48:00:00         # Run time (hh:mm:ss)
    #SBATCH --mem-per-cpu=200   # Memory pool for all cores
    #SBATCH -o run.o%j          # Output and error file name (%j expands to jobID)
    #SBATCH -e run.e%j          # Output and error file name (%j expands to jobID)
    #SBATCH --mail-type=END     # Type of email notification: BEGIN, END, FAIL, ALL
    #SBATCH --mail-user=sooran@seas.harvard.edu  # Email to which notifications will be sent
    #SBATCH --contiguous

    module load intel/15.0.0-fasrc01 openmpi/1.10.2-fasrc01 intel-mkl/11.0.0.079-fasrc02
    mpirun -np $SLURM_NTASKS ~/bin/vasp > vasp.out

So it definitely has to do with the --contiguous flag and -N 1 at the same time.

-Paul Edmon-

Tim:

Have you had a chance to get a backtrace from the original core dump you mentioned, or the more recent one? I haven't managed to recreate this just yet here, and the backtrace would help isolate what I should be looking at.

- Tim

Paul Edmon:

Ah, my mistake. I thought I had. Give me a sec.

Created attachment 2936 [details]
Backtrace of core dump
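For reference, a backtrace like the one attached is typically captured along the lines of the sketch below. The binary and core paths are assumptions for illustration (the actual core location depends on the system's core_pattern), so adjust them before use:

    # Capture the crashing thread's backtrace and all threads' backtraces
    # from a slurmctld core dump, non-interactively.
    # /usr/sbin/slurmctld and the core file path are placeholders.
    gdb -batch \
        -ex 'bt' \
        -ex 'thread apply all bt' \
        /usr/sbin/slurmctld /var/spool/slurmctld/core.12345 \
        > slurmctld-backtrace.txt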
Alejandro Sanchez:

Paul, could you please attach your slurm.conf? Thanks.

Paul Edmon:

Created attachment 2951 [details]
slurm.conf
Here it is. Just FYI, we are planning on upgrading to 15.08.9 on Monday. Let us know if that will be a problem.

Alejandro Sanchez:

Paul, just a few more questions. Does the crash happen every time someone specifies a job script such as the one in the first comment, or did it happen just once? Also, could you please attach your job_submit plugin? Thank you.

Paul Edmon:

It seems to be every time someone does:

    #SBATCH -N 1
    #SBATCH --contiguous

We've had a few failures due to that. It's reproducible in our environment at least. I will attach our job submit plugin in just a moment.

-Paul Edmon-

Created attachment 2966 [details]
job_submit.lua
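Based on Paul's description, a minimal reproducer might look like the sketch below. The job name, partition, time limit, and payload are placeholders; per the report, the only directives that matter are -N 1 combined with --contiguous:

    #!/bin/bash
    #SBATCH -J contig-repro   # placeholder job name
    #SBATCH -p general        # placeholder partition
    #SBATCH -N 1              # single node...
    #SBATCH --contiguous      # ...plus --contiguous: the combination reported to crash slurmctld
    #SBATCH -t 00:05:00       # placeholder time limit

    hostname                  # trivial payload; the crash occurs at scheduling time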
Not easy to track down, but you'll find a one-line fix here:

https://github.com/SchedMD/slurm/commit/47a07b546343efa38307d3b4c2fefeb5d8ddbef2

The fix will be in version 15.08.11 when released, likely in May. Please re-open the ticket if this doesn't fix the problem for you.
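After upgrading, one way to confirm the controller is running a release that contains the fix (15.08.11 or later) is to query its reported version:

    # Ask the controller for its configuration and pick out the version line.
    scontrol show config | grep -i 'SLURM_VERSION'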