Ticket 1854 - Segfault
Summary: Segfault
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.7
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-08-11 08:53 MDT by Brian Christiansen
Modified: 2015-08-18 03:09 MDT
CC List: 3 users

See Also:
Site: Harvard University


Attachments
var_log_messages (25.87 KB, text/plain)
2015-08-11 09:33 MDT, Scott Yockel
Details
gdb_coredump (1.64 KB, text/plain)
2015-08-11 09:37 MDT, Scott Yockel
Details
var_log_messages.gz (1.78 KB, application/x-gzip)
2015-08-11 14:36 MDT, Scott Yockel
Details
Current slurm.conf (31.95 KB, text/plain)
2015-08-11 23:35 MDT, Scott Yockel
Details
slurmctld_prolog (2.84 KB, text/x-python)
2015-08-12 02:02 MDT, Scott Yockel
Details

Description Brian Christiansen 2015-08-11 08:53:00 MDT
Aug 11 14:59:20 holy-slurm01 slurmctld[36344]: find_node_record passed NULL name
Aug 11 14:59:20 holy-slurm01 kernel: slurmctld_sched[57340]: segfault at 68 ip 000000000046ad36 sp 00007f610b9e8c90 error 4 in slurmctld[400000+23f000]
Aug 11 15:00:02 holy-slurm01 purge-binlogs: Purging master logs to binlog.001792

Paul is out of commission at the moment, so I'm trying to get this back into service. I'm not as seasoned a SLURM admin as he is, so if there's any more info I can provide, I'm all ears.
Comment 1 Scott Yockel 2015-08-11 08:56:29 MDT
I think that we have found the issue:

Aug 11 16:39:45 holy-slurm01 slurmctld[60098]: error: select_g_select_nodeinfo_set(45712500): No error
Aug 11 16:39:45 holy-slurm01 slurmctld[60098]: sched: Allocate JobId=45712500 NodeList= #CPUs=8

That job is trying to run on a reservation. I'm cross-referencing the reservation to make sure that all the hosts it defines are still in slurm.conf.
Comment 2 Brian Christiansen 2015-08-11 08:57:38 MDT
Can you tell us a little more about what was happening at the time of the crash? Was the configuration changed? Was the slurmctld restarted? Can you attach the logs from the time it crashed?

Do you see a core dump? If so, can you send the backtrace from the core file?

ex.
gdb slurmctld <core_file>
bt
Comment 3 Scott Yockel 2015-08-11 09:33:18 MDT
Created attachment 2104 [details]
var_log_messages

We had this weirdness yesterday where a job was not getting a NodeList= set at the point of allocating the job. Here it is from yesterday:

Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: error: cons_res: _compute_c_b_task_dist invalid allocation for job 45639094
Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: error: cons_res: cr_dist: Error in _compute_c_b_task_dist
Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: error: Select plugin failed to set job resources, nodes
Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: error: job 45639094 has no job_resrcs info
Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: error: select_g_select_nodeinfo_set(45639094): No error
Aug  9 22:28:54 holy-slurm01 slurmctld[30502]: sched: Allocate JobId=45639094 NodeList= #CPUs=8

We have tracked it back to the reservation these jobs were trying to use, called kuang3, and removed it. I have verified that all the nodes in that reservation are in slurm.conf.

We hadn't made any changes to slurm.conf.
Comment 4 Scott Yockel 2015-08-11 09:37:37 MDT
Created attachment 2105 [details]
gdb_coredump

I'm not sure if I did that "gdb slurmctld coredump" correctly. We may have moved back to the binary compiled without debug symbols for performance reasons. I'll have to circle back with Paul tomorrow.
Comment 5 Brian Christiansen 2015-08-11 10:12:17 MDT
Is the slurmctld up and running? Or does it crash on startup?
Comment 6 Brian Christiansen 2015-08-11 10:19:21 MDT
It doesn't look like the backtrace (bt) got printed out from the core file. Did you type "bt", for backtrace, after loading the core file in gdb?
Comment 7 Scott Yockel 2015-08-11 10:35:44 MDT
Yes, slurmctld is up and running after the reservation removal. Before that it kept dying.

Comment 8 Scott Yockel 2015-08-11 10:48:01 MDT
Backtrace from GDB

Program terminated with signal 11, Segmentation fault.
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f940c140380, job_ptr=0x6e26e90, protocol_version=65534)
    at job_scheduler.c:1908
1908	job_scheduler.c: No such file or directory.
	in job_scheduler.c
Missing separate debuginfos, use: debuginfo-install slurm-14.11.7-1fasrc01.el6.x86_64
(gdb) bt
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f940c140380, job_ptr=0x6e26e90, protocol_version=65534)
    at job_scheduler.c:1908
#1  0x000000000046a6ed in build_launch_job_msg (job_ptr=0x6e26e90, protocol_version=65534) at job_scheduler.c:1787
#2  0x000000000046ac2f in launch_job (job_ptr=0x6e26e90) at job_scheduler.c:1868
#3  0x000000000046e3c1 in _run_prolog (arg=0x6e26e90) at job_scheduler.c:3242
#4  0x0000003185e07a51 in start_thread () from /lib64/libpthread.so.0
#5  0x00000031856e89ad in clone () from /lib64/libc.so.6
Comment 9 Brian Christiansen 2015-08-11 10:55:18 MDT
Thanks. We'll look into the backtrace.
Comment 10 Scott Yockel 2015-08-11 14:36:15 MDT
Created attachment 2108 [details]
var_log_messages.gz

Okay, so a job submitted to the kuang_hp partition just killed slurmctld again. I've updated the state of kuang_hp to DOWN, and done the same for all the nodes in that partition. The issue occurs at the point of allocation.

(gdb) bt
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f3fd8116ce0, job_ptr=0x650c230, protocol_version=65534) at job_scheduler.c:1908
#1  0x000000000046a6ed in build_launch_job_msg (job_ptr=0x650c230, protocol_version=65534) at job_scheduler.c:1787
#2  0x000000000046ac2f in launch_job (job_ptr=0x650c230) at job_scheduler.c:1868
#3  0x000000000046e3c1 in _run_prolog (arg=0x650c230) at job_scheduler.c:3242
#4  0x0000003185e07a51 in start_thread () from /lib64/libpthread.so.0
#5  0x00000031856e89ad in clone () from /lib64/libc.so.6
Comment 11 David Bigagli 2015-08-11 21:15:46 MDT
I think the original problem is still there. The core dump happens at this line

->cred_arg.job_hostlist = job_resrcs_ptr->nodes;

as indicated by the core dump. If you still have the core file could you
print the job_ptr data structure:

(gdb) frame 0
(gdb) print *job_ptr
(gdb) print *job_ptr->job_resrcs

Is there any reservation the jobs in kuang_hp are trying to use?
Could you also append the output of 'scontrol show part'?

Meanwhile we are tracing the code involving job_resrcs to, at a minimum, avoid the core dump.

David
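
For context: the faulting statement copies a member out of job_ptr->job_resrcs without first checking the pointer, so a NULL job_resrcs reads from a small offset above address zero, which is consistent with the kernel's "segfault at 68" line in the description. Below is a minimal, self-contained sketch of that pattern; the structs are hypothetical stand-ins, not the real Slurm types.

#include <stdio.h>

struct job_resources {                    /* stand-in for Slurm's job_resources_t */
    char *nodes;                          /* allocated node list, e.g. "hp0101" */
};

struct job_record {                       /* stand-in for Slurm's struct job_record */
    unsigned int job_id;
    struct job_resources *job_resrcs;     /* NULL when the allocation was never filled in */
};

int main(void)
{
    struct job_record job = { 45724543, NULL };  /* job_resrcs missing, as in the core */
    char *job_hostlist;

    /* Mirrors the crash site: the member is read unconditionally, so a NULL
     * job_resrcs dereferences a near-zero address and the process receives
     * SIGSEGV when this program is run. */
    job_hostlist = job.job_resrcs->nodes;

    printf("%s\n", job_hostlist);
    return 0;
}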
Comment 12 Scott Yockel 2015-08-11 23:30:06 MDT
(gdb) frame 0
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f86c0009b30, job_ptr=0x7757040, protocol_version=65534)
    at job_scheduler.c:1908
1908	in job_scheduler.c
(gdb) print *job_ptr
$1 = {account = 0x7756f60 "kuang_lab", alias_list = 0x0, alloc_node = 0x7756f30 "rclogin07", alloc_resp_port = 0, alloc_sid = 25208, 
  array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 3897, assoc_ptr = 0x2319230, batch_flag = 1, 
  batch_host = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 8, cr_enabled = 1, db_index = 4294967294, 
  derived_ec = 0, details = 0x7756d70, direct_set_prio = 0, end_time = 1441503147, epilog_running = false, exit_code = 0, 
  front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7f8550014b20 "", gres_req = 0x7f85500090e0 "", gres_used = 0x0, 
  group_id = 34720, job_id = 45724543, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, 
  kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, 
  limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, 
  mail_user = 0x7756f90 "kuang@fas.harvard.edu", magic = 4038539564, name = 0x7756f10 "p.101.0", network = 0x0, next_step_id = 0, 
  nodes = 0x7f8550015000 "", node_addr = 0x7f8550015b80, node_bitmap = 0x7f855001bb50, node_bitmap_cg = 0x0, node_cnt = 0, 
  node_cnt_wag = 0, nodes_completing = 0x0, other_port = 0, partition = 0x7756ee0 "kuang_hp", part_ptr_list = 0x0, 
  part_nodes_missing = false, part_ptr = 0x244f760, pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 100000259, 
  priority_array = 0x0, prio_factors = 0x7756d20, profile = 0, qos_id = 1, qos_ptr = 0x2162af0, reboot = 0 '\000', restart_cnt = 0, 
  resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, 
  select_jobinfo = 0x7756cc0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7168, start_time = 1439343147, 
  state_desc = 0x0, state_reason = 0, step_list = 0x772f3b0, suspend_time = 0, time_last_active = 1439343147, time_limit = 36000, 
  time_min = 0, tot_sus_time = 0, total_cpus = 8, total_nodes = 0, user_id = 34720, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, 
  warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->job_resrcs
Cannot access memory at address 0x0
Comment 13 Scott Yockel 2015-08-11 23:33:40 MDT
There was a reservation, kuang3, on the kuang_hp partition. We have since removed this reservation.

# scontrol show res kuang3
ReservationName=kuang3 StartTime=2015-07-30T09:00:00 EndTime=2015-08-27T09:00:00 Duration=28-00:00:00
  Nodes=hp[0101-0102,0104,0201,0203-0204,0301-0303,0401-0404,0601-0604,0701,0703-0704,0801-0804,0901-0904,1001-1004,1101,1103-1104,1202,1301-1304,1401-1402,1502-1504,1603-1604,1701-1704,1801-1804,1901-1904,2001,2003,2101-2103] NodeCnt=64 CoreCnt=768 Features=(null) PartitionName=kuang_hp Flags=
  Users=kuang Accounts=(null) Licenses=(null) State=ACTIVE


[root@holy-slurm01 ccpp-2015-08-11-21:32:32-4050]# scontrol show part
PartitionName=airoldi
   AllowGroups=airoldi_lab,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=airoldi[02-07,09-12]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=120 TotalNodes=10 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-guzik
   AllowGroups=aspuru-guzik_lab,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=7-00:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],aag0[7-9],aag1[0-6]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=3712 TotalNodes=58 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-samsung
   AllowGroups=slurm_group_aspuru-samsung,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=aag0[1-6]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-samsung-gpu
   AllowGroups=slurm_group_aspuru-samsung,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=aaggpu0[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=bertoldi
   AllowGroups=rc_admin,bertoldi_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=bertoldi01
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=betley
   AllowGroups=rc_admin,betley_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyconroy05
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=bigmem
   AllowGroups=rc_admin,slurm_group_bigmem AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holybigmem0[1-8]
   Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=conroy
   AllowGroups=rc_admin,conroy_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1420[1-8],holy2a1430[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=conte
   AllowGroups=rc_admin,zhuang_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2b0930[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=davis
   AllowGroups=rc_admin,davis_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=davis0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=dkelley
   AllowGroups=rc_admin,rinn_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dkelley01
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=dli
   AllowGroups=rc_admin,li_hbs_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyhbs01
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=eddy
   AllowGroups=rc_admin,eddy_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a0110[7-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=eldorado
   AllowGroups=rc_admin,aspuru-guzik_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=480 TotalNodes=40 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=evans
   AllowGroups=evans_lab,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dae1[1-4],dae2[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=general
   AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-2],holy2a1110[1-6],holyconroy06
   Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=11840 TotalNodes=185 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=giribet
   AllowGroups=rc_admin,giribet_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=giribet0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpgpu
   AllowGroups=rc_admin,aspuru-guzik_lab,greenhill_lab,computefestgpu,pfister_lab,slurm_group_gpgpu,slurm_group_aspuru-samsung AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holygpu[01-16]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=512 TotalNodes=16 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpu
   AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=supermicgpu01
   Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hbs_pilot
   AllowGroups=rc_admin,hbs_pilot AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyhbs03
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hernquist
   AllowGroups=rc_admin,hernquist_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-7]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=4032 TotalNodes=63 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hernquist-dev
   AllowGroups=rc_admin,hernquist_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a24208
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hips
   AllowGroups=rc_admin,adams_lab_seas AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=adams0[1-7]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=224 TotalNodes=7 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holygiribet
   AllowGroups=rc_admin,giribet_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holygiribet0[1-6]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holyhoekstra
   AllowGroups=rc_admin,hoekstra_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyhoekstra0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hoekstra
   AllowGroups=rc_admin,hoekstra_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hoekstrafs1,hoekstrafs2
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=32 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holman
   AllowGroups=rc_admin,holman_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holmanfs1
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hsph
   AllowGroups=rc_admin,hsph_bioinfo,slurm_group_hsph AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hsph0[5-6]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=seas_iacs
   AllowGroups=rc_admin,seas_iacs AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a0930[7-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=informatics-dev
   AllowGroups=rc_admin,sequencing AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=sandy-rc0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=interact
   AllowGroups=cluster_users,rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1830[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=irizarry
   AllowGroups=rc_admin,irizarry_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=irizarry0[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=itc_cluster
   AllowGroups=rc_admin,itc_lab,kovac_lab,slurm_group_itc AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=960 TotalNodes=15 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jacob
   AllowGroups=rc_admin,jacob_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=18:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1120[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerCPU=4608

PartitionName=jacobsen
   AllowGroups=rc_admin,jacobsen_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=enj[01-09,12]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=240 TotalNodes=10 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jacobsen_amd
   AllowGroups=rc_admin,jacobsen_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1410[5-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jenny
   AllowGroups=rc_admin,rice_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=jenny0[2,4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=karplus
   AllowGroups=rc_admin,karplus_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=karplus0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kaxiras
   AllowGroups=rc_admin,kaxiras_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1120[5-8],holy2a1310[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=768 TotalNodes=12 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kou
   AllowGroups=rc_admin,kou AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=240 TotalNodes=20 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang
   AllowGroups=rc_admin,kuang_lab,tziperman_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0710[1-8],holy2b0920[2-8],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=4032 TotalNodes=63 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang_bigmem
   AllowGroups=rc_admin,kuang_lab,tziperman_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2b0720[1-8],holy2b0910[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang_hp
   AllowGroups=rc_admin,kuang_lab,tziperman_lab,stewart_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1020 TotalNodes=85 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=leroy
   AllowGroups=rc_admin,leroy_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=leroy0[1-2,4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=36 TotalNodes=3 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=meade
   AllowGroups=rc_admin,meade_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1320[1-8],holy2a1330[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=midas
   AllowGroups=rc_admin,lipsitch_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=midas0[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=mitrovica
   AllowGroups=rc_admin,mitrovica_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1620[1-8],holy2a1710[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=moorcroft_amd
   AllowGroups=rc_admin,moorcroft_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holymoorcroft0[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=moorcroft_6100
   AllowGroups=rc_admin,moorcroft_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=moorcroft[01-16,18-29,31-39]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=444 TotalNodes=37 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ncf
   AllowGroups=rc_admin,luk_lab,ncfuser,tkadmin,cnl,nrg,anl,mcl,scn,vsl,hooley_lab,xnat,snp,sml,cnp,vcn,ncfadmin_group,mclaughlin_lab,sheridan_lab,ncf_users,pascual-leone,jwb,mrimgmt,holt_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=ncf270[1-7]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=448 TotalNodes=7 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=nelson
   AllowGroups=rc_admin,nelson_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nelson0[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ngwe
   AllowGroups=rc_admin,ngwe_hbs_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyhbs02
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ni_lab
   AllowGroups=rc_admin,ni_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1410[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=pierce
   AllowGroups=rc_admin,pierce_lab,slurm_group_pierce AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2b09303,holy2b09201
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=pmage
   AllowGroups=rc_admin,slurm_group_pmage AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=pmage1
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=priority
   AllowGroups=rc_admin AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=aag0[1-9],aag1[0-6],aaggpu0[1-8],adams0[1-7],airoldi[02-07,09-12],bertoldi01,dae1[1-4],dae2[1-4],davis0[1-4],dkelley01,eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2],enj[01-09,12],giribet0[1-4],holyhbs0[1-3],holyconroy0[1-6],holygiribet0[1-6],hoekstrafs1,hoekstrafs2,holmanfs1,holyhoekstra0[1-4],holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-8],holy2a1110[1-8],holy2a1120[1-8],holy2a1310[1-8],holy2a1320[1-8],holy2a1330[1-8],holy2a1410[1-8],holy2a1420[1-8],holy2a1430[1-8],holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8],holy2a1620[1-8],holy2a1710[1-8],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8],holy2a1820[1-8],holy2a1830[1-8],holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-8],holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0710[1-8],holy2b0720[1-8],holy2b0910[1-8],holy2b0920[1-8],holy2b0930[1-3],holybigmem0[1-8],holygpu[01-16],holymoorcroft0[1-8],holyseasgpu[01-13],holystat0[1-9],holystat1[0-9],holystat2[0-2],hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[2-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702,hsph0[5-6],irizarry0[1-2],itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111,jenny0[2,4],karplus0[1-4],kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4],leroy0[1-2,4],midas0[1-2],mvogels[01-32],moorcroft[01-16,18-29,31-39],ncf270[1-7],nelson0[1-2],regal[01-18],sandy-rc0[1-4],seasgpu0[1-9],seasgpu1[0-5],shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50,shock0[1-4,6-7],shock12,supermicgpu01,wofsy01[1-4],wofsy02[1-3],xie01,zorana0[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=46578 TotalNodes=1014 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=regal
   AllowGroups=rc_admin,slurm_group_regal,hepl AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=regal[01-18]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=288 TotalNodes=18 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=resonance
   AllowGroups=rc_admin,resonance AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=seasgpu0[1-9],seasgpu1[0-5]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=90 TotalNodes=15 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holyseasgpu
   AllowGroups=rc_admin,seas,computefestgpu AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyseasgpu[01-13]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=624 TotalNodes=13 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=serial_requeue
   AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=aag0[1-9],aag1[0-6],aaggpu0[1-8],adams0[1-7],airoldi[02-07,09-12],bertoldi01,dae1[1-4],dae2[1-4],davis0[1-4],eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2],enj[01-09,12],giribet0[1-4],holyconroy0[1-6],holygiribet0[1-6],holyhoekstra0[1-4],holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-8],holy2a1110[1-8],holy2a1120[1-8],holy2a1310[1-8],holy2a1320[1-8],holy2a1330[1-8],holy2a1410[1-8],holy2a1420[1-8],holy2a1430[1-8],holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8],holy2a1620[1-8],holy2a1710[1-8],holy2a1820[1-8],holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-7],holy2b0720[1-8],holy2b0910[1-8],holy2b0930[1-2],holybigmem0[1-8],holygpu[01-16],holymoorcroft0[1-8],holyseasgpu[01-13],holystat0[1-9],holystat1[0-9],holystat2[0-2],hsph0[5-6],irizarry0[1-2],jenny0[2,4],karplus0[1-4],itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111,kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4],leroy0[1-2,4],midas0[1-2],moorcroft[01-16,18-29,31-39],mvogels[01-32],nelson0[1-2],regal[01-18],sandy-rc0[1-4],shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50,shock0[1-4,6-7],shock12,supermicgpu01,wofsy01[1-4],wofsy02[1-3],xie01,zorana0[1-2],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8],holy2b0710[1-8],holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0920[1-8],hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[2-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=45088 TotalNodes=975 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shakhnovich
   AllowGroups=rc_admin,shakhnovich_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1536 TotalNodes=24 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shakgpu
   AllowGroups=rc_admin,shakhnovich_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=1600 TotalNodes=50 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shock
   AllowGroups=rc_admin,stewart_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=shock0[1-4,6-7],shock12
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=60 TotalNodes=7 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=spierce
   AllowGroups=rc_admin,spierce_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holyconroy0[1-4]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=stats
   AllowGroups=rc_admin,airoldi_lab,rubin_lab,bornn_lab,liu,miratrix_lab,stat115,stat221,slurm_group_stats AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holystat0[1-9],holystat1[0-9],holystat2[0-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=336 TotalNodes=22 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=unrestricted
   AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a1820[1-8]
   Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=vogelsberger
   AllowGroups=rc_admin,vogelsberger_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=mvogels[01-32]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=2048 TotalNodes=32 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=wofsy
   AllowGroups=rc_admin,wofsy_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=wofsy01[1-4],wofsy02[1-3]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=84 TotalNodes=7 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=wolkovich
   AllowGroups=rc_admin,wolkovich_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=holy2a0930[3-6]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=xie
   AllowGroups=rc_admin,xie_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=xie01
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=zorana
   AllowGroups=rc_admin,brenner_lab AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=zorana0[1-2]
   Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Comment 14 Scott Yockel 2015-08-11 23:35:04 MDT
Created attachment 2110 [details]
Current slurm.conf
Comment 15 David Bigagli 2015-08-11 23:38:45 MDT
Thanks for the data. This is the cause of the core dump:

(gdb) print *job_ptr->job_resrcs
Cannot access memory at address 0x0

Now we have to try to reproduce it to find out the sequence of events leading to this.

David
Comment 16 Scott Yockel 2015-08-12 00:28:23 MDT
David,

Thanks for the info.  So do you think this is tied to a node in the kuang_hp partition?  That is where we kept having issues.

~Scott

Comment 17 David Bigagli 2015-08-12 00:49:09 MDT
The job that caused the core dump is in kuang_hp; that's empirical evidence.
Could you also print from the core file:

(gdb)print job_ptr->part_ptr
(gdb)print job_ptr->qos_ptr

Besides the job_resrcs being NULL, the job looks pretty normal to me.

Thanks, David
Comment 18 David Bigagli 2015-08-12 01:21:43 MDT
Another strange thing is that the job does not have a batch host set, as if something set batch_host to NULL. The job is being started after your PrologSlurmctld is executed. What does the prolog do? Would it be possible to
run without it for a while?

David
Comment 19 Scott Yockel 2015-08-12 02:02:29 MDT
Created attachment 2111 [details]
slurmctld_prolog

From slurm.conf:
#Prolog=/usr/local/bin/slurm_prolog
PrologSlurmctld=/usr/local/sbin/slurmctld_prolog

Looks like we don't use a job-level prolog. I'm attaching the slurmctld_prolog.
Comment 20 Scott Yockel 2015-08-12 02:04:42 MDT
(gdb) print job_ptr->part_ptr
$1 = (struct part_record *) 0x244f760
(gdb) print job_ptr->qos_ptr
$2 = (void *) 0x2162af0
Comment 21 David Bigagli 2015-08-12 02:21:33 MDT
Sorry, I made a mistake; I meant:

(gdb)print * job_ptr->part_ptr

The reason I was asking about PrologSlurmctld is that there are two code paths that start a job, depending on whether this parameter is configured. When it is configured, the thread that runs the prolog starts the job; otherwise the main slurmctld background thread does. I don't know if this is relevant to what we are seeing, but it would be worth a try.
Can you disable PrologSlurmctld for a while and then re-enable the kuang_hp partition? You can even add the reservation back as before.

David
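
Below is a deliberately simplified model of the two launch paths described above; the function and thread structure are illustrative stand-ins, not the real slurmctld symbols. With PrologSlurmctld configured, the per-job prolog thread launches the job after the prolog completes (compare the _run_prolog -> launch_job frames in the backtraces above); without it, the main scheduling thread launches the job directly.

/* Build with: cc -pthread two_paths.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct job {
    unsigned int job_id;
    const char  *nodes;   /* stands in for the job's allocation */
};

/* Illustrative stand-in for launch_job(): in the real code this is where the
 * batch credential is built from job_resrcs. */
static void launch_job(struct job *j)
{
    printf("launching job %u on %s\n", j->job_id, j->nodes ? j->nodes : "(none)");
}

/* Path 1: PrologSlurmctld is configured, so a per-job thread runs the prolog
 * and then launches the job itself. */
static void *prolog_thread(void *arg)
{
    struct job *j = arg;
    /* ... the PrologSlurmctld script would run here ... */
    launch_job(j);
    return NULL;
}

int main(void)
{
    struct job j = { 45724543, "hp0101" };
    bool prolog_slurmctld_configured = true;   /* flip to model the other path */

    if (prolog_slurmctld_configured) {
        pthread_t tid;
        pthread_create(&tid, NULL, prolog_thread, &j);
        pthread_join(tid, NULL);
    } else {
        /* Path 2: no PrologSlurmctld, so the main scheduling thread launches
         * the job directly. */
        launch_job(&j);
    }
    return 0;
}

Disabling PrologSlurmctld, as suggested, exercises only the second path, which is what makes it a useful isolation test.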
Comment 22 Scott Yockel 2015-08-12 02:27:33 MDT
(gdb) print * job_ptr->part_ptr
$1 = {allow_accounts = 0x0, allow_account_array = 0x0, allow_alloc_nodes = 0x0, allow_groups = 0x2443cc0 "rc_admin,kuang_lab,tziperman_lab,stewart_lab,slurm_group_kuang", allow_uids = 0x7e97ac0, 
  allow_qos = 0x0, allow_qos_bitstr = 0x0, alternate = 0x0, def_mem_per_cpu = 0, default_time = 10, deny_accounts = 0x0, deny_account_array = 0x0, deny_qos = 0x0, deny_qos_bitstr = 0x0, flags = 0, 
  grace_time = 0, magic = 0, max_cpus_per_node = 4294967295, max_mem_per_cpu = 0, max_nodes = 4294967295, max_nodes_orig = 4294967295, max_offset = 0, max_share = 1, max_time = 4294967295, 
  min_nodes = 1, min_offset = 0, min_nodes_orig = 1, name = 0x244f850 "kuang_hp", node_bitmap = 0x776cb70, 
  nodes = 0x244f880 "hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],hp190[1-4"..., norm_priority = 1, preempt_mode = 65534, priority = 10, state_up = 3, total_nodes = 85, total_cpus = 1020, cr_type = 0}
Comment 23 Scott Yockel 2015-08-12 03:53:39 MDT
(In reply to David Bigagli from comment #21)

We have disabled PrologSlurmctld and enabled the kuang_hp partition.
Comment 24 David Bigagli 2015-08-12 22:02:27 MDT
No core dump in the past ~24h?

David
Comment 25 Scott Yockel 2015-08-13 01:54:52 MDT
Nope.  It has been smooth sailing.

Comment 26 David Bigagli 2015-08-13 02:28:36 MDT
There is probably a race condition somewhere causing this. I will provide a fix to prevent the core dump for now.

David
Comment 27 David Bigagli 2015-08-13 02:50:43 MDT
In 14.11.8, commit 2d8d92aab90a892, we have already introduced code to prevent a core dump should this problem happen. Upgrading will improve slurmctld stability should this happen again.

David
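
The commit itself is not reproduced here; the following is only a sketch of the kind of guard being described, with hypothetical stand-in types, showing the idea of failing the launch with a logged error instead of dereferencing a NULL job_resrcs.

#include <stdio.h>

struct job_resources { char *nodes; };    /* stand-in types, not the real Slurm ones */
struct job_record    { unsigned int job_id; struct job_resources *job_resrcs; };

/* Sketch of a defensive check at the credential-building step; illustrative
 * only, not the code from commit 2d8d92aab90a892. */
static int make_batch_job_cred_sketch(struct job_record *job_ptr, char **hostlist_out)
{
    if (job_ptr->job_resrcs == NULL || job_ptr->job_resrcs->nodes == NULL) {
        fprintf(stderr, "error: job %u missing job_resrcs info\n", job_ptr->job_id);
        return -1;                        /* caller aborts the launch gracefully */
    }
    *hostlist_out = job_ptr->job_resrcs->nodes;
    return 0;
}

int main(void)
{
    struct job_record job = { 45724543, NULL };
    char *hostlist = NULL;

    if (make_batch_job_cred_sketch(&job, &hostlist) != 0)
        return 1;                         /* the daemon stays up; only this launch fails */
    printf("%s\n", hostlist);
    return 0;
}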
Comment 28 Scott Yockel 2015-08-13 07:41:41 MDT
Okay, great. We will do an upgrade in the morning and also put back the slurmctld_prolog.
Comment 29 David Bigagli 2015-08-18 03:09:49 MDT
Hi, I saw you have upgraded. Please reopen this ticket should you see the error message in the log file that states: job xyz missing job_resrcs info.

David