Aug 11 14:59:20 holy-slurm01 slurmctld[36344]: find_node_record passed NULL name
Aug 11 14:59:20 holy-slurm01 kernel: slurmctld_sched[57340]: segfault at 68 ip 000000000046ad36 sp 00007f610b9e8c90 error 4 in slurmctld[400000+23f000]
Aug 11 15:00:02 holy-slurm01 purge-binlogs: Purging master logs to binlog.001792

Paul is out of commission at the moment, so I'm trying to get this back into service. I'm not as seasoned a SLURM admin as he is. If there's any more info I can provide, I'm all ears.
I think we have found the issue:

Aug 11 16:39:45 holy-slurm01 slurmctld[60098]: error: select_g_select_nodeinfo_set(45712500): No error
Aug 11 16:39:45 holy-slurm01 slurmctld[60098]: sched: Allocate JobId=45712500 NodeList= #CPUs=8

That job is trying to run on a reservation. I'm cross-referencing the reservation to make sure that all the hosts it defines are still in slurm.conf.
Can you tell us a little more about what was happening at the time of the crash? Was the configuration changed? Was the slurmctld restarted? Can you attach the logs from the time it crashed? Do you see a core dump? If so, can you send the backtrace from the core file? For example:

gdb slurmctld <core_file>
(gdb) bt
Created attachment 2104 [details]
var_log_messages

We had this weirdness yesterday where a job was not getting a NodeList= set at the point of allocating the job. Here it is from yesterday:

Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: error: cons_res: _compute_c_b_task_dist invalid allocation for job 45639094
Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: error: cons_res: cr_dist: Error in _compute_c_b_task_dist
Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: error: Select plugin failed to set job resources, nodes
Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: error: job 45639094 has no job_resrcs info
Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: error: select_g_select_nodeinfo_set(45639094): No error
Aug 9 22:28:54 holy-slurm01 slurmctld[30502]: sched: Allocate JobId=45639094 NodeList= #CPUs=8

We have tracked it back to the reservation these jobs were trying to use, called kuang3, and removed it. I've verified that all the nodes in that reservation are in slurm.conf. We hadn't made any changes to slurm.conf.
Created attachment 2105 [details]
gdb_coredump

I'm not sure I ran "gdb slurmctld coredump" correctly. We may have moved back to the version compiled without debug symbols for performance reasons. I'll have to circle back with Paul tomorrow.
Is the slurmctld up and running? Or does it crash on startup?
It doesn't look like the backtrace (bt) got printed out from the core file. Did you type "bt", for backtrace, after loading the core file in gdb?
Yes, slurmctld is up and running after the reservation removal. Before that it kept dying.
Backtrace from GDB:

Program terminated with signal 11, Segmentation fault.
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f940c140380, job_ptr=0x6e26e90, protocol_version=65534) at job_scheduler.c:1908
1908    job_scheduler.c: No such file or directory.
        in job_scheduler.c
Missing separate debuginfos, use: debuginfo-install slurm-14.11.7-1fasrc01.el6.x86_64
(gdb) bt
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f940c140380, job_ptr=0x6e26e90, protocol_version=65534) at job_scheduler.c:1908
#1  0x000000000046a6ed in build_launch_job_msg (job_ptr=0x6e26e90, protocol_version=65534) at job_scheduler.c:1787
#2  0x000000000046ac2f in launch_job (job_ptr=0x6e26e90) at job_scheduler.c:1868
#3  0x000000000046e3c1 in _run_prolog (arg=0x6e26e90) at job_scheduler.c:3242
#4  0x0000003185e07a51 in start_thread () from /lib64/libpthread.so.0
#5  0x00000031856e89ad in clone () from /lib64/libc.so.6
Thanks. We'll look into the backtrace.
Created attachment 2108 [details]
var_log_messages.gz

Okay, so a job to the partition kuang_hp just killed slurmctld again. I've updated the state of kuang_hp to DOWN, and done the same for all the nodes in that partition. It is at the point of allocation that this issue occurs.

(gdb) bt
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f3fd8116ce0, job_ptr=0x650c230, protocol_version=65534) at job_scheduler.c:1908
#1  0x000000000046a6ed in build_launch_job_msg (job_ptr=0x650c230, protocol_version=65534) at job_scheduler.c:1787
#2  0x000000000046ac2f in launch_job (job_ptr=0x650c230) at job_scheduler.c:1868
#3  0x000000000046e3c1 in _run_prolog (arg=0x650c230) at job_scheduler.c:3242
#4  0x0000003185e07a51 in start_thread () from /lib64/libpthread.so.0
#5  0x00000031856e89ad in clone () from /lib64/libc.so.6
I think the original problem is still there. As the backtrace indicates, the core dump happens at this line:

cred_arg.job_hostlist = job_resrcs_ptr->nodes;

If you still have the core file, could you print the job_ptr data structure:

(gdb) frame 0
(gdb) print *job_ptr
(gdb) print *job_ptr->job_resrcs

Is there any reservation the jobs in kuang_hp are trying to use? Could you also append the output of 'scontrol show part'? Meanwhile we are tracing the code involving job_resrcs to prevent the core dump at minimum.

David
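For illustration, here is a minimal sketch (simplified, assumed structures; not the actual SLURM source) of why that assignment faults when the job has no job_resrcs. Reading a member at a small offset from a NULL base pointer also fits the kernel's earlier "segfault at 68" report:

#include <stddef.h>

/* Simplified stand-ins for the real SLURM structures (hypothetical). */
struct job_resources { char *nodes; };
struct job_record    { struct job_resources *job_resrcs; };
struct cred_arg      { char *job_hostlist; };

/* The faulting pattern: job_resrcs is used without a NULL check, so a
 * job whose allocation failed dereferences NULL plus the offset of
 * the 'nodes' member. */
static void fill_cred(struct job_record *job_ptr, struct cred_arg *cred)
{
    struct job_resources *job_resrcs_ptr = job_ptr->job_resrcs;
    cred->job_hostlist = job_resrcs_ptr->nodes;  /* SIGSEGV if NULL */
}

int main(void)
{
    struct job_record job = { .job_resrcs = NULL };
    struct cred_arg cred;
    fill_cred(&job, &cred);                      /* crashes here */
    return 0;
}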
(gdb) frame 0
#0  0x000000000046ad36 in make_batch_job_cred (launch_msg_ptr=0x7f86c0009b30, job_ptr=0x7757040, protocol_version=65534) at job_scheduler.c:1908
1908    in job_scheduler.c
(gdb) print *job_ptr
$1 = {account = 0x7756f60 "kuang_lab", alias_list = 0x0, alloc_node = 0x7756f30 "rclogin07", alloc_resp_port = 0, alloc_sid = 25208, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 3897, assoc_ptr = 0x2319230, batch_flag = 1, batch_host = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 8, cr_enabled = 1, db_index = 4294967294, derived_ec = 0, details = 0x7756d70, direct_set_prio = 0, end_time = 1441503147, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7f8550014b20 "", gres_req = 0x7f85500090e0 "", gres_used = 0x0, group_id = 34720, job_id = 45724543, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x7756f90 "kuang@fas.harvard.edu", magic = 4038539564, name = 0x7756f10 "p.101.0", network = 0x0, next_step_id = 0, nodes = 0x7f8550015000 "", node_addr = 0x7f8550015b80, node_bitmap = 0x7f855001bb50, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 0, nodes_completing = 0x0, other_port = 0, partition = 0x7756ee0 "kuang_hp", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x244f760, pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 100000259, priority_array = 0x0, prio_factors = 0x7756d20, profile = 0, qos_id = 1, qos_ptr = 0x2162af0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7756cc0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7168, start_time = 1439343147, state_desc = 0x0, state_reason = 0, step_list = 0x772f3b0, suspend_time = 0, time_last_active = 1439343147, time_limit = 36000, time_min = 0, tot_sus_time = 0, total_cpus = 8, total_nodes = 0, user_id = 34720, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->job_resrcs
Cannot access memory at address 0x0
There was a reservation kuang3 that was in the kuang_hp set. We have since removed this reservation.

# scontrol show res kuang3
ReservationName=kuang3 StartTime=2015-07-30T09:00:00 EndTime=2015-08-27T09:00:00 Duration=28-00:00:00
   Nodes=hp[0101-0102,0104,0201,0203-0204,0301-0303,0401-0404,0601-0604,0701,0703-0704,0801-0804,0901-0904,1001-1004,1101,1103-1104,1202,1301-1304,1401-1402,1502-1504,1603-1604,1701-1704,1801-1804,1901-1904,2001,2003,2101-2103] NodeCnt=64 CoreCnt=768 Features=(null) PartitionName=kuang_hp Flags=
   Users=kuang Accounts=(null) Licenses=(null) State=ACTIVE

[root@holy-slurm01 ccpp-2015-08-11-21:32:32-4050]# scontrol show part
PartitionName=airoldi AllowGroups=airoldi_lab,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=airoldi[02-07,09-12] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=120 TotalNodes=10 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-guzik AllowGroups=aspuru-guzik_lab,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=7-00:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],aag0[7-9],aag1[0-6] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=3712 TotalNodes=58 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-samsung AllowGroups=slurm_group_aspuru-samsung,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=aag0[1-6] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=aspuru-samsung-gpu AllowGroups=slurm_group_aspuru-samsung,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=aaggpu0[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=bertoldi AllowGroups=rc_admin,bertoldi_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=bertoldi01 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=betley AllowGroups=rc_admin,betley_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyconroy05 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=bigmem AllowGroups=rc_admin,slurm_group_bigmem AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holybigmem0[1-8] Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=conroy AllowGroups=rc_admin,conroy_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1420[1-8],holy2a1430[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=conte AllowGroups=rc_admin,zhuang_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2b0930[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=davis AllowGroups=rc_admin,davis_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=davis0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=dkelley AllowGroups=rc_admin,rinn_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=dkelley01 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=dli AllowGroups=rc_admin,li_hbs_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyhbs01 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=eddy AllowGroups=rc_admin,eddy_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a0110[7-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=eldorado AllowGroups=rc_admin,aspuru-guzik_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=480 TotalNodes=40 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=evans AllowGroups=evans_lab,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=dae1[1-4],dae2[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=general AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-2],holy2a1110[1-6],holyconroy06 Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=11840 TotalNodes=185 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=giribet AllowGroups=rc_admin,giribet_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=giribet0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpgpu AllowGroups=rc_admin,aspuru-guzik_lab,greenhill_lab,computefestgpu,pfister_lab,slurm_group_gpgpu,slurm_group_aspuru-samsung AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holygpu[01-16] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=512 TotalNodes=16 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpu AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=supermicgpu01 Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hbs_pilot AllowGroups=rc_admin,hbs_pilot AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyhbs03 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hernquist AllowGroups=rc_admin,hernquist_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-7] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=4032 TotalNodes=63 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hernquist-dev AllowGroups=rc_admin,hernquist_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a24208 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hips AllowGroups=rc_admin,adams_lab_seas AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=adams0[1-7] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=224 TotalNodes=7 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holygiribet AllowGroups=rc_admin,giribet_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holygiribet0[1-6] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holyhoekstra AllowGroups=rc_admin,hoekstra_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyhoekstra0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hoekstra AllowGroups=rc_admin,hoekstra_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=hoekstrafs1,hoekstrafs2 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=32 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holman AllowGroups=rc_admin,holman_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holmanfs1 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hsph AllowGroups=rc_admin,hsph_bioinfo,slurm_group_hsph AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=hsph0[5-6] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=seas_iacs AllowGroups=rc_admin,seas_iacs AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a0930[7-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=informatics-dev AllowGroups=rc_admin,sequencing AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=sandy-rc0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=interact AllowGroups=cluster_users,rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1830[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=irizarry AllowGroups=rc_admin,irizarry_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=irizarry0[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=itc_cluster AllowGroups=rc_admin,itc_lab,kovac_lab,slurm_group_itc AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=960 TotalNodes=15 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jacob AllowGroups=rc_admin,jacob_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=18:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1120[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerCPU=4608

PartitionName=jacobsen AllowGroups=rc_admin,jacobsen_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=enj[01-09,12] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=240 TotalNodes=10 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jacobsen_amd AllowGroups=rc_admin,jacobsen_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1410[5-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=jenny AllowGroups=rc_admin,rice_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=jenny0[2,4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=karplus AllowGroups=rc_admin,karplus_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=karplus0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kaxiras AllowGroups=rc_admin,kaxiras_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1120[5-8],holy2a1310[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=768 TotalNodes=12 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kou AllowGroups=rc_admin,kou AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=240 TotalNodes=20 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang AllowGroups=rc_admin,kuang_lab,tziperman_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0710[1-8],holy2b0920[2-8],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=4032 TotalNodes=63 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang_bigmem AllowGroups=rc_admin,kuang_lab,tziperman_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2b0720[1-8],holy2b0910[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=kuang_hp AllowGroups=rc_admin,kuang_lab,tziperman_lab,stewart_lab,slurm_group_kuang AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1020 TotalNodes=85 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=leroy AllowGroups=rc_admin,leroy_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=leroy0[1-2,4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=36 TotalNodes=3 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=meade AllowGroups=rc_admin,meade_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1320[1-8],holy2a1330[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=midas AllowGroups=rc_admin,lipsitch_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=midas0[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=mitrovica AllowGroups=rc_admin,mitrovica_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1620[1-8],holy2a1710[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1024 TotalNodes=16 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=moorcroft_amd AllowGroups=rc_admin,moorcroft_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holymoorcroft0[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=moorcroft_6100 AllowGroups=rc_admin,moorcroft_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=moorcroft[01-16,18-29,31-39] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=444 TotalNodes=37 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ncf AllowGroups=rc_admin,luk_lab,ncfuser,tkadmin,cnl,nrg,anl,mcl,scn,vsl,hooley_lab,xnat,snp,sml,cnp,vcn,ncfadmin_group,mclaughlin_lab,sheridan_lab,ncf_users,pascual-leone,jwb,mrimgmt,holt_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=ncf270[1-7] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=448 TotalNodes=7 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=nelson AllowGroups=rc_admin,nelson_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nelson0[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ngwe AllowGroups=rc_admin,ngwe_hbs_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyhbs02 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=ni_lab AllowGroups=rc_admin,ni_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1410[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=pierce AllowGroups=rc_admin,pierce_lab,slurm_group_pierce AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2b09303,holy2b09201 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=pmage AllowGroups=rc_admin,slurm_group_pmage AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=pmage1 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=priority AllowGroups=rc_admin AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=aag0[1-9],aag1[0-6],aaggpu0[1-8],adams0[1-7],airoldi[02-07,09-12],bertoldi01,dae1[1-4],dae2[1-4],davis0[1-4],dkelley01,eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2],enj[01-09,12],giribet0[1-4],holyhbs0[1-3],holyconroy0[1-6],holygiribet0[1-6],hoekstrafs1,hoekstrafs2,holmanfs1,holyhoekstra0[1-4],holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-8],holy2a1110[1-8],holy2a1120[1-8],holy2a1310[1-8],holy2a1320[1-8],holy2a1330[1-8],holy2a1410[1-8],holy2a1420[1-8],holy2a1430[1-8],holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8],holy2a1620[1-8],holy2a1710[1-8],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8],holy2a1820[1-8],holy2a1830[1-8],holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-8],holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0710[1-8],holy2b0720[1-8],holy2b0910[1-8],holy2b0920[1-8],holy2b0930[1-3],holybigmem0[1-8],holygpu[01-16],holymoorcroft0[1-8],holyseasgpu[01-13],holystat0[1-9],holystat1[0-9],holystat2[0-2],hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[2-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702,hsph0[5-6],irizarry0[1-2],itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111,jenny0[2,4],karplus0[1-4],kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4],leroy0[1-2,4],midas0[1-2],mvogels[01-32],moorcroft[01-16,18-29,31-39],ncf270[1-7],nelson0[1-2],regal[01-18],sandy-rc0[1-4],seasgpu0[1-9],seasgpu1[0-5],shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50,shock0[1-4,6-7],shock12,supermicgpu01,wofsy01[1-4],wofsy02[1-3],xie01,zorana0[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=46578 TotalNodes=1014 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=regal AllowGroups=rc_admin,slurm_group_regal,hepl AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=regal[01-18] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=288 TotalNodes=18 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=resonance AllowGroups=rc_admin,resonance AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=seasgpu0[1-9],seasgpu1[0-5] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=90 TotalNodes=15 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=holyseasgpu AllowGroups=rc_admin,seas,computefestgpu AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyseasgpu[01-13] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=624 TotalNodes=13 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=serial_requeue AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=aag0[1-9],aag1[0-6],aaggpu0[1-8],adams0[1-7],airoldi[02-07,09-12],bertoldi01,dae1[1-4],dae2[1-4],davis0[1-4],eldorado0[9],eldorado1[0-9],eldorado2[0-9],eldorado3[0-9],eldorado4[0-4,6,8],eldorado5[1-2],enj[01-09,12],giribet0[1-4],holyconroy0[1-6],holygiribet0[1-6],holyhoekstra0[1-4],holy2a0110[1-8],holy2a0120[1-8],holy2a0210[1-8],holy2a0220[1-8],holy2a0310[1-8],holy2a0320[1-8],holy2a0330[1-8],holy2a0410[1-8],holy2a0420[1-8],holy2a0430[1-8],holy2a0510[1-8],holy2a0520[1-8],holy2a0610[1-8],holy2a0620[1-8],holy2a0710[1-8],holy2a0720[1-8],holy2a0730[1-8],holy2a0810[1-8],holy2a0820[1-8],holy2a0830[1-8],holy2a0910[1-8],holy2a0920[1-8],holy2a0930[1-8],holy2a1110[1-8],holy2a1120[1-8],holy2a1310[1-8],holy2a1320[1-8],holy2a1330[1-8],holy2a1410[1-8],holy2a1420[1-8],holy2a1430[1-8],holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8],holy2a1620[1-8],holy2a1710[1-8],holy2a1820[1-8],holy2a1910[1-8],holy2a1920[1-8],holy2a2010[1-8],holy2a2020[1-8],holy2a2110[1-8],holy2a2120[1-8],holy2a2130[1-8],holy2a2210[1-8],holy2a2220[1-8],holy2a2230[1-8],holy2a2310[1-8],holy2a2320[1-8],holy2a2410[1-8],holy2a2420[1-7],holy2b0720[1-8],holy2b0910[1-8],holy2b0930[1-2],holybigmem0[1-8],holygpu[01-16],holymoorcroft0[1-8],holyseasgpu[01-13],holystat0[1-9],holystat1[0-9],holystat2[0-2],hsph0[5-6],irizarry0[1-2],jenny0[2,4],karplus0[1-4],itc012,itc022,itc041,itc05[1-2],itc06[1-2],itc07[1-2],itc08[1-2],itc09[1-2],itc101,itc111,kou1[1-4],kou2[1-4],kou3[1-4],kou4[1-4],kou5[1-4],leroy0[1-2,4],midas0[1-2],moorcroft[01-16,18-29,31-39],mvogels[01-32],nelson0[1-2],regal[01-18],sandy-rc0[1-4],shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50,shock0[1-4,6-7],shock12,supermicgpu01,wofsy01[1-4],wofsy02[1-3],xie01,zorana0[1-2],holy2a1720[1-8],holy2a1730[1-8],holy2a1810[1-8],holy2b0710[1-8],holy2b0510[1-8],holy2b0520[1-8],holy2b0530[1-8],holy2b0920[1-8],hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[2-4],hp170[1-4],hp180[1-4],hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],hp250[1-4],hp260[1-3],hp2702 Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=45088 TotalNodes=975 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shakhnovich AllowGroups=rc_admin,shakhnovich_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1510[1-8],holy2a1520[1-8],holy2a1610[1-8] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1536 TotalNodes=24 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shakgpu AllowGroups=rc_admin,shakhnovich_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=shakgpu0[1-9],shakgpu1[0-9],shakgpu2[0-9],shakgpu3[0-9],shakgpu4[0-9],shakgpu50 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=1600 TotalNodes=50 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=shock AllowGroups=rc_admin,stewart_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=shock0[1-4,6-7],shock12 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=60 TotalNodes=7 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=spierce AllowGroups=rc_admin,spierce_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holyconroy0[1-4] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=stats AllowGroups=rc_admin,airoldi_lab,rubin_lab,bornn_lab,liu,miratrix_lab,stat115,stat221,slurm_group_stats AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holystat0[1-9],holystat1[0-9],holystat2[0-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=336 TotalNodes=22 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=unrestricted AllowGroups=rc_admin,cluster_users AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a1820[1-8] Priority=2 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=vogelsberger AllowGroups=rc_admin,vogelsberger_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=mvogels[01-32] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=2048 TotalNodes=32 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=wofsy AllowGroups=rc_admin,wofsy_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=wofsy01[1-4],wofsy02[1-3] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=84 TotalNodes=7 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=wolkovich AllowGroups=rc_admin,wolkovich_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=holy2a0930[3-6] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=256 TotalNodes=4 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=xie AllowGroups=rc_admin,xie_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=xie01 Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=zorana AllowGroups=rc_admin,brenner_lab AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=00:10:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=zorana0[1-2] Priority=10 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=N/A DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Created attachment 2110 [details]
Current slurm.conf
Thanks for the data. This is the cause of the core dump:

(gdb) print *job_ptr->job_resrcs
Cannot access memory at address 0x0

Now we have to try to reproduce it to find the sequence of events leading to this.

David
David,

Thanks for the info. So do you think this is tied to a node in the kuang_hp partition? That is where we kept having issues.

~Scott
The job that caused the core dump is in kuang_hp; that's empirical evidence. Could you also print from the core file:

(gdb) print job_ptr->part_ptr
(gdb) print job_ptr->qos_ptr

Besides the job_resrcs being NULL, the job looks pretty normal to me.

Thanks,
David
Another strange thing is that the job does not have a batch host set, as if something set batch_host to NULL. The job is being started after your PrologSlurmctld is executed. What does the prolog do? Would it be possible to run without it for a while?

David
Created attachment 2111 [details]
slurmctld_prolog

From slurm.conf:

#Prolog=/usr/local/bin/slurm_prolog
PrologSlurmctld=/usr/local/sbin/slurmctld_prolog

It looks like we don't use a job-level prolog. I'm attaching the slurmctld_prolog.
(gdb) print job_ptr->part_ptr
$1 = (struct part_record *) 0x244f760
(gdb) print job_ptr->qos_ptr
$2 = (void *) 0x2162af0
Sorry, I made a mistake; I meant:

(gdb) print *job_ptr->part_ptr

The reason I was asking about the PrologSlurmctld is that there are two code paths that start a job, depending on whether this parameter is configured. If it is configured, the thread that runs the prolog starts the job; otherwise the main slurmctld background thread does. I don't know if this is relevant to what we see, but it would be worth trying. Can you disable the PrologSlurmctld for a while and then enable the kuang_hp partition? You can even add the reservation back as before.

David
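To illustrate the kind of race being suspected here (a generic sketch with hypothetical names, not SLURM's actual code): if the prolog-path thread launches the job while another thread tears down and NULLs the job's resources, the launcher can observe job_resrcs == NULL with nothing enforcing a safe ordering.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical job with a pointer that two threads touch unsynchronized. */
struct job { char *job_resrcs; };
static struct job the_job;

static void *launch_thread(void *arg)   /* stands in for the prolog path */
{
    (void)arg;
    char *res = the_job.job_resrcs;     /* unsynchronized read */
    if (res == NULL)
        fprintf(stderr, "launch: job_resrcs is NULL -> would segfault\n");
    else
        printf("launch: using resources '%s'\n", res);
    return NULL;
}

static void *revoke_thread(void *arg)   /* e.g. allocation torn down */
{
    (void)arg;
    char *res = the_job.job_resrcs;
    the_job.job_resrcs = NULL;          /* unsynchronized write */
    free(res);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    the_job.job_resrcs = strdup("hp0101");
    pthread_create(&t1, NULL, launch_thread, NULL);
    pthread_create(&t2, NULL, revoke_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Depending on scheduling, the launch thread sees either a valid pointer or NULL; the real slurmctld would dereference it either way, which is why the outcome is an intermittent segfault rather than a reproducible one.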
(gdb) print *job_ptr->part_ptr
$1 = {allow_accounts = 0x0, allow_account_array = 0x0, allow_alloc_nodes = 0x0, allow_groups = 0x2443cc0 "rc_admin,kuang_lab,tziperman_lab,stewart_lab,slurm_group_kuang", allow_uids = 0x7e97ac0, allow_qos = 0x0, allow_qos_bitstr = 0x0, alternate = 0x0, def_mem_per_cpu = 0, default_time = 10, deny_accounts = 0x0, deny_account_array = 0x0, deny_qos = 0x0, deny_qos_bitstr = 0x0, flags = 0, grace_time = 0, magic = 0, max_cpus_per_node = 4294967295, max_mem_per_cpu = 0, max_nodes = 4294967295, max_nodes_orig = 4294967295, max_offset = 0, max_share = 1, max_time = 4294967295, min_nodes = 1, min_offset = 0, min_nodes_orig = 1, name = 0x244f850 "kuang_hp", node_bitmap = 0x776cb70, nodes = 0x244f880 "hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],hp190[1-4"..., norm_priority = 1, preempt_mode = 65534, priority = 10, state_up = 3, total_nodes = 85, total_cpus = 1020, cr_type = 0}
(In reply to David Bigagli from comment #21)

We have disabled the PrologSlurmctld and enabled the kuang_hp partition.
No core dump in the past ~24h? David
Nope. It has been smooth sailing.
There is probably a race condition somewhere causing this. I will provide a fix to prevent the core dump for now.

David
In 14.11.8, commit 2d8d92aab90a892, we have already introduced code to prevent a core dump should this problem happen. Upgrading will improve slurmctld stability should this happen again.

David
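For reference, a minimal sketch of what such a defensive guard can look like (an assumed shape, not the literal commit): check job_resrcs before dereferencing, log the missing-job_resrcs condition, and fail that one launch instead of crashing the controller.

#include <stdio.h>

/* Simplified stand-ins for the real SLURM structures (hypothetical). */
struct job_resources { char *nodes; };
struct job_record    { unsigned job_id; struct job_resources *job_resrcs; };
struct cred_arg      { char *job_hostlist; };

/* Guarded version of the earlier faulting pattern: refuse to build the
 * credential when the job has no resources, so slurmctld stays up. */
static int make_batch_job_cred_guarded(struct job_record *job_ptr,
                                       struct cred_arg *cred)
{
    struct job_resources *job_resrcs_ptr = job_ptr->job_resrcs;

    if (job_resrcs_ptr == NULL || job_resrcs_ptr->nodes == NULL) {
        fprintf(stderr, "error: job %u missing job_resrcs info\n",
                job_ptr->job_id);
        return -1;   /* fail this launch instead of segfaulting */
    }
    cred->job_hostlist = job_resrcs_ptr->nodes;
    return 0;
}

int main(void)
{
    struct job_record job = { .job_id = 45724543, .job_resrcs = NULL };
    struct cred_arg cred;
    return make_batch_job_cred_guarded(&job, &cred) ? 1 : 0;
}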
Okay, great. We will do an upgrade in the morning and also put back the slurmctld_prolog.
Hi, I saw you have upgraded. Please reopen should you see the error message in the log file that states: "job xyz missing job_resrcs info".

David