Ticket 470 - Segfault on job start due to NULL job->job_resrcs
Summary: Segfault on job start due to NULL job->job_resrcs
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.x
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-10-19 04:03 MDT by John Morrissey
Modified: 2013-10-22 07:51 MDT
CC List: 1 user

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld logs since process start (814.93 KB, application/octet-stream)
2013-10-19 04:04 MDT, John Morrissey
likely fix for the problem (1.11 KB, patch)
2013-10-19 14:33 MDT, Moe Jette

Description John Morrissey 2013-10-19 04:03:06 MDT
Running 2.6.3 plus patches for http://bugs.schedmd.com/show_bug.cgi?id=445, 0b68c2edceb54c66f796f9084148e78e052ca5bb, and e1dce4a5a9668ffb0382ddaf1200be80aa118397.

Here's the backtrace; still have the core if you want it, too.


Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  make_batch_job_cred (launch_msg_ptr=0x2b9dc00043d8, job_ptr=<value optimized out>)
    at job_scheduler.c:1295
1295	job_scheduler.c: No such file or directory.
	in job_scheduler.c
Missing separate debuginfos, use: debuginfo-install slurm-2.6.3-1rc1.el6.x86_64
(gdb) bt
#0  make_batch_job_cred (launch_msg_ptr=0x2b9dc00043d8, job_ptr=<value optimized out>)
    at job_scheduler.c:1295
#1  0x000000000044eb46 in build_launch_job_msg (job_ptr=0x2b9df41511e8)
    at job_scheduler.c:1200
#2  0x000000000044eddb in launch_job (job_ptr=0x2b9df41511e8) at job_scheduler.c:1256
#3  0x000000000044f28d in _run_prolog (arg=<value optimized out>) at job_scheduler.c:2256
#4  0x0000003b2aa07851 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b2a6e890d in clone () from /lib64/libc.so.6
(gdb) print job_resrcs_ptr
$1 = (job_resources_t *) 0x0
(gdb) frame 2
#2  0x000000000044eddb in launch_job (job_ptr=0x2b9df41511e8) at job_scheduler.c:1256
1256	in job_scheduler.c
(gdb) print job_ptr->job_resrcs
$5 = (job_resources_t *) 0x0
(gdb) print *job_ptr
$6 = {account = 0x2b9df4109c28 "cluster_users", alias_list = 0x0, 
  alloc_node = 0x2b9df4109c08 "heroint1", alloc_resp_port = 0, alloc_sid = 22789, 
  array_job_id = 0, array_task_id = 65534, assoc_id = 1501, assoc_ptr = 0x101df38, 
  batch_flag = 1, batch_host = 0x5079128 "holy2b05108", check_job = 0x0, 
  ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, 
  db_index = 2437177, derived_ec = 0, details = 0x2b9df4109a08, direct_set_prio = 0, 
  end_time = 1382236702, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x5079178 "", gres_req = 0x5079078 "", gres_used = 0x0, 
  group_id = 40122, job_id = 2336227, job_next = 0x0, job_resrcs = 0x0, job_state = 1, 
  kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, 
  limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, 
  limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, 
  mail_user = 0x2b9df4109c48 "sstokes", magic = 4038539564, 
  name = 0x2b9df4109bb8 "1350_2146_as3620120427C_dk113_filtp3_weight3_gs_dp0000_jack51", 
  network = 0x0, next_step_id = 0, nodes = 0x51f9538 "holy2b05108", 
  node_addr = 0x5224da8, node_bitmap = 0x6c8fbc0, node_bitmap_cg = 0x0, node_cnt = 1, 
  nodes_completing = 0x0, other_port = 0, partition = 0x2b9df4151488 "serial_requeue", 
  part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x2b9bd40b7108, 
  pre_sus_time = 0, preempt_time = 0, priority = 143994309, priority_array = 0x0, 
  prio_factors = 0x2b9df4109b68, profile = 0, qos_id = 1, qos_ptr = 0xfc5328, 
  restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, 
  resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x2b9df40053d8, 
  spank_job_env = 0x0, spank_job_env_size = 0, start_time = 0, state_desc = 0x0, 
  state_reason = 0, step_list = 0x288c0e8, suspend_time = 0, 
  time_last_active = 1382150302, time_limit = 4294967294, time_min = 0, 
  tot_sus_time = 0, total_cpus = 1, total_nodes = 1, user_id = 50024, 
  wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, 
  wait4switch = 0, best_switch = true, wait4switch_start = 0}
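For context, frame #0 above is dereferencing job_ptr->job_resrcs while that pointer is NULL (see the gdb print output). Below is a minimal sketch of the failing access pattern, with simplified names and everything else make_batch_job_cred() does omitted; it is illustrative only, not the actual Slurm source and not the fix adopted later in this ticket:

    /* Sketch only: shows why a NULL job_resrcs leads to the segfault in
     * frame #0.  struct job_record, job_resources_t, SLURM_SUCCESS and
     * SLURM_ERROR follow Slurm naming; the body is simplified. */
    static int make_batch_job_cred_sketch(struct job_record *job_ptr)
    {
            job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;

            /* With job_resrcs == 0x0, as in the core file above, the first
             * dereference of job_resrcs_ptr faults.  A defensive guard
             * would look like this: */
            if (job_resrcs_ptr == NULL)
                    return SLURM_ERROR;

            /* ... build the batch credential from job_resrcs_ptr
             * (node list, CPU counts, memory) ... */
            return SLURM_SUCCESS;
    }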
Comment 1 John Morrissey 2013-10-19 04:04:58 MDT
Created attachment 457 [details]
slurmctld logs since process start
Comment 2 Moe Jette 2013-10-19 13:31:50 MDT
I am not certain exactly what the problem is, but suspect it to be some race condition related to job preemption (from your log):
Oct 18 21:52:32 holy-slurm01 slurmctld[39785]: _slurm_rpc_submit_batch_job JobId=2336227 usec=2121609
Oct 18 22:38:22 holy-slurm01 slurmctld[39785]: sched: Allocate JobId=2336227 NodeList=holy2b05108 #CPUs=2
Oct 18 22:38:32 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:32 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:33 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:33 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
I will keep looking and update you if I have more information to report.

The other thing I want to raise about your configuration is that you definitely do not want "default_queue_depth=50000". That parameter defines how far down the queue the scheduler tests every time a job is submitted or ends (unless you have the defer option set) and whenever the scheduler runs on other events, and it does not relinquish locks for other events while doing so. The default value is 100; setting it to 50000 will result in really poor responsiveness. Let the backfill scheduler do the heavy lifting: it releases its locks every couple of seconds so that other things can happen. I will add information to the documentation on this matter.
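To make the contrast concrete, here are hypothetical slurm.conf lines (not copied from this site's configuration) showing the setting being discussed:

    # Very deep event-driven pass: the main scheduler tests tens of
    # thousands of jobs while holding locks on every submit/completion event
    SchedulerParameters=default_queue_depth=50000

    # Default: short event-driven pass; the backfill scheduler does the
    # deep queue sweep and periodically releases its locks
    SchedulerParameters=default_queue_depth=100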
Comment 3 Moe Jette 2013-10-19 13:43:46 MDT
Updated documentation:

       SchedulerParameters
              The interpretation of this parameter  varies  by  SchedulerType.
              Multiple options may be comma separated.

              default_queue_depth=#
                     The  default  number  of jobs to attempt scheduling (i.e.
                     the queue depth) when a running job  completes  or  other
                     routine  actions occur.  The full queue will be tested on
                     a less frequent basis.  The default value is 100.  In the
                     case  of large clusters (more than 1000 nodes), configur‐
                     ing a relatively small value may be desirable.   Specify‐
                     ing a large value (say 1000 or higher) can be expected to
                     result in poor system responsiveness since this  schedul‐
                     ing  logic  will  not  release  locks for other events to
                     occur.  It would be better to let the backfill  scheduler
                     process  a larger number of jobs (see max_job_bf, bf_con‐
                     tinue  and other options here for more information).

              defer  Setting this option will  avoid  attempting  to  schedule
                     each  job  individually  at job submit time, but defer it
                     until a later time when scheduling multiple jobs simulta‐
                     neously  may be possible.  This option may improve system
                     responsiveness when large numbers of jobs (many hundreds)
                     are  submitted  at  the  same time, but it will delay the
                     initiation   time   of   individual   jobs.   Also    see
                     default_queue_depth above.

              bf_continue
                     The  backfill  scheduler  periodically  releases locks in
                     order to permit other operations to proceed  rather  than
                     blocking  all  activity  for  what  could  be an extended
                     period of time.  Setting this option will cause the back‐
                     fill  scheduler  to continue processing pending jobs from
                     its original job list after releasing locks even  if  job
                      or node state changes.  This can result in lower priority
                      jobs being backfill  scheduled  instead  of  newly
                     arrived higher priority jobs, but will permit more queued
                     jobs to be considered for backfill scheduling.

              bf_interval=#
                     The number of seconds between iterations.  Higher  values
                      result in less overhead and  less  responsiveness.   The
                     default value is 30 seconds.  This option applies only to
                     SchedulerType=sched/backfill.

              bf_max_job_part=#
                     The maximum number of jobs per partition to attempt back‐
                     fill scheduling for, not counting jobs  which  cannot  be
                     started due to an association resource limit. This can be
                     especially helpful for systems with large numbers of par‐
                     titions and jobs.  The default value is 0, which means no
                     limit.   This   option   applies   only   to   Scheduler‐
                     Type=sched/backfill.

              bf_max_job_user=#
                     The  maximum  number of jobs per user to attempt backfill
                     scheduling for, not counting jobs which cannot be started
                     due  to  an association resource limit.  One can set this
                     limit to prevent users from flooding the  backfill  queue
                     with  jobs  that  cannot start and that prevent jobs from
                      other users from starting.  This is similar to the MAXIJOB
                     limit  in  Maui.   The default value is 0, which means no
                     limit.   This   option   applies   only   to   Scheduler‐
                     Type=sched/backfill.

              bf_resolution=#
                     The  number  of  seconds  in the resolution of data main‐
                     tained about when jobs  begin  and  end.   Higher  values
                     result  in  less overhead and better responsiveness.  The
                     default value is 60 seconds.  This option applies only to
                     SchedulerType=sched/backfill.

              bf_window=#
                     The  number  of minutes into the future to look when con‐
                     sidering jobs to schedule.  Higher values result in  more
                     overhead  and  less responsiveness.  The default value is
                     1440 minutes (one day).  A value at least as long as  the
                     highest allowed time limit is generally advisable to pre‐
                      vent job starvation.  In order to limit the amount of data
                     managed  by  the  backfill  scheduler,  if  the  value of
                     bf_window is increased, then it is generally advisable to
                      also increase bf_resolution.  This option  applies  only
                     to SchedulerType=sched/backfill.

              max_job_bf=#
                     The maximum number of jobs to attempt backfill scheduling
                     for (i.e. the queue depth).  Higher values result in more
                     overhead and less responsiveness.  Until  an  attempt  is
                     made  to backfill schedule a job, its expected initiation
                     time value will not be set.  The  default  value  is  50.
                     This option applies only to SchedulerType=sched/backfill.

               max_depend_depth=#
                      Maximum number of jobs to test for a circular job depen‐
                      dency. Stop testing after this number of job dependencies
                      have been tested. The default value is 10 jobs.

               max_switch_wait=#
                      Maximum number of seconds that a job can delay execution
                      waiting for the specified desired  switch  count.   The
                      default value is 300 seconds.
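
As a concrete illustration of how these options are specified together, the line below simply restates the documented defaults (it is not a recommendation for any particular site):

    SchedulerParameters=default_queue_depth=100,bf_interval=30,bf_resolution=60,bf_window=1440,max_job_bf=50,max_depend_depth=10,max_switch_wait=300

Options such as defer and bf_continue take no value and are enabled by listing them, e.g. appending ",bf_continue" to the value above.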
Comment 4 Moe Jette 2013-10-19 14:33:30 MDT
Created attachment 458 [details]
likely fix for the problem

This should fix the problem.

Also see commit here:
https://github.com/SchedMD/slurm/commit/ea1b316c60ad2863bdc39bb2610229024bce17fa

If the backfill scheduler relinquishes locks and the normal job
scheduler starts a job that the backfill scheduler was actively
working, the backfill scheduler will try to re-schedule that
same job, possibly resulting in an invalid memory reference
or other badness.
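
The general shape of that fix, sketched below, is to re-validate a job's state after the backfill scheduler reacquires its locks. This is only an illustration of the pattern described above, not the actual patch or commit; IS_JOB_PENDING, list_next and struct job_record follow Slurm naming, while the locks_were_released flag is hypothetical:

    /* Illustrative backfill loop: after yielding and reacquiring the job and
     * node locks (bf_continue), a job from the original list may already have
     * been started by the main scheduler, so check it again before doing any
     * further scheduling work that touches job_ptr->job_resrcs etc. */
    while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
            if (locks_were_released && !IS_JOB_PENDING(job_ptr)) {
                    /* Started, completed, or cancelled while unlocked; skip
                     * it rather than trying to schedule it a second time. */
                    continue;
            }
            /* ... normal backfill evaluation of job_ptr ... */
    }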
Comment 5 Moe Jette 2013-10-19 14:43:44 MDT
Sorry for all the problems here. The bf_continue option is newly added, and there are a lot of things that can happen when the scheduler relinquishes locks.

Besides this patch, I recommend changing:
default_queue_depth to something much smaller, like 100, otherwise performance will be bad
max_job_bf to your queue depth, say 50000
bf_max_job_part, I'm not sure what would make sense for you, perhaps 100
bf_resolution=300 to reduce backfill scheduler overhead
bf_continue should be safe now
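
Assembled into a single slurm.conf line, those suggestions would look roughly like this (bf_max_job_part=100 is only the tentative value mentioned above):

    SchedulerParameters=default_queue_depth=100,max_job_bf=50000,bf_max_job_part=100,bf_resolution=300,bf_continue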


Perhaps you could send me some of your slurmctld log after applying the patch and the configuration changes recommended above.
Comment 6 Moe Jette 2013-10-22 07:51:40 MDT
I'm going to close this based upon the patch. Please re-open if necessary. We should tag version 2.6.4 within a couple of weeks; it contains a number of bug fixes specifically for your environment.