Running 2.6.3 plus patches for http://bugs.schedmd.com/show_bug.cgi?id=445, 0b68c2edceb54c66f796f9084148e78e052ca5bb, and e1dce4a5a9668ffb0382ddaf1200be80aa118397. Here's the backtrace; I still have the core file if you want it, too.

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  make_batch_job_cred (launch_msg_ptr=0x2b9dc00043d8, job_ptr=<value optimized out>) at job_scheduler.c:1295
1295    job_scheduler.c: No such file or directory.
        in job_scheduler.c
Missing separate debuginfos, use: debuginfo-install slurm-2.6.3-1rc1.el6.x86_64
(gdb) bt
#0  make_batch_job_cred (launch_msg_ptr=0x2b9dc00043d8, job_ptr=<value optimized out>) at job_scheduler.c:1295
#1  0x000000000044eb46 in build_launch_job_msg (job_ptr=0x2b9df41511e8) at job_scheduler.c:1200
#2  0x000000000044eddb in launch_job (job_ptr=0x2b9df41511e8) at job_scheduler.c:1256
#3  0x000000000044f28d in _run_prolog (arg=<value optimized out>) at job_scheduler.c:2256
#4  0x0000003b2aa07851 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b2a6e890d in clone () from /lib64/libc.so.6
(gdb) print job_resrcs_ptr
$1 = (job_resources_t *) 0x0
(gdb) frame 2
#2  0x000000000044eddb in launch_job (job_ptr=0x2b9df41511e8) at job_scheduler.c:1256
1256    in job_scheduler.c
(gdb) print job_ptr->job_resrcs
$5 = (job_resources_t *) 0x0
(gdb) print *job_ptr
$6 = {account = 0x2b9df4109c28 "cluster_users", alias_list = 0x0, alloc_node = 0x2b9df4109c08 "heroint1", alloc_resp_port = 0, alloc_sid = 22789, array_job_id = 0, array_task_id = 65534, assoc_id = 1501, assoc_ptr = 0x101df38, batch_flag = 1, batch_host = 0x5079128 "holy2b05108", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, db_index = 2437177, derived_ec = 0, details = 0x2b9df4109a08, direct_set_prio = 0, end_time = 1382236702, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x5079178 "", gres_req = 0x5079078 "", gres_used = 0x0, group_id = 40122, job_id = 2336227, job_next = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x2b9df4109c48 "sstokes", magic = 4038539564, name = 0x2b9df4109bb8 "1350_2146_as3620120427C_dk113_filtp3_weight3_gs_dp0000_jack51", network = 0x0, next_step_id = 0, nodes = 0x51f9538 "holy2b05108", node_addr = 0x5224da8, node_bitmap = 0x6c8fbc0, node_bitmap_cg = 0x0, node_cnt = 1, nodes_completing = 0x0, other_port = 0, partition = 0x2b9df4151488 "serial_requeue", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x2b9bd40b7108, pre_sus_time = 0, preempt_time = 0, priority = 143994309, priority_array = 0x0, prio_factors = 0x2b9df4109b68, profile = 0, qos_id = 1, qos_ptr = 0xfc5328, restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x2b9df40053d8, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 0, state_desc = 0x0, state_reason = 0, step_list = 0x288c0e8, suspend_time = 0, time_last_active = 1382150302, time_limit = 4294967294, time_min = 0, tot_sus_time = 0, total_cpus = 1, total_nodes = 1, user_id = 50024, wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
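For anyone reading along without the source handy, the failure mode implied by the gdb session is a plain NULL dereference: the job was launched with job_ptr->job_resrcs == NULL (gdb values $1 and $5 above), and the credential-building code dereferences that pointer. A minimal self-contained illustration of the pattern follows; the struct and field names here are hypothetical, not the actual make_batch_job_cred() body.

    /* Illustrative sketch only, not SLURM source. Frame #0 faults
     * because job_ptr->job_resrcs is NULL for this job and the first
     * field access through the pointer raises SIGSEGV. */
    struct job_resources { unsigned nhosts; };
    struct job_record    { struct job_resources *job_resrcs; };

    static unsigned nhosts_for_cred(const struct job_record *job_ptr)
    {
            /* SIGSEGV here when job_resrcs == NULL, as at
             * job_scheduler.c:1295 in the backtrace above. */
            return job_ptr->job_resrcs->nhosts;
    }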
Created attachment 457: slurmctld logs since process start
I am not certain exactly what the problem is, but I suspect it is some race condition related to job preemption (from your log):

Oct 18 21:52:32 holy-slurm01 slurmctld[39785]: _slurm_rpc_submit_batch_job JobId=2336227 usec=2121609
Oct 18 22:38:22 holy-slurm01 slurmctld[39785]: sched: Allocate JobId=2336227 NodeList=holy2b05108 #CPUs=2
Oct 18 22:38:32 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:32 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:33 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending
Oct 18 22:38:33 holy-slurm01 slurmctld[39785]: error: find_preemptable_jobs: job 2336227 not pending

I will keep looking and update you if I have more information to report.

The other thing I want to raise about your configuration is that you definitely do not want "default_queue_depth=50000". That parameter defines how far down the queue the scheduler should test every time a job is submitted or ends (unless you set "defer"), or whenever the scheduler runs on other events, and it does not relinquish locks for other events while doing so. The default value is 100; setting the value to 50000 will result in really poor responsiveness. Let the backfill scheduler do the heavy lifting instead: it releases its locks every couple of seconds so that other things can happen. I will add information to the documentation on this matter.
Updated documentation:

SchedulerParameters
       The interpretation of this parameter varies by SchedulerType. Multiple options may be comma separated.

       default_queue_depth=#
              The default number of jobs to attempt scheduling (i.e. the queue depth) when a running job completes or other routine actions occur. The full queue will be tested on a less frequent basis. The default value is 100. In the case of large clusters (more than 1000 nodes), configuring a relatively small value may be desirable. Specifying a large value (say 1000 or higher) can be expected to result in poor system responsiveness since this scheduling logic will not release locks for other events to occur. It would be better to let the backfill scheduler process a larger number of jobs (see max_job_bf, bf_continue and other options here for more information).

       defer
              Setting this option will avoid attempting to schedule each job individually at job submit time, but defer it until a later time when scheduling multiple jobs simultaneously may be possible. This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs. Also see default_queue_depth above.

       bf_continue
              The backfill scheduler periodically releases locks in order to permit other operations to proceed rather than blocking all activity for what could be an extended period of time. Setting this option will cause the backfill scheduler to continue processing pending jobs from its original job list after releasing locks even if job or node state changes. This can result in lower priority jobs being backfill scheduled instead of newly arrived higher priority jobs, but will permit more queued jobs to be considered for backfill scheduling.

       bf_interval=#
              The number of seconds between iterations. Higher values result in less overhead and better responsiveness. The default value is 30 seconds. This option applies only to SchedulerType=sched/backfill.

       bf_max_job_part=#
              The maximum number of jobs per partition to attempt backfill scheduling for, not counting jobs which cannot be started due to an association resource limit. This can be especially helpful for systems with large numbers of partitions and jobs. The default value is 0, which means no limit. This option applies only to SchedulerType=sched/backfill.

       bf_max_job_user=#
              The maximum number of jobs per user to attempt backfill scheduling for, not counting jobs which cannot be started due to an association resource limit. One can set this limit to prevent users from flooding the backfill queue with jobs that cannot start and that prevent jobs from other users from starting. This is similar to the MAXIJOB limit in Maui. The default value is 0, which means no limit. This option applies only to SchedulerType=sched/backfill.

       bf_resolution=#
              The number of seconds in the resolution of data maintained about when jobs begin and end. Higher values result in less overhead and better responsiveness. The default value is 60 seconds. This option applies only to SchedulerType=sched/backfill.

       bf_window=#
              The number of minutes into the future to look when considering jobs to schedule. Higher values result in more overhead and less responsiveness. The default value is 1440 minutes (one day). A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution. This option applies only to SchedulerType=sched/backfill.

       max_job_bf=#
              The maximum number of jobs to attempt backfill scheduling for (i.e. the queue depth). Higher values result in more overhead and less responsiveness. Until an attempt is made to backfill schedule a job, its expected initiation time value will not be set. The default value is 50. This option applies only to SchedulerType=sched/backfill.

       max_depend_depth=#
              Maximum number of jobs to test for a circular job dependency. Stop testing after this number of job dependencies have been tested. The default value is 10 jobs.

       max_switch_wait=#
              Maximum number of seconds that a job can delay execution waiting for the specified desired switch count. The default value is 300 seconds.
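As a concrete illustration of the options documented above, a backfill configuration in slurm.conf might look like the following. The values shown are simply the documented defaults plus the bf_continue flag, not recommendations for any particular site:

    SchedulerType=sched/backfill
    SchedulerParameters=default_queue_depth=100,bf_continue,bf_interval=30,bf_resolution=60,bf_window=1440,max_job_bf=50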
Created attachment 458: likely fix for the problem

This should fix the problem. Also see the commit here:
https://github.com/SchedMD/slurm/commit/ea1b316c60ad2863bdc39bb2610229024bce17fa

If the backfill scheduler relinquishes its locks and the normal job scheduler starts a job that the backfill scheduler was actively working on, the backfill scheduler will try to re-schedule that same job, possibly resulting in an invalid memory reference or other badness.
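The idea behind the fix is that any job record saved before the backfill scheduler yields its locks must be re-validated after the locks are re-acquired, since the main scheduler may have started the job (or the record may have been freed and re-used) in the meantime. The following is a minimal self-contained sketch of that kind of guard, not the commit itself; the struct, magic value, and state encoding are hypothetical stand-ins (see the linked commit for the real change in the backfill plugin):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical job record for illustration only (not SLURM source). */
    struct job_record_sketch {
            uint32_t magic;      /* validity cookie set while record is live */
            uint32_t job_id;
            int      job_state;  /* 0 == pending in this sketch */
    };

    #define SKETCH_JOB_MAGIC 0xf0f0f0f0u   /* arbitrary illustrative value */
    #define SKETCH_PENDING   0

    /* After locks are re-acquired, decide whether the job pointer saved
     * before the yield is still safe for the backfill pass to schedule. */
    static bool job_still_schedulable(const struct job_record_sketch *job,
                                      uint32_t saved_job_id)
    {
            if (job == NULL || job->magic != SKETCH_JOB_MAGIC)
                    return false;   /* record was freed or re-used */
            if (job->job_id != saved_job_id)
                    return false;   /* slot now holds a different job */
            if (job->job_state != SKETCH_PENDING)
                    return false;   /* main scheduler already started it */
            return true;
    }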
Sorry for all the problems here. The bf_continue option is newly added and there are a lot of things that can happen when the scheduler relinquishes locks.

Besides this patch, I recommend changing:
default_queue_depth to something much smaller, like 100, otherwise performance will be bad
max_job_bf to your queue depth, say 50000
bf_max_job_part, I'm not sure what would make sense for you, perhaps 100
bf_resolution=300 to reduce backfill scheduler overhead
bf_continue should be safe now

Perhaps you could send me some of your slurmctld log after applying the patch and the configuration changes recommended above.
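Taken together, those recommendations would look something like this in slurm.conf (the bf_max_job_part value is only the guess noted above; adjust for your site):

    SchedulerParameters=default_queue_depth=100,max_job_bf=50000,bf_max_job_part=100,bf_resolution=300,bf_continue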
I'm going to close this based upon the patch. Please re-open if necessary. We should tag version 2.6.4 within a couple of weeks, and it will contain a number of bug fixes specifically for your environment.