Created attachment 831 [details] core dump of the invalid assoc_ptr failure.

We are running 2.6.9 with some patches. We had been running stably for quite some time, but this afternoon the master tipped over. After being unable to restart it, we managed to find out that these jobs were causing the problem:

[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7480837
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7485672
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7485676
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=8096192
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=8426686
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=9521413

I had seen these errors before, but they had never caused a failure. However, for whatever reason this time the master died basically on startup. When I removed these jobs the scheduler started working again without any problems. I've attached the core dump so you can take a look.
Paul, could you please just send the backtrace? Without your complete build, the core file is of no use to us. A "thread apply all bt full" should be sufficient.
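Something along these lines run against the core should produce it (the slurmctld path and core file name below are just placeholders for wherever yours live):

gdb --batch -ex "thread apply all bt full" /usr/sbin/slurmctld /path/to/core > backtrace.txt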
Created attachment 832 [details] backtrace of failure
I've attached it. While trying to get it back up, I ran slurmctld straight from the command line and saw this error as well:

slurmctld: debug3: sched: JobId=9980436. State=PENDING. Reason=Resources. Priority=69888458. Partition=bigmem.
slurmctld: bitstring.c:182: bit_test: Assertion `(b) != ((void *)0)' failed.

As I said before, we are back and stable, so while this bug does cause the scheduler to crash hard, we are not in that state anymore.
Paul, could you send me the complete output from that last start? I need that debug3 logging turned on :) This doesn't appear to be related to associations, but to reservations.
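If you no longer have that output, rerunning the daemon in the foreground with extra verbosity should reproduce it; roughly something like this (the exact number of -v flags needed to reach debug3 may differ on your build):

slurmctld -D -vvvvv > slurmctld_debug.log 2>&1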
I am also guessing you modified that debug3 line; I don't see the Partition= portion in the main code. Would you mind sending me your job_scheduler.c file?
Created attachment 835 [details] debug log.
In your backtrace, in thread 1, could you run

print *job_ptr
print *job_ptr->resv_ptr
print *avail_node_bitmap

and send that output? Having your job_scheduler.c would be very handy as well.
Created attachment 836 [details] Here's our source RPM containing all of our build.
(gdb) thread 1
[Switching to thread 1 (Thread 0x2b4780100700 (LWP 1341))]#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
(gdb) print *job_ptr
No symbol "job_ptr" in current context.
(gdb) print *job_ptr->resv_ptr
No symbol "job_ptr" in current context.
(gdb) print *avail_node_bitmap
$1 = 1111704645
Could you do the same in frame 5?
(gdb) frame 5
#5  0x00000000004506c4 in schedule (job_limit=50) at job_scheduler.c:1035
1035	job_scheduler.c: No such file or directory.
	in job_scheduler.c
(gdb) print *job_ptr
$2 = {account = 0x1408728 "cluster_users", alias_list = 0x0, alloc_node = 0x1408708 "rclogin09", alloc_resp_port = 0, alloc_sid = 10625, array_job_id = 0, array_task_id = 65534, assoc_id = 3, assoc_ptr = 0x1300f98, batch_flag = 1, batch_host = 0x1408788 "moorcroft04", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, cr_enabled = 0, db_index = 10800089, derived_ec = 0, details = 0x1408548, direct_set_prio = 0, end_time = 0, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x1408748 "", gres_req = 0x1408768 "", gres_used = 0x0, group_id = 33118, job_id = 7485672, job_next = 0x0, job_resrcs = 0x14087c8, job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x14086c8 "4th_trial_prescribedPhenlast-USHa1", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, nodes_completing = 0x0, other_port = 0, partition = 0x14086a8 "moorcroft_6100", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x13bb958, pre_sus_time = 0, preempt_time = 0, priority = 69080342, priority_array = 0x0, prio_factors = 0x1408258, profile = 0, qos_id = 1, qos_ptr = 0x126b9f8, restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x14087a8, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 0, state_desc = 0x0, state_reason = 0, step_list = 0x132e668, suspend_time = 0, time_last_active = 1399654686, time_limit = 4294967295, time_min = 0, tot_sus_time = 0, total_cpus = 0, total_nodes = 1, user_id = 40746, wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->resv_ptr
Cannot access memory at address 0x0
(gdb) print *avail_node_bitmap
$3 = 1111704645
Thanks Paul, in the same frame could you give me

print *job_ptr->details
(gdb) print *job_ptr->details
$4 = {acctg_freq = 0x0, argc = 1, argv = 0x1408a38, begin_time = 1399652845, ckpt_dir = 0x12423e8 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast", contiguous = 0, cpu_bind = 0x0, cpu_bind_type = 0, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x14089e8, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 1, min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 1, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147485648, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, shared = 2, std_err = 0x0, std_in = 0x1408a18 "/dev/null", std_out = 0x14081d8 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast/USHa1/serial_lsf.out", submit_time = 1399652835, task_dist = 10, usable_nodes = 0, work_dir = 0x1408178 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast"}
Created attachment 838 [details] I believe this is the job_scheduler.c you are running

After applying the patches you sent, this appears to be what you are running. In frame 5 could you send me

print *assoc_ptr
print *assoc_ptr->usage
(gdb) print *assoc_ptr
value has been optimized out
(gdb) print *assoc_ptr->usage
value has been optimized out
What about this?

print *(slurmdb_association_rec_t *)job_ptr->assoc_ptr
(gdb) print *(slurmdb_association_rec_t *)job_ptr->assoc_ptr
$5 = {accounting_list = 0x0, acct = 0x1301068 "cluster_users", cluster = 0x1301088 "odyssey", def_qos_id = 0, grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, grp_cpus = 4294967295, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, grp_submit_jobs = 4294967295, grp_wall = 4294967295, id = 3, is_def = 0, lft = 1274, max_cpu_mins_pj = 4294967295, max_cpu_run_mins = 4294967295, max_cpus_pj = 4294967295, max_jobs = 10100, max_nodes_pj = 4294967295, max_submit_jobs = 4294967295, max_wall_pj = 4294967295, parent_acct = 0x13010a8 "root", parent_id = 1, partition = 0x0, qos_list = 0x12f9408, rgt = 8967, shares_raw = 100, uid = 4294967294, usage = 0x131e0d8, user = 0x0}
Could you also try the same thing with usage? I think

print *((slurmdb_association_rec_t *)job_ptr->assoc_ptr)->usage

will do it. Thanks.
Paul, it would appear this association isn't the correct association. It doesn't appear to be a user association. If you could find the user name of uid 40746 and send me this...

sacctmgr list assoc user=$USERNAME acct=cluster_users cluster=odyssey format=id

and

sacctmgr list assoc user='' acct=cluster_users cluster=odyssey format=id

My guess is the second one is 3, which would be incorrect. Not sure how that could have happened.
Paul, ok, I have a fix for you. It turns out it did have to do with the Invalid assoc_ptr message. Commit 2261d3939438e2291d1825a49037e892a68f8b14 should fix you up. Please reopen if you find otherwise.
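If you want to pick that change up on your current 2.6 build before you upgrade, one option is to pull it out of git as a patch and drop it into your source RPM next to your existing patches. A rough sketch, assuming a clone of the public slurm repository (github.com/SchedMD/slurm):

git format-patch -1 2261d3939438e2291d1825a49037e892a68f8b14

That produces a single .patch file you can add to the spec file like your other local patches.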
Excellent, thank you. Will this fix be ported forward to 14.03, or is it not relevant there? We will be upgrading soon (probably in a week or two).
Yup, just noticed that the fix was ported forward to 14.03.4. We will wait till that version to grab it. Thanks a bunch.
Already ported up, it will be in 14.03.4.