Created attachment 831 [details] core dump of the invalid assoc_ptr failure.

We are running 2.6.9 with some patches. We had been running stably for quite some time, but this afternoon the master tipped over. After being unable to restart it, we managed to find out that these jobs were causing the problem:

[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7480837
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7485672
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=7485676
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=8096192
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=8426686
[2014-05-09T13:18:28.001] error: Invalid assoc_ptr for jobid=9521413

I had seen these errors before, but they had never caused a failure. However, for whatever reason this time the master died basically on startup. When I removed these jobs the scheduler started working again without any problems. I've attached the core dump so you can take a look.
Paul, could you please just send the backtrace? Without your complete build, the core file is of no use to us. A "thread apply all bt full" should be sufficient.
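Something along these lines run against the core should produce it (the slurmctld path and core file name below are just placeholders for wherever yours live):

gdb --batch -ex "thread apply all bt full" /usr/sbin/slurmctld /path/to/core > backtrace.txt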
Created attachment 832 [details] backtrace of failure
I've attached it. While trying to get it back up, I ran slurmctld straight from the command line and saw this error as well:

slurmctld: debug3: sched: JobId=9980436. State=PENDING. Reason=Resources. Priority=69888458. Partition=bigmem.
slurmctld: bitstring.c:182: bit_test: Assertion `(b) != ((void *)0)' failed.

As I said before, we are back and stable, so while this bug does cause the scheduler to crash hard, we are not in that state anymore.
Paul, could you send me the complete output from that last start? I need that debug3 logging turned on :) This doesn't appear to be related to associations, but to reservations.
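If you no longer have that output, rerunning the daemon in the foreground with extra verbosity should reproduce it; roughly something like this (the exact number of -v flags needed to reach debug3 may differ on your build):

slurmctld -D -vvvvv > slurmctld_debug.log 2>&1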
I am also guessing you modified that debug3 line; I don't see the Partition= portion in the main code. Would you mind sending me your job_scheduler.c file?
Created attachment 835 [details] debug log.
In your backtrace, in thread 1, could you run

print *job_ptr
print *job_ptr->resv_ptr
print *avail_node_bitmap

and send that output? Having your job_scheduler.c would be very handy as well.
Created attachment 836 [details] Here's our source RPM containing all of our build.
(gdb) thread 1
[Switching to thread 1 (Thread 0x2b4780100700 (LWP 1341))]#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
(gdb) print *job_ptr
No symbol "job_ptr" in current context.
(gdb) print *job_ptr->resv_ptr
No symbol "job_ptr" in current context.
(gdb) print *avail_node_bitmap
$1 = 1111704645
Could you do the same in frame 5?
(gdb) frame 5
#5  0x00000000004506c4 in schedule (job_limit=50) at job_scheduler.c:1035
1035	job_scheduler.c: No such file or directory.
	in job_scheduler.c
(gdb) print *job_ptr
$2 = {account = 0x1408728 "cluster_users", alias_list = 0x0, alloc_node = 0x1408708 "rclogin09", alloc_resp_port = 0, alloc_sid = 10625, array_job_id = 0, array_task_id = 65534, assoc_id = 3, assoc_ptr = 0x1300f98, batch_flag = 1, batch_host = 0x1408788 "moorcroft04", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, cr_enabled = 0, db_index = 10800089, derived_ec = 0, details = 0x1408548, direct_set_prio = 0, end_time = 0, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x1408748 "", gres_req = 0x1408768 "", gres_used = 0x0, group_id = 33118, job_id = 7485672, job_next = 0x0, job_resrcs = 0x14087c8, job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x14086c8 "4th_trial_prescribedPhenlast-USHa1", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, nodes_completing = 0x0, other_port = 0, partition = 0x14086a8 "moorcroft_6100", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x13bb958, pre_sus_time = 0, preempt_time = 0, priority = 69080342, priority_array = 0x0, prio_factors = 0x1408258, profile = 0, qos_id = 1, qos_ptr = 0x126b9f8, restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x14087a8, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 0, state_desc = 0x0, state_reason = 0, step_list = 0x132e668, suspend_time = 0, time_last_active = 1399654686, time_limit = 4294967295, time_min = 0, tot_sus_time = 0, total_cpus = 0, total_nodes = 1, user_id = 40746, wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->resv_ptr
Cannot access memory at address 0x0
(gdb) print *avail_node_bitmap
$3 = 1111704645
Thanks Paul, in the same frame could you give me

print *job_ptr->details
(gdb) print *job_ptr->details
$4 = {acctg_freq = 0x0, argc = 1, argv = 0x1408a38, begin_time = 1399652845, ckpt_dir = 0x12423e8 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast", contiguous = 0, cpu_bind = 0x0, cpu_bind_type = 0, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x14089e8, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 1, min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 1, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147485648, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, shared = 2, std_err = 0x0, std_in = 0x1408a18 "/dev/null", std_out = 0x14081d8 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast/USHa1/serial_lsf.out", submit_time = 1399652835, task_dist = 10, usable_nodes = 0, work_dir = 0x1408178 "/n/moorcroftfs2/kzhang/NorthA/Region/MCMC_Sites/4th_trial_prescribedPhenlast"}
Created attachment 838 [details] I believe this is the job_scheduler.c you are running

After applying the patches you sent, this appears to be what you are running. In frame 5 could you send me

print *assoc_ptr
print *assoc_ptr->usage
(gdb) print *assoc_ptr
value has been optimized out
(gdb) print *assoc_ptr->usage
value has been optimized out
What about this?

print *(slurmdb_association_rec_t *)job_ptr->assoc_ptr
(gdb) print *(slurmdb_association_rec_t *)job_ptr->assoc_ptr
$5 = {accounting_list = 0x0, acct = 0x1301068 "cluster_users", cluster = 0x1301088 "odyssey", def_qos_id = 0, grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, grp_cpus = 4294967295, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, grp_submit_jobs = 4294967295, grp_wall = 4294967295, id = 3, is_def = 0, lft = 1274, max_cpu_mins_pj = 4294967295, max_cpu_run_mins = 4294967295, max_cpus_pj = 4294967295, max_jobs = 10100, max_nodes_pj = 4294967295, max_submit_jobs = 4294967295, max_wall_pj = 4294967295, parent_acct = 0x13010a8 "root", parent_id = 1, partition = 0x0, qos_list = 0x12f9408, rgt = 8967, shares_raw = 100, uid = 4294967294, usage = 0x131e0d8, user = 0x0}
Could you also try the same thing with usage? I think

print *((slurmdb_association_rec_t *)job_ptr->assoc_ptr)->usage

will do it. Thanks.
Paul, it would appear this association isn't the correct association. It doesn't appear to be a user association. If you could find the user name of uid 40746 and send me this...

sacctmgr list assoc user=$USERNAME acct=cluster_users cluster=odyssey format=id

and

sacctmgr list assoc user='' acct=cluster_users cluster=odyssey format=id

My guess is the second one is 3, which would be incorrect. Not sure how that could have happened.
Paul, ok, I have a fix for you. It turns out it did have to do with the Invalid assoc_ptr message. Commit 2261d3939438e2291d1825a49037e892a68f8b14 should fix you up. Please reopen if you find otherwise.
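If you want to pick that change up on your current 2.6 build before you upgrade, one option is to pull it out of git as a patch and drop it into your source RPM next to your existing patches. A rough sketch, assuming a clone of the public slurm repository (github.com/SchedMD/slurm):

git format-patch -1 2261d3939438e2291d1825a49037e892a68f8b14

That produces a single .patch file you can add to the spec file like your other local patches.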
Excellent, thank you. Will this fix be ported forward to 14.03, or is it not relevant there? We will be upgrading soon (probably in a week or two).
Yup, just noticed that the fix was ported forward to 14.03.4. We will wait till that version to grab it. Thanks a bunch.
Already ported up, it will be in 14.03.4.