Ticket 7404 - Multiple DB issues: *** RESTART SLURMDBD NOW *** and others
Summary: Multiple DB issues: *** RESTART SLURMDBD NOW *** and others
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database (show other tickets)
Version: 18.08.7
Hardware: Linux Linux
: --- 2 - High Impact
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-07-12 09:08 MDT by pbisbal
Modified: 2019-07-15 08:07 MDT (History)
3 users (show)

See Also:
Site: Princeton Plasma Physics Laboratory (PPPL)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.8
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sdiag output (4.13 KB, text/plain)
2019-07-12 09:35 MDT, pbisbal
Details
slurm.conf file (9.47 KB, text/plain)
2019-07-12 09:54 MDT, pbisbal
Details
slurmdbd.conf file (744 bytes, text/plain)
2019-07-12 09:55 MDT, pbisbal
Details
gcore file (605.00 KB, application/x-bzip)
2019-07-12 09:56 MDT, pbisbal
Details
slurm messages from /var/log/messages (7.24 KB, application/x-gzip)
2019-07-12 11:06 MDT, pbisbal
Details
/var/log/slurm/slurmctld.log (205.00 KB, application/x-gzip)
2019-07-12 11:07 MDT, pbisbal
Details
/var/log/slurm/slurmdbd.log (5.79 KB, application/x-gzip)
2019-07-12 11:07 MDT, pbisbal
Details
mysql logs (5.02 KB, application/x-gzip)
2019-07-12 11:08 MDT, pbisbal
Details
mysqld logs after resetting the root password and restarting (5.15 KB, application/x-gzip)
2019-07-12 14:12 MDT, pbisbal
Details
slurmdbd..log file after restarting mysql and slurmdbd (21.20 KB, application/x-troff-man)
2019-07-12 14:31 MDT, pbisbal
Details
slurmctld logs after restarting mysql and slurmdbd (43.95 KB, application/x-troff-man)
2019-07-12 14:31 MDT, pbisbal
Details

Note You need to log in before you can comment on or make changes to this ticket.
Description pbisbal 2019-07-12 09:08:28 MDT
On June 18th, we upgraded Slurm from version 18.08.6 to 18.08.7. Yesterday, discovered that sreport that every day since then has had 100% utilization, which we know is not correct. For example: 

$ sreport -a -tMinPer cluster Utilization  Start=2019-06-18 End=2019-06-18
--------------------------------------------------------------------------------
Cluster Utilization 2019-06-18T00:00:00 - 2019-06-18T00:59:59
Usage reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Allocated              Down         PLND Down              Idle          Reserved          Reported 
--------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- 
test_pppl    289942(76.46%)       4800(1.27%)          0(0.00%)     84457(22.27%)          1(0.00%)   379200(100.00%) 


$ sreport -a -tMinPer cluster Utilization  Start=2019-06-19 End=2019-06-19
--------------------------------------------------------------------------------
Cluster Utilization 2019-06-19T00:00:00 - 2019-06-19T00:59:59
Usage reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Allocated              Down         PLND Down              Idle          Reserved          Reported 
--------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- 
test_pppl   417600(100.00%)          0(0.00%)          0(0.00%)          0(0.00%)          0(0.00%)   417600(100.00%) 


$ sreport -a -tMinPer cluster Utilization  Start=2019-07-11 End=2019-07-11
--------------------------------------------------------------------------------
Cluster Utilization 2019-07-11T00:00:00 - 2019-07-11T00:59:59
Usage reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Allocated              Down         PLND Down              Idle          Reserved          Reported 
--------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- 
test_pppl   413760(100.00%)          0(0.00%)          0(0.00%)          0(0.00%)          0(0.00%)   413760(100.00%) 


After conferring with my colleagues at Princeton University (Josko Plazonic and Bill Wichser), I ran 'sacctmgr show runawayjobs' and selected 'Y' at the prompt. However, that did not fix the issue (the sreport out above is from this morning). 

I stopped slurmctld, slurmdbd, and mysqld this morning and then restarted them in the correct order. I'm now getting errors like this: 

[2019-07-12T11:07:11.041] error: There is no reservation by id 38, time_start 1560852332, and cluster 'test_pppl'

and this: 

Jul 12 10:54:19 kay slurmctld: *** RESTART SLURMDBD NOW ***

And ideas? Please let me know what logging or configuration information I can provide you. 

Prentice
Comment 1 pbisbal 2019-07-12 09:09:53 MDT
I should also add that when I start slurmdbd, I can not stop it with 'service slurm stop'. I can only stop it with a kill -9 <PID>
Comment 2 Nate Rini 2019-07-12 09:14:43 MDT
(In reply to pbisbal from comment #0)
> Jul 12 10:54:19 kay slurmctld: *** RESTART SLURMDBD NOW ***

This is bad and needs to be corrected ASAP. Please restart your slurmdbd daemon immediately. Your system is either losing job records or will be soon.

Please call sdiag and attach the output.

Please attach your slurmctld and slurmdbd logs (post slurmdbd restart).

(In reply to pbisbal from comment #1)
> I should also add that when I start slurmdbd, I can not stop it with
> 'service slurm stop'. I can only stop it with a kill -9 <PID>
This is a great way to guarantee data loss.

Please avoid ever calling SIGKILL against any slurm daemon. Please call gcore (and provide 'thread apply all bt full' from gdb) against any daemon that refuses to stop and open a ticket instead.

If gcore fails, then there is likely a kernel issue outside of Slurm's control.
Comment 4 pbisbal 2019-07-12 09:35:48 MDT
Attached sdiag output working on restarting slurmdbd as requested and sending the logs for that.
Comment 5 Nate Rini 2019-07-12 09:39:30 MDT
(In reply to pbisbal from comment #3)
> Created attachment 10894 [details]
> sdiag output

slurmctld is unable to send records to slurmdbd:
> DBD Agent queue size: 21074

Luckily, that number is pretty low, so there may be no data loss yet.
Comment 6 Nate Rini 2019-07-12 09:40:34 MDT
Please also attach an updated copy of your slurm.conf.
Comment 12 pbisbal 2019-07-12 09:59:33 MDT
I've attached slurm.conf, slurmdbd.conf, and the gcore file created from running gcore on the slurmdbd process. I'm not really experienced with gdb and don't know exaclty how to analyze the file properly. Below is my attempt to analyze the file how I think you wanted me to do it. 

$ gdb /usr/sbin/slurmdbd gcore.slurmdbd.3521 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmdbd...done.
[New Thread 3532]
[New Thread 3553]
[New Thread 3555]
[New Thread 3521]
Reading symbols from /usr/lib64/slurm/libslurmfull.so...done.
Loaded symbols for /usr/lib64/slurm/libslurmfull.so
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/accounting_storage_mysql.so...done.
Loaded symbols for /usr/lib64/slurm/accounting_storage_mysql.so
Reading symbols from /usr/lib64/mysql/libmysqlclient_r.so.16...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/mysql/libmysqlclient_r.so.16
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Core was generated by `/usr/sbin/slurmdbd'.
#0  0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmdbd-18.08.7-1.el6.x86_64
(gdb) thread apply all bt full

Thread 4 (Thread 0x7f16e0e38700 (LWP 3521)):
#0  0x00000035822082fd in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000410f63 in main (argc=<value optimized out>, argv=0x7fff4bc618b8) at slurmdbd.c:275
        node_name_short = "kay\000\065\000\000\000\060\062\307K\377\177\000\000\320\027\306K\377\177\000\000\000\000\000\000\000\000\000\000\002", '\000' <repeats 23 times>"\250, *\302\201\065\000\000\000\260\027\306K\377\177\000\000\350\027\306K\377\177\000\000 '\302\201\065\000\000\000\000\030\306K\377\177\000\000\177U\335q\000\000\000\000j\242\240\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000"
        node_name_long = "kay.pppl.gov", '\000' <repeats 12 times>, "`\027\306K\377\177\000\000X1\307K\377\177\000\000\210\061\307K\377\177", '\000' <repeats 50 times>, " '\302\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000\000 '\302\201\065\000\000"
        db_conn = 0x2064f10
        assoc_init_arg = {cache_level = 38, enforce = 0, running_cache = 0x0, add_license_notify = 0, 
          resize_qos_notify = 0, remove_assoc_notify = 0, remove_license_notify = 0, remove_qos_notify = 0, 
          state_save_location = 0x0, sync_license_notify = 0, update_assoc_notify = 0, 
          update_cluster_tres = 0, update_license_notify = 0, update_qos_notify = 0, update_resvs = 0}
        __func__ = "main"

Thread 3 (Thread 0x7f16df418700 (LWP 3555)):
#0  0x000000358220e82d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000003583a57599 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient_r.so.16
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#2  0x0000003583a5874f in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#3  0x0000003583a58b55 in my_net_read () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#4  0x0000003583a5312a in cli_safe_read () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#5  0x0000003583a538a9 in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#6  0x0000003583a51c4c in mysql_real_query () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#7  0x00007f16e0808c14 in _mysql_query_internal (db_conn=0x7f16c8000970, 
    query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"...) at mysql_common.c:144
No locals.
#8  0x00007f16e080902f in mysql_db_query_ret (mysql_conn=0x7f16df417e00, 
    query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"..., last=false) at mysql_common.c:888
        result = 0x0
        __func__ = "mysql_db_query_ret"
#9  0x00007f16e07fe610 in _setup_cluster_usage (mysql_conn=0x7f16df417e00, 
---Type <return> to continue, or q <return> to quit---q
cluster_name=0x207d6b0 "test_pppl"Quit
(gdb) q
Comment 13 Nate Rini 2019-07-12 10:00:03 MDT
(In reply to pbisbal from comment #11)
> Created attachment 10898 [details]
> gcore file
> 
> File created by 'gcore -o gcore.slurmdbd <PID>. bzipped to save space.

Can you please load the gcore in gdb and call the following:
> gdb slurmdbd gcore.slurmdbd
> set pagination off
> thread apply all bt full

We can't load a core into gdb unless you attach all of your binaries and libraries that are linked by slurmd.
Comment 14 Nate Rini 2019-07-12 10:01:19 MDT
(In reply to pbisbal from comment #12)
> I've attached slurm.conf, slurmdbd.conf, and the gcore file created from
> running gcore on the slurmdbd process. I'm not really experienced with gdb
> and don't know exaclty how to analyze the file properly. Below is my attempt
> to analyze the file how I think you wanted me to do it. 
That works perfectly, I just needed the output for now.
Comment 15 Nate Rini 2019-07-12 10:02:12 MDT
(In reply to Nate Rini from comment #14)
> (In reply to pbisbal from comment #12)
> > I've attached slurm.conf, slurmdbd.conf, and the gcore file created from
> > running gcore on the slurmdbd process. I'm not really experienced with gdb
> > and don't know exaclty how to analyze the file properly. Below is my attempt
> > to analyze the file how I think you wanted me to do it. 
> That works perfectly, I just needed the output for now.

Correction: looks like the pages got lost. If you could please make sure to attach the other pages or use 'set pagination off'.
Comment 16 pbisbal 2019-07-12 10:03:49 MDT
Sure. Here ya go: 

$ gdb slurmdbd gcore.slurmdbd.3521 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmdbd...done.
[New Thread 3532]
[New Thread 3553]
[New Thread 3555]
[New Thread 3521]
Reading symbols from /usr/lib64/slurm/libslurmfull.so...done.
Loaded symbols for /usr/lib64/slurm/libslurmfull.so
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/accounting_storage_mysql.so...done.
Loaded symbols for /usr/lib64/slurm/accounting_storage_mysql.so
Reading symbols from /usr/lib64/mysql/libmysqlclient_r.so.16...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/mysql/libmysqlclient_r.so.16
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Core was generated by `/usr/sbin/slurmdbd'.
#0  0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmdbd-18.08.7-1.el6.x86_64
(gdb) set pagination off 
(gdb) thread apply all bt full

Thread 4 (Thread 0x7f16e0e38700 (LWP 3521)):
#0  0x00000035822082fd in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000410f63 in main (argc=<value optimized out>, argv=0x7fff4bc618b8) at slurmdbd.c:275
        node_name_short = "kay\000\065\000\000\000\060\062\307K\377\177\000\000\320\027\306K\377\177\000\000\000\000\000\000\000\000\000\000\002", '\000' <repeats 23 times>"\250, *\302\201\065\000\000\000\260\027\306K\377\177\000\000\350\027\306K\377\177\000\000 '\302\201\065\000\000\000\000\030\306K\377\177\000\000\177U\335q\000\000\000\000j\242\240\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000"
        node_name_long = "kay.pppl.gov", '\000' <repeats 12 times>, "`\027\306K\377\177\000\000X1\307K\377\177\000\000\210\061\307K\377\177", '\000' <repeats 50 times>, " '\302\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000\000 '\302\201\065\000\000"
        db_conn = 0x2064f10
        assoc_init_arg = {cache_level = 38, enforce = 0, running_cache = 0x0, add_license_notify = 0, resize_qos_notify = 0, remove_assoc_notify = 0, remove_license_notify = 0, remove_qos_notify = 0, state_save_location = 0x0, sync_license_notify = 0, update_assoc_notify = 0, update_cluster_tres = 0, update_license_notify = 0, update_qos_notify = 0, update_resvs = 0}
        __func__ = "main"

Thread 3 (Thread 0x7f16df418700 (LWP 3555)):
#0  0x000000358220e82d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000003583a57599 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#2  0x0000003583a5874f in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#3  0x0000003583a58b55 in my_net_read () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#4  0x0000003583a5312a in cli_safe_read () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#5  0x0000003583a538a9 in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#6  0x0000003583a51c4c in mysql_real_query () from /usr/lib64/mysql/libmysqlclient_r.so.16
No symbol table info available.
#7  0x00007f16e0808c14 in _mysql_query_internal (db_conn=0x7f16c8000970, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"...) at mysql_common.c:144
No locals.
#8  0x00007f16e080902f in mysql_db_query_ret (mysql_conn=0x7f16df417e00, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"..., last=false) at mysql_common.c:888
        result = 0x0
        __func__ = "mysql_db_query_ret"
#9  0x00007f16e07fe610 in _setup_cluster_usage (mysql_conn=0x7f16df417e00, cluster_name=0x207d6b0 "test_pppl", start=<value optimized out>, end=1562940000, archive_data=1) at as_mysql_rollup.c:826
        c_usage = 0x0
        query = 0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"...
        loc_c_usage = <value optimized out>
        result = 0x0
        row = <value optimized out>
        i = 5
        d_itr = 0x0
        event_req_inx = {0x7f16e080d2bc "node_name", 0x7f16e081ae58 "time_start", 0x7f16e081ae38 "time_end", 0x7f16e0819142 "state", 0x7f16e081ae76 "tres"}
        event_str = 0x0
#10 as_mysql_hourly_rollup (mysql_conn=0x7f16df417e00, cluster_name=0x207d6b0 "test_pppl", start=<value optimized out>, end=1562940000, archive_data=1) at as_mysql_rollup.c:1076
        last_id = -1
        last_wckeyid = -1
        rc = <value optimized out>
        i = <value optimized out>
        now = 1562943253
        curr_start = 64623600
        curr_end = 64627200
        query = 0x0
        result = <value optimized out>
        row = <value optimized out>
        a_itr = 0x207d770
        c_itr = 0x207d790
        w_itr = 0x207d7b0
        r_itr = 0x207d7d0
        assoc_usage_list = 0x21184e0
        cluster_down_list = 0x2118530
        wckey_usage_list = 0x2118580
        resv_usage_list = 0x21185d0
        track_wckey = 0
        loc_c_usage = <value optimized out>
        c_usage = 0x0
        r_usage = 0x0
        a_usage = 0x0
        w_usage = 0x0
        job_req_inx = {0x7f16e081afa1 "job.job_db_inx", 0x7f16e081afb0 "job.id_assoc", 0x7f16e081afbd "job.id_wckey", 0x7f16e081afca "job.array_task_pending", 0x7f16e081afe1 "job.time_eligible", 0x7f16e081aff3 "job.time_start", 0x7f16e081b002 "job.time_end", 0x7f16e081b00f "job.time_suspended", 0x7f16e081b022 "job.cpus_req", 0x7f16e081b02f "job.id_resv", 0x7f16e081b03b "job.tres_alloc"}
        job_str = 0x7f16c8007140 "job.job_db_inx, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_req, job.id_resv, job.tres_alloc"
        suspend_str = 0x7f16c80072a0 "time_start, time_end"
        resv_req_inx = {0x7f16e08190ac "id_resv", 0x7f16e081ade3 "assoclist", 0x7f16e0816082 "flags", 0x7f16e081ae76 "tres", 0x7f16e081ae58 "time_start", 0x7f16e081ae38 "time_end", 0x7f16e080d282 "unused_wall"}
        resv_str = 0x7f16c8007310 "id_resv, assoclist, flags, tres, time_start, time_end, unused_wall"
        __func__ = "as_mysql_hourly_rollup"
#11 0x00007f16e08027b7 in _cluster_rollup_usage (arg=<value optimized out>) at as_mysql_usage.c:292
        local_rollup = 0x7f16d0000bd0
        i = <value optimized out>
        rc = 0
        timer_str = '\000' <repeats 127 times>
        mysql_conn = {cluster_deleted = false, cluster_name = 0x0, db_conn = 0x7f16c8000970, lock = {__data = {__lock = 1, __count = 0, __owner = 3555, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000\343\r\000\000\001", '\000' <repeats 26 times>, __align = 1}, pre_commit_query = 0x0, rollback = true, update_list = 0x0, conn = 0}
        result = <value optimized out>
        row = <value optimized out>
        query = 0x0
        start_tm = {tm_sec = 0, tm_min = 0, tm_hour = 0, tm_mday = 1, tm_mon = 11, tm_year = 69, tm_wday = 1, tm_yday = 334, tm_isdst = 0, tm_gmtoff = -18000, tm_zone = 0x2061180 "EST"}
        end_tm = {tm_sec = 0, tm_min = 0, tm_hour = 0, tm_mday = 1, tm_mon = 6, tm_year = 119, tm_wday = 1, tm_yday = 181, tm_isdst = 1, tm_gmtoff = -14400, tm_zone = 0x2064740 "EDT"}
        my_time = 1562943253
        last_hour = 0
        last_day = 0
        last_month = 0
        hour_start = 0
        hour_end = 1562940000
        day_start = -68400
        day_end = 1562904000
        month_start = -2660400
        month_end = 1561953600
        rollup_time = {0, 0, 0}
        tv1 = {tv_sec = 1562943253, tv_usec = 380669}
        tv2 = {tv_sec = 0, tv_usec = 0}
        tv_str = '\000' <repeats 19 times>
        delta_t = <value optimized out>
        update_req_inx = {0x7f16e080ccf6 "hourly_rollup", 0x7f16e080cd04 "daily_rollup", 0x7f16e080cd11 "monthly_rollup"}
        __func__ = "_cluster_rollup_usage"
#12 0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x0000003581ee8c4d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7f16df61a700 (LWP 3553)):
#0  0x000000358220b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f16e080157e in as_mysql_roll_usage (mysql_conn=<value optimized out>, sent_start=0, sent_end=0, archive_data=1, rollup_stats=0x7f16df619e60) at as_mysql_usage.c:977
        err = <value optimized out>
        rc = 0
        rolledup = 1
        roll_started = 2
        cluster_name = <value optimized out>
        itr = 0x207d770
        rolledup_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}
        rolledup_cond = {__data = {__lock = 0, __futex = 3, __total_seq = 2, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x7f16df619d70, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000p\235a\337\026\177\000\000\002\000\000\000\000\000\000", __align = 12884901888}
        __func__ = "as_mysql_roll_usage"
#2  0x00007f16e07cb464 in acct_storage_p_roll_usage (mysql_conn=<value optimized out>, sent_start=<value optimized out>, sent_end=<value optimized out>, archive_data=<value optimized out>, rollup_stats=<value optimized out>) at accounting_storage_mysql.c:3150
No locals.
#3  0x0000003c370c3afb in acct_storage_g_roll_usage (db_conn=0x2064f10, sent_start=0, sent_end=0, archive_data=1, rollup_stats=0x7f16df619e60) at slurm_accounting_storage.c:788
No locals.
#4  0x0000000000411373 in _rollup_handler (db_conn=0x2064f10) at slurmdbd.c:707
        start_time = 1562943253
        next_time = <value optimized out>
        tm = {tm_sec = 13, tm_min = 54, tm_hour = 10, tm_mday = 12, tm_mon = 6, tm_year = 119, tm_wday = 5, tm_yday = 192, tm_isdst = 1, tm_gmtoff = -14400, tm_zone = 0x2064740 "EDT"}
        rollup_stats = {rollup_time = {0, 0, 0}}
        __func__ = "_rollup_handler"
#5  0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x0000003581ee8c4d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f16df91d700 (LWP 3532)):
#0  0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000035822095d8 in _L_lock_854 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00000035822094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000000000040f975 in _rollup_handler_cancel () at slurmdbd.c:673
        err = <value optimized out>
#4  shutdown_threads () at slurmdbd.c:343
No locals.
#5  0x000000000040fed7 in _signal_handler (no_data=<value optimized out>) at slurmdbd.c:864
        rc = <value optimized out>
        sig = 15
        sig_array = {2, 15, 1, 6, 12, 0}
        set = {__val = {18467, 0 <repeats 15 times>}}
#6  0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x0000003581ee8c4d in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) 


I don't have debug versions of slurm and all the libraries installed. I hope the output above is still helpful to you.
Comment 17 Nate Rini 2019-07-12 10:16:46 MDT
(In reply to pbisbal from comment #16)
> Sure. Here ya go: 
Looks like slurmdbd is waiting on mysql.

Can you please provide the mysql, slurmctld and slurmdbd logs.

> I don't have debug versions of slurm and all the libraries installed. I hope
> the output above is still helpful to you.
This is sufficient.

Can you please call the following in mysql:
> SHOW FULL PROCESSLIST\G
> SHOW ENGINE INNODB STATUS
Comment 22 pbisbal 2019-07-12 11:13:43 MDT
mysql> SHOW FULL PROCESSLIST\G
*************************** 1. row ***************************
     Id: 3
   User: slurm
   Host: localhost
     db: slurm_acct_db
Command: Sleep
   Time: 8348
  State: 
   Info: NULL
*************************** 2. row ***************************
     Id: 5
   User: slurm
   Host: localhost
     db: slurm_acct_db
Command: Query
   Time: 0
  State: Sorting result
   Info: select node_name, time_start, time_end, state, tres from "test_pppl_event_table" where !(state & 32768) && (time_start < 183783600 && (time_end >= 183780000 || time_end = 0)) order by node_name, time_start
*************************** 3. row ***************************
     Id: 16
   User: slurm
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: SHOW FULL PROCESSLIST
3 rows in set (0.00 sec)
Comment 23 pbisbal 2019-07-12 11:22:44 MDT
mysql> SHOW ENGINE INNODB STATUS\G
ERROR 1227 (42000): Access denied; you need the PROCESS privilege for this operation

Looks like the slurm user doesn't have the privileges to run that command, and I don't know the root password for MySQL. I could reset it, but that would mean stopping the MySQL server right now, and I'm not sure if that's what you want me to do right now.
Comment 24 Broderick Gardner 2019-07-12 13:41:09 MDT
(Nate is temporarily unavailable, so I will step in for now)

We need that mysql information, so I recommend stopping the slurmdbd if you can (if it's hung and ignoring SIGTERM, ignore it for now). Reset the mysql root password and restart mysql. 

Then restart slurmdbd, KILLing the old hung one if necessary. Let me know what you do and what the result is. If the slurmdbd hangs again, then give me the mysql queries from before:
mysql> SHOW FULL PROCESSLIST\G
mysql> SHOW ENGINE INNODB STATUS\G

If after the mysql restart the slurmdbd does not hang, watch the slurmdbd agent queue in `sdiag` until it reaches zero.

Let me know if you have any questions
Comment 25 Broderick Gardner 2019-07-12 13:43:23 MDT
Also, what is your mysql/mariadb version?
Comment 26 Broderick Gardner 2019-07-12 13:47:35 MDT
Also resend the slurmdbd.log and slurmctld.log if the slurmdbd hangs again after mysql restart.
Comment 27 pbisbal 2019-07-12 14:03:13 MDT
> Then restart slurmdbd, KILLing the old hung one if necessary. 

So I Can use a 'kill -9 <PID>' if necessary?
Comment 28 pbisbal 2019-07-12 14:06:05 MDT
kill <PID> hasn't been able to kill slurmdbd. I stopped mysql and reset the root password, but left slurmdbd running for now. Here is the mysql information you requested: 

mysql> SHOW FULL PROCESSLIST\G
*************************** 1. row ***************************
     Id: 3
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: SHOW FULL PROCESSLIST
1 row in set (0.00 sec)

mysql> SHOW ENGINE INNODB STATUS\G
*************************** 1. row ***************************
  Type: InnoDB
  Name: 
Status: 
=====================================
190712 16:03:46 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 38 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 4, signal count 4
Mutex spin waits 0, rounds 20, OS waits 1
RW-shared spins 4, OS waits 2; RW-excl spins 1, OS waits 1
------------
TRANSACTIONS
------------
Trx id counter 0 9069824
Purge done for trx's n:o < 0 9069190 undo n:o < 0 0
History list length 1
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 22143, OS thread id 139921109796608
MySQL thread id 3, query id 6 localhost root
SHOW ENGINE INNODB STATUS
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
28 OS file reads, 3 OS file writes, 3 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 126, seg size 128,
0 inserts, 0 merged recs, 0 merges
Hash table size 17393, node heap has 0 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 1 1649626360
Log flushed up to   1 1649626360
Last checkpoint at  1 1649626360
0 pending log writes, 0 pending chkp writes
8 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 20358952; in additional pool allocated 653824
Dictionary memory allocated 33320
Buffer pool size   512
Free buffers       490
Database pages     22
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 22, created 0, written 0
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
Main thread process no. 22143, id 139920938329856, state: waiting for server activity
Number of rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

1 row in set (0.00 sec)

mysql> 


I'll attach the new log files next.
Comment 29 pbisbal 2019-07-12 14:07:00 MDT
MySQL version is 5.1.73: 

$ rpm -qa | grep mysql 
mysql-devel-5.1.73-8.el6_8.x86_64
mysql-5.1.73-8.el6_8.x86_64
mysql-libs-5.1.73-8.el6_8.x86_64
mysql-server-5.1.73-8.el6_8.x86_64
Comment 31 Broderick Gardner 2019-07-12 14:14:52 MDT
(In reply to pbisbal from comment #27)
> > Then restart slurmdbd, KILLing the old hung one if necessary. 
> 
> So I Can use a 'kill -9 <PID>' if necessary?

In this case, yes. It's not a good action until we have a backtrace and have tried other methods. As Nate said, it is a good way to lose data.
Comment 34 pbisbal 2019-07-12 14:32:54 MDT
Latest log files uploaded. Had to use 'kill -9' to stop slurmdbd.
Comment 35 Nate Rini 2019-07-12 14:41:51 MDT
(In reply to pbisbal from comment #20)
> Created attachment 10902 [details]
> /var/log/slurm/slurmdbd.log

Looks like this is a duplicate of bug#7155. Please apply this patch:
https://github.com/SchedMD/slurm/commit/89910fd50a3120bd1d8b29fcf6d543e53dbfe732
Comment 36 pbisbal 2019-07-12 14:46:58 MDT
Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? 

Prentice
Comment 37 Nate Rini 2019-07-12 14:58:02 MDT
(In reply to pbisbal from comment #36)
> Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? 

It is currently only in 19.05.1. It should cleanly patch into 18.08 though.
Comment 38 Nate Rini 2019-07-12 17:58:40 MDT
(In reply to Nate Rini from comment #37)
> (In reply to pbisbal from comment #36)
> > Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? 
> 
> It is currently only in 19.05.1. It should cleanly patch into 18.08 though.

Were you able to apply the patch? Do you need instructions?
Comment 39 pbisbal 2019-07-15 08:06:19 MDT
Yes, I was able to apply the patch to 18.08.8, build a new set of RPMs, and get them installed, and it appears to have fixed my issues. 

I started the build Friday evening, but I had to leave before the build was complete, and wasn't able to apply the patch until Saturday. 

Thanks so much for all your help with this. 

Prentice
Comment 40 pbisbal 2019-07-15 08:07:06 MDT
That last comment should read "wasn't able to apply the patched RPMs until Saturday"