On June 18th, we upgraded Slurm from version 18.08.6 to 18.08.7. Yesterday, discovered that sreport that every day since then has had 100% utilization, which we know is not correct. For example: $ sreport -a -tMinPer cluster Utilization Start=2019-06-18 End=2019-06-18 -------------------------------------------------------------------------------- Cluster Utilization 2019-06-18T00:00:00 - 2019-06-18T00:59:59 Usage reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- test_pppl 289942(76.46%) 4800(1.27%) 0(0.00%) 84457(22.27%) 1(0.00%) 379200(100.00%) $ sreport -a -tMinPer cluster Utilization Start=2019-06-19 End=2019-06-19 -------------------------------------------------------------------------------- Cluster Utilization 2019-06-19T00:00:00 - 2019-06-19T00:59:59 Usage reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- test_pppl 417600(100.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 417600(100.00%) $ sreport -a -tMinPer cluster Utilization Start=2019-07-11 End=2019-07-11 -------------------------------------------------------------------------------- Cluster Utilization 2019-07-11T00:00:00 - 2019-07-11T00:59:59 Usage reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ----------------- ----------------- ----------------- ----------------- ----------------- ----------------- test_pppl 413760(100.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 413760(100.00%) After conferring with my colleagues at Princeton University (Josko Plazonic and Bill Wichser), I ran 'sacctmgr show runawayjobs' and selected 'Y' at the prompt. However, that did not fix the issue (the sreport out above is from this morning). I stopped slurmctld, slurmdbd, and mysqld this morning and then restarted them in the correct order. I'm now getting errors like this: [2019-07-12T11:07:11.041] error: There is no reservation by id 38, time_start 1560852332, and cluster 'test_pppl' and this: Jul 12 10:54:19 kay slurmctld: *** RESTART SLURMDBD NOW *** And ideas? Please let me know what logging or configuration information I can provide you. Prentice
I should also add that when I start slurmdbd, I can not stop it with 'service slurm stop'. I can only stop it with a kill -9 <PID>
(In reply to pbisbal from comment #0) > Jul 12 10:54:19 kay slurmctld: *** RESTART SLURMDBD NOW *** This is bad and needs to be corrected ASAP. Please restart your slurmdbd daemon immediately. Your system is either losing job records or will be soon. Please call sdiag and attach the output. Please attach your slurmctld and slurmdbd logs (post slurmdbd restart). (In reply to pbisbal from comment #1) > I should also add that when I start slurmdbd, I can not stop it with > 'service slurm stop'. I can only stop it with a kill -9 <PID> This is a great way to guarantee data loss. Please avoid ever calling SIGKILL against any slurm daemon. Please call gcore (and provide 'thread apply all bt full' from gdb) against any daemon that refuses to stop and open a ticket instead. If gcore fails, then there is likely a kernel issue outside of Slurm's control.
Attached sdiag output working on restarting slurmdbd as requested and sending the logs for that.
(In reply to pbisbal from comment #3) > Created attachment 10894 [details] > sdiag output slurmctld is unable to send records to slurmdbd: > DBD Agent queue size: 21074 Luckily, that number is pretty low, so there may be no data loss yet.
Please also attach an updated copy of your slurm.conf.
I've attached slurm.conf, slurmdbd.conf, and the gcore file created from running gcore on the slurmdbd process. I'm not really experienced with gdb and don't know exaclty how to analyze the file properly. Below is my attempt to analyze the file how I think you wanted me to do it. $ gdb /usr/sbin/slurmdbd gcore.slurmdbd.3521 GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmdbd...done. [New Thread 3532] [New Thread 3553] [New Thread 3555] [New Thread 3521] Reading symbols from /usr/lib64/slurm/libslurmfull.so...done. Loaded symbols for /usr/lib64/slurm/libslurmfull.so Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /usr/lib64/slurm/auth_munge.so...done. Loaded symbols for /usr/lib64/slurm/auth_munge.so Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /usr/lib64/slurm/accounting_storage_mysql.so...done. Loaded symbols for /usr/lib64/slurm/accounting_storage_mysql.so Reading symbols from /usr/lib64/mysql/libmysqlclient_r.so.16...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/mysql/libmysqlclient_r.so.16 Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libcrypt.so.1 Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libnsl.so.1 Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libssl.so.10 Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libcrypto.so.10 Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libfreebl3.so Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libgssapi_krb5.so.2 Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5.so.3 Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libk5crypto.so.3 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5support.so.0 Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libgcc_s.so.1 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Core was generated by `/usr/sbin/slurmdbd'. #0 0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install slurm-slurmdbd-18.08.7-1.el6.x86_64 (gdb) thread apply all bt full Thread 4 (Thread 0x7f16e0e38700 (LWP 3521)): #0 0x00000035822082fd in pthread_join () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000410f63 in main (argc=<value optimized out>, argv=0x7fff4bc618b8) at slurmdbd.c:275 node_name_short = "kay\000\065\000\000\000\060\062\307K\377\177\000\000\320\027\306K\377\177\000\000\000\000\000\000\000\000\000\000\002", '\000' <repeats 23 times>"\250, *\302\201\065\000\000\000\260\027\306K\377\177\000\000\350\027\306K\377\177\000\000 '\302\201\065\000\000\000\000\030\306K\377\177\000\000\177U\335q\000\000\000\000j\242\240\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000" node_name_long = "kay.pppl.gov", '\000' <repeats 12 times>, "`\027\306K\377\177\000\000X1\307K\377\177\000\000\210\061\307K\377\177", '\000' <repeats 50 times>, " '\302\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000\000 '\302\201\065\000\000" db_conn = 0x2064f10 assoc_init_arg = {cache_level = 38, enforce = 0, running_cache = 0x0, add_license_notify = 0, resize_qos_notify = 0, remove_assoc_notify = 0, remove_license_notify = 0, remove_qos_notify = 0, state_save_location = 0x0, sync_license_notify = 0, update_assoc_notify = 0, update_cluster_tres = 0, update_license_notify = 0, update_qos_notify = 0, update_resvs = 0} __func__ = "main" Thread 3 (Thread 0x7f16df418700 (LWP 3555)): #0 0x000000358220e82d in read () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000003583a57599 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient_r.so.16 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. #2 0x0000003583a5874f in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #3 0x0000003583a58b55 in my_net_read () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #4 0x0000003583a5312a in cli_safe_read () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #5 0x0000003583a538a9 in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #6 0x0000003583a51c4c in mysql_real_query () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #7 0x00007f16e0808c14 in _mysql_query_internal (db_conn=0x7f16c8000970, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"...) at mysql_common.c:144 No locals. #8 0x00007f16e080902f in mysql_db_query_ret (mysql_conn=0x7f16df417e00, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"..., last=false) at mysql_common.c:888 result = 0x0 __func__ = "mysql_db_query_ret" #9 0x00007f16e07fe610 in _setup_cluster_usage (mysql_conn=0x7f16df417e00, ---Type <return> to continue, or q <return> to quit---q cluster_name=0x207d6b0 "test_pppl"Quit (gdb) q
(In reply to pbisbal from comment #11) > Created attachment 10898 [details] > gcore file > > File created by 'gcore -o gcore.slurmdbd <PID>. bzipped to save space. Can you please load the gcore in gdb and call the following: > gdb slurmdbd gcore.slurmdbd > set pagination off > thread apply all bt full We can't load a core into gdb unless you attach all of your binaries and libraries that are linked by slurmd.
(In reply to pbisbal from comment #12) > I've attached slurm.conf, slurmdbd.conf, and the gcore file created from > running gcore on the slurmdbd process. I'm not really experienced with gdb > and don't know exaclty how to analyze the file properly. Below is my attempt > to analyze the file how I think you wanted me to do it. That works perfectly, I just needed the output for now.
(In reply to Nate Rini from comment #14) > (In reply to pbisbal from comment #12) > > I've attached slurm.conf, slurmdbd.conf, and the gcore file created from > > running gcore on the slurmdbd process. I'm not really experienced with gdb > > and don't know exaclty how to analyze the file properly. Below is my attempt > > to analyze the file how I think you wanted me to do it. > That works perfectly, I just needed the output for now. Correction: looks like the pages got lost. If you could please make sure to attach the other pages or use 'set pagination off'.
Sure. Here ya go: $ gdb slurmdbd gcore.slurmdbd.3521 GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmdbd...done. [New Thread 3532] [New Thread 3553] [New Thread 3555] [New Thread 3521] Reading symbols from /usr/lib64/slurm/libslurmfull.so...done. Loaded symbols for /usr/lib64/slurm/libslurmfull.so Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /usr/lib64/slurm/auth_munge.so...done. Loaded symbols for /usr/lib64/slurm/auth_munge.so Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /usr/lib64/slurm/accounting_storage_mysql.so...done. Loaded symbols for /usr/lib64/slurm/accounting_storage_mysql.so Reading symbols from /usr/lib64/mysql/libmysqlclient_r.so.16...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/mysql/libmysqlclient_r.so.16 Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libcrypt.so.1 Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libnsl.so.1 Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libssl.so.10 Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libcrypto.so.10 Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libfreebl3.so Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libgssapi_krb5.so.2 Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5.so.3 Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libk5crypto.so.3 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5support.so.0 Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libgcc_s.so.1 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Core was generated by `/usr/sbin/slurmdbd'. #0 0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install slurm-slurmdbd-18.08.7-1.el6.x86_64 (gdb) set pagination off (gdb) thread apply all bt full Thread 4 (Thread 0x7f16e0e38700 (LWP 3521)): #0 0x00000035822082fd in pthread_join () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000410f63 in main (argc=<value optimized out>, argv=0x7fff4bc618b8) at slurmdbd.c:275 node_name_short = "kay\000\065\000\000\000\060\062\307K\377\177\000\000\320\027\306K\377\177\000\000\000\000\000\000\000\000\000\000\002", '\000' <repeats 23 times>"\250, *\302\201\065\000\000\000\260\027\306K\377\177\000\000\350\027\306K\377\177\000\000 '\302\201\065\000\000\000\000\030\306K\377\177\000\000\177U\335q\000\000\000\000j\242\240\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000" node_name_long = "kay.pppl.gov", '\000' <repeats 12 times>, "`\027\306K\377\177\000\000X1\307K\377\177\000\000\210\061\307K\377\177", '\000' <repeats 50 times>, " '\302\201\065", '\000' <repeats 12 times>, "\030\306K\377\177\000\000 '\302\201\065\000\000" db_conn = 0x2064f10 assoc_init_arg = {cache_level = 38, enforce = 0, running_cache = 0x0, add_license_notify = 0, resize_qos_notify = 0, remove_assoc_notify = 0, remove_license_notify = 0, remove_qos_notify = 0, state_save_location = 0x0, sync_license_notify = 0, update_assoc_notify = 0, update_cluster_tres = 0, update_license_notify = 0, update_qos_notify = 0, update_resvs = 0} __func__ = "main" Thread 3 (Thread 0x7f16df418700 (LWP 3555)): #0 0x000000358220e82d in read () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000003583a57599 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #2 0x0000003583a5874f in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #3 0x0000003583a58b55 in my_net_read () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #4 0x0000003583a5312a in cli_safe_read () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #5 0x0000003583a538a9 in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #6 0x0000003583a51c4c in mysql_real_query () from /usr/lib64/mysql/libmysqlclient_r.so.16 No symbol table info available. #7 0x00007f16e0808c14 in _mysql_query_internal (db_conn=0x7f16c8000970, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"...) at mysql_common.c:144 No locals. #8 0x00007f16e080902f in mysql_db_query_ret (mysql_conn=0x7f16df417e00, query=0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"..., last=false) at mysql_common.c:888 result = 0x0 __func__ = "mysql_db_query_ret" #9 0x00007f16e07fe610 in _setup_cluster_usage (mysql_conn=0x7f16df417e00, cluster_name=0x207d6b0 "test_pppl", start=<value optimized out>, end=1562940000, archive_data=1) at as_mysql_rollup.c:826 c_usage = 0x0 query = 0x7f16c80073e0 "select node_name, time_start, time_end, state, tres from \"test_pppl_event_table\" where !(state & 32768) && (time_start < 64627200 && (time_end >= 64623600 || time_end = 0)) order by node_name, time_st"... loc_c_usage = <value optimized out> result = 0x0 row = <value optimized out> i = 5 d_itr = 0x0 event_req_inx = {0x7f16e080d2bc "node_name", 0x7f16e081ae58 "time_start", 0x7f16e081ae38 "time_end", 0x7f16e0819142 "state", 0x7f16e081ae76 "tres"} event_str = 0x0 #10 as_mysql_hourly_rollup (mysql_conn=0x7f16df417e00, cluster_name=0x207d6b0 "test_pppl", start=<value optimized out>, end=1562940000, archive_data=1) at as_mysql_rollup.c:1076 last_id = -1 last_wckeyid = -1 rc = <value optimized out> i = <value optimized out> now = 1562943253 curr_start = 64623600 curr_end = 64627200 query = 0x0 result = <value optimized out> row = <value optimized out> a_itr = 0x207d770 c_itr = 0x207d790 w_itr = 0x207d7b0 r_itr = 0x207d7d0 assoc_usage_list = 0x21184e0 cluster_down_list = 0x2118530 wckey_usage_list = 0x2118580 resv_usage_list = 0x21185d0 track_wckey = 0 loc_c_usage = <value optimized out> c_usage = 0x0 r_usage = 0x0 a_usage = 0x0 w_usage = 0x0 job_req_inx = {0x7f16e081afa1 "job.job_db_inx", 0x7f16e081afb0 "job.id_assoc", 0x7f16e081afbd "job.id_wckey", 0x7f16e081afca "job.array_task_pending", 0x7f16e081afe1 "job.time_eligible", 0x7f16e081aff3 "job.time_start", 0x7f16e081b002 "job.time_end", 0x7f16e081b00f "job.time_suspended", 0x7f16e081b022 "job.cpus_req", 0x7f16e081b02f "job.id_resv", 0x7f16e081b03b "job.tres_alloc"} job_str = 0x7f16c8007140 "job.job_db_inx, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_req, job.id_resv, job.tres_alloc" suspend_str = 0x7f16c80072a0 "time_start, time_end" resv_req_inx = {0x7f16e08190ac "id_resv", 0x7f16e081ade3 "assoclist", 0x7f16e0816082 "flags", 0x7f16e081ae76 "tres", 0x7f16e081ae58 "time_start", 0x7f16e081ae38 "time_end", 0x7f16e080d282 "unused_wall"} resv_str = 0x7f16c8007310 "id_resv, assoclist, flags, tres, time_start, time_end, unused_wall" __func__ = "as_mysql_hourly_rollup" #11 0x00007f16e08027b7 in _cluster_rollup_usage (arg=<value optimized out>) at as_mysql_usage.c:292 local_rollup = 0x7f16d0000bd0 i = <value optimized out> rc = 0 timer_str = '\000' <repeats 127 times> mysql_conn = {cluster_deleted = false, cluster_name = 0x0, db_conn = 0x7f16c8000970, lock = {__data = {__lock = 1, __count = 0, __owner = 3555, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000\343\r\000\000\001", '\000' <repeats 26 times>, __align = 1}, pre_commit_query = 0x0, rollback = true, update_list = 0x0, conn = 0} result = <value optimized out> row = <value optimized out> query = 0x0 start_tm = {tm_sec = 0, tm_min = 0, tm_hour = 0, tm_mday = 1, tm_mon = 11, tm_year = 69, tm_wday = 1, tm_yday = 334, tm_isdst = 0, tm_gmtoff = -18000, tm_zone = 0x2061180 "EST"} end_tm = {tm_sec = 0, tm_min = 0, tm_hour = 0, tm_mday = 1, tm_mon = 6, tm_year = 119, tm_wday = 1, tm_yday = 181, tm_isdst = 1, tm_gmtoff = -14400, tm_zone = 0x2064740 "EDT"} my_time = 1562943253 last_hour = 0 last_day = 0 last_month = 0 hour_start = 0 hour_end = 1562940000 day_start = -68400 day_end = 1562904000 month_start = -2660400 month_end = 1561953600 rollup_time = {0, 0, 0} tv1 = {tv_sec = 1562943253, tv_usec = 380669} tv2 = {tv_sec = 0, tv_usec = 0} tv_str = '\000' <repeats 19 times> delta_t = <value optimized out> update_req_inx = {0x7f16e080ccf6 "hourly_rollup", 0x7f16e080cd04 "daily_rollup", 0x7f16e080cd11 "monthly_rollup"} __func__ = "_cluster_rollup_usage" #12 0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #13 0x0000003581ee8c4d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x7f16df61a700 (LWP 3553)): #0 0x000000358220b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007f16e080157e in as_mysql_roll_usage (mysql_conn=<value optimized out>, sent_start=0, sent_end=0, archive_data=1, rollup_stats=0x7f16df619e60) at as_mysql_usage.c:977 err = <value optimized out> rc = 0 rolledup = 1 roll_started = 2 cluster_name = <value optimized out> itr = 0x207d770 rolledup_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0} rolledup_cond = {__data = {__lock = 0, __futex = 3, __total_seq = 2, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x7f16df619d70, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000p\235a\337\026\177\000\000\002\000\000\000\000\000\000", __align = 12884901888} __func__ = "as_mysql_roll_usage" #2 0x00007f16e07cb464 in acct_storage_p_roll_usage (mysql_conn=<value optimized out>, sent_start=<value optimized out>, sent_end=<value optimized out>, archive_data=<value optimized out>, rollup_stats=<value optimized out>) at accounting_storage_mysql.c:3150 No locals. #3 0x0000003c370c3afb in acct_storage_g_roll_usage (db_conn=0x2064f10, sent_start=0, sent_end=0, archive_data=1, rollup_stats=0x7f16df619e60) at slurm_accounting_storage.c:788 No locals. #4 0x0000000000411373 in _rollup_handler (db_conn=0x2064f10) at slurmdbd.c:707 start_time = 1562943253 next_time = <value optimized out> tm = {tm_sec = 13, tm_min = 54, tm_hour = 10, tm_mday = 12, tm_mon = 6, tm_year = 119, tm_wday = 5, tm_yday = 192, tm_isdst = 1, tm_gmtoff = -14400, tm_zone = 0x2064740 "EDT"} rollup_stats = {rollup_time = {0, 0, 0}} __func__ = "_rollup_handler" #5 0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #6 0x0000003581ee8c4d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x7f16df91d700 (LWP 3532)): #0 0x000000358220e334 in __lll_lock_wait () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000035822095d8 in _L_lock_854 () from /lib64/libpthread.so.0 No symbol table info available. #2 0x00000035822094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0 No symbol table info available. #3 0x000000000040f975 in _rollup_handler_cancel () at slurmdbd.c:673 err = <value optimized out> #4 shutdown_threads () at slurmdbd.c:343 No locals. #5 0x000000000040fed7 in _signal_handler (no_data=<value optimized out>) at slurmdbd.c:864 rc = <value optimized out> sig = 15 sig_array = {2, 15, 1, 6, 12, 0} set = {__val = {18467, 0 <repeats 15 times>}} #6 0x0000003582207aa1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #7 0x0000003581ee8c4d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) I don't have debug versions of slurm and all the libraries installed. I hope the output above is still helpful to you.
(In reply to pbisbal from comment #16) > Sure. Here ya go: Looks like slurmdbd is waiting on mysql. Can you please provide the mysql, slurmctld and slurmdbd logs. > I don't have debug versions of slurm and all the libraries installed. I hope > the output above is still helpful to you. This is sufficient. Can you please call the following in mysql: > SHOW FULL PROCESSLIST\G > SHOW ENGINE INNODB STATUS
mysql> SHOW FULL PROCESSLIST\G *************************** 1. row *************************** Id: 3 User: slurm Host: localhost db: slurm_acct_db Command: Sleep Time: 8348 State: Info: NULL *************************** 2. row *************************** Id: 5 User: slurm Host: localhost db: slurm_acct_db Command: Query Time: 0 State: Sorting result Info: select node_name, time_start, time_end, state, tres from "test_pppl_event_table" where !(state & 32768) && (time_start < 183783600 && (time_end >= 183780000 || time_end = 0)) order by node_name, time_start *************************** 3. row *************************** Id: 16 User: slurm Host: localhost db: NULL Command: Query Time: 0 State: NULL Info: SHOW FULL PROCESSLIST 3 rows in set (0.00 sec)
mysql> SHOW ENGINE INNODB STATUS\G ERROR 1227 (42000): Access denied; you need the PROCESS privilege for this operation Looks like the slurm user doesn't have the privileges to run that command, and I don't know the root password for MySQL. I could reset it, but that would mean stopping the MySQL server right now, and I'm not sure if that's what you want me to do right now.
(Nate is temporarily unavailable, so I will step in for now) We need that mysql information, so I recommend stopping the slurmdbd if you can (if it's hung and ignoring SIGTERM, ignore it for now). Reset the mysql root password and restart mysql. Then restart slurmdbd, KILLing the old hung one if necessary. Let me know what you do and what the result is. If the slurmdbd hangs again, then give me the mysql queries from before: mysql> SHOW FULL PROCESSLIST\G mysql> SHOW ENGINE INNODB STATUS\G If after the mysql restart the slurmdbd does not hang, watch the slurmdbd agent queue in `sdiag` until it reaches zero. Let me know if you have any questions
Also, what is your mysql/mariadb version?
Also resend the slurmdbd.log and slurmctld.log if the slurmdbd hangs again after mysql restart.
> Then restart slurmdbd, KILLing the old hung one if necessary. So I Can use a 'kill -9 <PID>' if necessary?
kill <PID> hasn't been able to kill slurmdbd. I stopped mysql and reset the root password, but left slurmdbd running for now. Here is the mysql information you requested: mysql> SHOW FULL PROCESSLIST\G *************************** 1. row *************************** Id: 3 User: root Host: localhost db: NULL Command: Query Time: 0 State: NULL Info: SHOW FULL PROCESSLIST 1 row in set (0.00 sec) mysql> SHOW ENGINE INNODB STATUS\G *************************** 1. row *************************** Type: InnoDB Name: Status: ===================================== 190712 16:03:46 INNODB MONITOR OUTPUT ===================================== Per second averages calculated from the last 38 seconds ---------- SEMAPHORES ---------- OS WAIT ARRAY INFO: reservation count 4, signal count 4 Mutex spin waits 0, rounds 20, OS waits 1 RW-shared spins 4, OS waits 2; RW-excl spins 1, OS waits 1 ------------ TRANSACTIONS ------------ Trx id counter 0 9069824 Purge done for trx's n:o < 0 9069190 undo n:o < 0 0 History list length 1 LIST OF TRANSACTIONS FOR EACH SESSION: ---TRANSACTION 0 0, not started, process no 22143, OS thread id 139921109796608 MySQL thread id 3, query id 6 localhost root SHOW ENGINE INNODB STATUS -------- FILE I/O -------- I/O thread 0 state: waiting for i/o request (insert buffer thread) I/O thread 1 state: waiting for i/o request (log thread) I/O thread 2 state: waiting for i/o request (read thread) I/O thread 3 state: waiting for i/o request (write thread) Pending normal aio reads: 0, aio writes: 0, ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0 Pending flushes (fsync) log: 0; buffer pool: 0 28 OS file reads, 3 OS file writes, 3 OS fsyncs 0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s ------------------------------------- INSERT BUFFER AND ADAPTIVE HASH INDEX ------------------------------------- Ibuf: size 1, free list len 126, seg size 128, 0 inserts, 0 merged recs, 0 merges Hash table size 17393, node heap has 0 buffer(s) 0.00 hash searches/s, 0.00 non-hash searches/s --- LOG --- Log sequence number 1 1649626360 Log flushed up to 1 1649626360 Last checkpoint at 1 1649626360 0 pending log writes, 0 pending chkp writes 8 log i/o's done, 0.00 log i/o's/second ---------------------- BUFFER POOL AND MEMORY ---------------------- Total memory allocated 20358952; in additional pool allocated 653824 Dictionary memory allocated 33320 Buffer pool size 512 Free buffers 490 Database pages 22 Modified db pages 0 Pending reads 0 Pending writes: LRU 0, flush list 0, single page 0 Pages read 22, created 0, written 0 0.00 reads/s, 0.00 creates/s, 0.00 writes/s No buffer pool page gets since the last printout -------------- ROW OPERATIONS -------------- 0 queries inside InnoDB, 0 queries in queue 1 read views open inside InnoDB Main thread process no. 22143, id 139920938329856, state: waiting for server activity Number of rows inserted 0, updated 0, deleted 0, read 0 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s ---------------------------- END OF INNODB MONITOR OUTPUT ============================ 1 row in set (0.00 sec) mysql> I'll attach the new log files next.
MySQL version is 5.1.73: $ rpm -qa | grep mysql mysql-devel-5.1.73-8.el6_8.x86_64 mysql-5.1.73-8.el6_8.x86_64 mysql-libs-5.1.73-8.el6_8.x86_64 mysql-server-5.1.73-8.el6_8.x86_64
(In reply to pbisbal from comment #27) > > Then restart slurmdbd, KILLing the old hung one if necessary. > > So I Can use a 'kill -9 <PID>' if necessary? In this case, yes. It's not a good action until we have a backtrace and have tried other methods. As Nate said, it is a good way to lose data.
Latest log files uploaded. Had to use 'kill -9' to stop slurmdbd.
(In reply to pbisbal from comment #20) > Created attachment 10902 [details] > /var/log/slurm/slurmdbd.log Looks like this is a duplicate of bug#7155. Please apply this patch: https://github.com/SchedMD/slurm/commit/89910fd50a3120bd1d8b29fcf6d543e53dbfe732
Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? Prentice
(In reply to pbisbal from comment #36) > Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? It is currently only in 19.05.1. It should cleanly patch into 18.08 though.
(In reply to Nate Rini from comment #37) > (In reply to pbisbal from comment #36) > > Is this patch in 18.08.8? If so, can I just upgrade to 18.08.8? > > It is currently only in 19.05.1. It should cleanly patch into 18.08 though. Were you able to apply the patch? Do you need instructions?
Yes, I was able to apply the patch to 18.08.8, build a new set of RPMs, and get them installed, and it appears to have fixed my issues. I started the build Friday evening, but I had to leave before the build was complete, and wasn't able to apply the patch until Saturday. Thanks so much for all your help with this. Prentice
That last comment should read "wasn't able to apply the patched RPMs until Saturday"