Ticket 1576 - slurmdbd crash during purge
Summary: slurmdbd crash during purge
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 14.11.4
Hardware: Linux
OS: Linux
Importance: --- 2 - High Impact
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-04-01 02:23 MDT by Paul Edmon
Modified: 2015-04-28 07:37 MDT
CC List: 2 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.0pre4
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
don't unload plugin patch (369 bytes, patch)
2015-04-01 04:56 MDT, Brian Christiansen
Details | Diff
slurmdbd.mc (12.85 KB, text/xml)
2015-04-01 05:24 MDT, Paul Edmon
Details
slurmdbd.mc (37.34 KB, text/xml)
2015-04-02 02:44 MDT, Paul Edmon
Details
slurmdbd-0.mc (37.34 KB, text/xml)
2015-04-02 07:32 MDT, Paul Edmon
Details
debug patch 2 (1.81 KB, patch)
2015-04-02 10:02 MDT, Brian Christiansen
Details | Diff
slurmdbd-1.mc (806.58 KB, text/xml)
2015-04-06 03:36 MDT, Paul Edmon
Details
archive patch (28.42 KB, patch)
2015-04-15 10:31 MDT, Brian Christiansen
Details | Diff
purge patch (5.37 KB, patch)
2015-04-20 06:49 MDT, Brian Christiansen
Details | Diff
deadlock patch (571 bytes, patch)
2015-04-22 08:55 MDT, Brian Christiansen
Details | Diff
archive patch 2 (1.22 KB, patch)
2015-04-24 06:16 MDT, Brian Christiansen
Details | Diff
archive patch 3 (555 bytes, patch)
2015-04-27 05:03 MDT, Brian Christiansen
Details | Diff

Description Paul Edmon 2015-04-01 02:23:12 MDT
Last night we had a scheduled purge of our 40 GB, 2-year-old job database.  Just prior to this, it appears that slurmdbd suffered the memory leak noted in:

bugs.schedmd.com/show_bug.cgi?id=1534

In spite of being in this state, it did successfully begin the purge.  It successfully dumped the data in events and suspend, but when it got to jobs it threw the following errors:

Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61116 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61103 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61116 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61109 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr  1 00:04:37 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr  1 00:04:37 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr  1 00:04:42 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr  1 00:04:42 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:05:22 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:05:44 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:05:56 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:06:28 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:06:37 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:06:43 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:06:58 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:07:38 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:07:53 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:08:15 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:08:32 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:08:55 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:09:11 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:10:07 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:10:18 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:13:29 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 00:14:27 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

At that point slurmdbd crashed.  Our puppet software auto-restarted it, but it crashed again shortly afterward:

Apr  1 01:23:22 holy-slurm01 puppet-agent[12548]: (/Stage[main]/Slurm::Master/Service[slurmdbd]/ensure) ensure changed 'stopped' to 'running'
Apr  1 01:24:07 holy-slurm01 slurmdbd[14045]: slurmdbd version 14.11.4 started
Apr  1 01:24:07 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.
Apr  1 01:24:13 holy-slurm01 slurmctld[55137]: slurmdbd: agent queue size 11800
Apr  1 01:24:20 holy-slurm01 slurmctld[55137]: slurmdbd: agent queue size 2850
Apr  1 01:32:19 holy-slurm01 slurmdbd[14045]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 01:32:20 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr  1 01:32:20 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr  1 01:32:25 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr  1 01:32:25 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 01:32:41 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 01:32:49 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

[root@holy-slurm01 log]# grep fatal messages
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 03:46:31 holy-slurm01 slurmdbd[51927]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 05:52:05 holy-slurm01 slurmdbd[63753]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 07:52:29 holy-slurm01 slurmdbd[7120]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 09:55:48 holy-slurm01 slurmdbd[58235]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
[root@holy-slurm01 log]# grep fatal mysql
grep: mysql: No such file or directory
[root@holy-slurm01 log]# grep mysql messages
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 01:32:19 holy-slurm01 slurmdbd[14045]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 03:46:31 holy-slurm01 slurmdbd[51927]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 03:46:31 holy-slurm01 slurmdbd[51927]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 05:52:05 holy-slurm01 slurmdbd[63753]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 05:52:05 holy-slurm01 slurmdbd[63753]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 07:52:29 holy-slurm01 slurmdbd[7120]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 07:52:29 holy-slurm01 slurmdbd[7120]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 09:55:48 holy-slurm01 slurmdbd[58235]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 09:55:48 holy-slurm01 slurmdbd[58235]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I have the full logs, but they are massive given the other errors related to the memory leak.  Messages is 5.8G, and grepping out only the slurmdbd errors gives 5.5G.  These are larger than I can attach.

I suspect that purging a database this size is a no-go.  I had previously attempted to force it from the command line, but it timed out.  I was told that allowing it to run automatically on the 1st of the month would avoid this problem.  There is either a bug, or the timeouts are not sufficient to purge a 36-million-entry database.

Things are back up and stable, but our original goal of shrinking our job database has not been met.  It would be good if we could safely do this.
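
For reference, a monthly purge/archive run like this one is driven by the purge and archive options in slurmdbd.conf.  The exact values in use on this system are not shown in this ticket; settings of roughly this shape (the option names are real slurmdbd.conf options, but the retention values and path below are only illustrative) are what trigger the run on the 1st of the month:

ArchiveDir=/var/spool/slurmdbd/archive   # hypothetical path
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=12months
PurgeJobAfter=12months      # illustrative retention, not this site's actual setting
PurgeStepAfter=12months
PurgeSuspendAfter=12months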
Comment 1 Paul Edmon 2015-04-01 02:42:30 MDT
Actually, scratch that.  slurmdbd is still crashing.  Every time it restarts it hits the same error chain described above.  To reiterate:

Apr  1 10:01:41 holy-slurm01 slurmdbd[50263]: slurmdbd version 14.11.4 started
Apr  1 10:01:42 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.
Apr  1 10:36:39 holy-slurm01 slurmdbd[50263]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr  1 10:36:39 holy-slurm01 slurmdbd[50263]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr  1 10:36:40 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr  1 10:36:40 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr  1 10:36:46 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr  1 10:36:46 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:37:02 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:37:29 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:19 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:32 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:39 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:44 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:51 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:38:59 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:39:06 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:39:23 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr  1 10:39:35 holy-slurm01 slurmdbd[46066]: chdir to /var/tmp
Apr  1 10:39:35 holy-slurm01 slurmdbd[46066]: slurmdbd version 14.11.4 started
Apr  1 10:39:36 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.

The dbd is crashing each time.  Essentially we can't keep slurmdbd stable right now; I can keep restarting it, but it keeps crashing.
Comment 2 Danny Auble 2015-04-01 02:44:00 MDT
Sorry you are going through this Paul.  I wouldn't expect this kind of behavior.

I am guessing when you say "Things are back up and stable" you have taken out the purge options from the slurmdbd.conf.

Would it be possible to try again?

What does 'SHOW INNODB STATUS' give you when run at a mysql prompt during the rollup?

Is it possible you have another process, like mysqldump, running in the background at the same time?  That is usually the culprit.  If that is the case, we can do something to make things better.  Bug 1304 sounds very similar to this.

One way to get around the ER_LOCK_WAIT_TIMEOUT is to increase

innodb_lock_wait_timeout

in your my.cnf.

I remember the default used to be a 15-minute timeout, but according to the docs it appears mysql now defaults to 50 seconds instead (http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout).

If that is the case and other processes are hitting the database at the same time, I can easily see how 50 seconds wouldn't be long enough.

Could you try to increase this to 900 or greater and restart the mysqld and the dbd?
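
A quick way to confirm what the running server is actually using, both before and after editing my.cnf and restarting mysqld, is to query the variables from a mysql prompt.  These are plain MySQL statements, nothing Slurm-specific:

mysql> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
mysql> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';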
Comment 3 Paul Edmon 2015-04-01 02:49:54 MDT
I haven't changed the DB yet so we are still running with the settings 
that automatically initiated the purge.  I can remove those if you think 
it will keep it from crashing.  Here is the show innodb status:

[root@holy-slurm01 log]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1710474
Server version: 5.1.69-log Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input 
statement.

mysql> show innodb status;
+--------+------+--------+
| Type   | Name | Status |
+--------+------+--------+
| InnoDB |      |
=====================================
150401 10:48:05 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 22 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 421352606, signal count 411903012
Mutex spin waits 0, rounds 11433585640, OS waits 38695224
RW-shared spins 641564306, OS waits 312022856; RW-excl spins 72794209, OS waits 49283733
------------
TRANSACTIONS
------------
Trx id counter 0 371325858
Purge done for trx's n:o < 0 371325133 undo n:o < 0 0
History list length 132
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 4821, OS thread id 140336043128576
MySQL thread id 1710474, query id 235838008 localhost root
show innodb status
---TRANSACTION 0 371325852, not started, process no 4821, OS thread id 140348404791040
MySQL thread id 1710029, query id 235837350 localhost slurm
---TRANSACTION 0 371325128, not started, process no 4821, OS thread id 140346243675904
MySQL thread id 1710025, query id 235823044 localhost slurm
---TRANSACTION 0 371325815, ACTIVE 53 sec, process no 4821, OS thread id 140348415973120
mysql tables in use 1, locked 0
MySQL thread id 1710451, query id 235838009 localhost slurm
select t1.id_step, t1.time_start, t1.time_end, t1.time_suspended, 
t1.step_name, t1.nodelist, t1.node_inx, t1.state, t1.kill_requid, 
t1.exit_code, t1.nodes_alloc, t1.cpus_alloc, t1.task_cnt, t1.task_dist, 
t1.user_sec, t1.user_usec, t1.sys_sec, t1.sys_usec, t1.max_disk_read, 
t1.max_disk_read_task, t1.max_disk_read_node, t1.ave_disk_read, 
t1.max_disk_write, t1.max_disk_write_task, t1.max_disk_write_node, 
t1.ave_disk_write, t1.max_vsize, t1.max_vsize_task, t1.max_vsize_node, 
t1.ave_vsize, t1.max_rss, t1.max_rss_task, t1.max_rss_node, t1.ave_rss, 
t1.max_pages, t1.max_pages_task, t1.max_pages_node,
Trx read view will not see trx with id >= 0 371325816, sees < 0 371325130
---TRANSACTION 0 371325257, ACTIVE 453 sec, process no 4821, OS thread id 140348424492800
MySQL thread id 1710109, query id 235832975 localhost slurm
Trx read view will not see trx with id >= 0 371325258, sees < 0 371325130
---TRANSACTION 0 371325130, ACTIVE 510 sec, process no 4821, OS thread id 140348436739840 fetching rows, thread declared inside InnoDB 58
mysql tables in use 2, locked 0
267 lock struct(s), heap size 30704, 1570 row lock(s), undo log entries 1570
MySQL thread id 1710027, query id 235837349 localhost slurm Copying to tmp table
select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, 
job.array_task_pending, job.time_eligible, job.time_start, job.time_end, 
job.time_suspended, job.cpus_alloc, job.cpus_req, job.id_resv, 
SUM(step.consumed_energy) from "odyssey_job_table" as job left outer 
join "odyssey_step_table" as step on job.job_db_inx=step.job_db_inx and 
(step.id_step>=0) where (job.time_eligible < 1427850000 && (job.time_end 
 >= 1427846400 || job.time_end = 0)) group by job.job_db_inx order by 
job.id_assoc, job.time_eligible
Trx read view will not see trx with id >= 0 371325131, sees < 0 371325128
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
  ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1087776614 OS file reads, 23847621 OS file writes, 14255701 OS fsyncs
4181.90 reads/s, 55273 avg bytes/read, 12.45 writes/s, 1.95 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 398, seg size 400,
5807829 inserts, 5807829 merged recs, 957399 merges
Hash table size 17393, node heap has 1 buffer(s)
15750.51 hash searches/s, 29272.12 non-hash searches/s
---
LOG
---
Log sequence number 144 2401285555
Log flushed up to   144 2401285555
Last checkpoint at  144 2401285555
0 pending log writes, 0 pending chkp writes
12545380 log i/o's done, 0.86 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 21146200; in additional pool allocated 1047552
Dictionary memory allocated 176120
Buffer pool size   512
Free buffers       0
Database pages     510
Modified db pages  0
Pending reads 1
Pending writes: LRU 0, flush list 0, single page 0
Pages read 2993816805, created 1381330, written 13753708
14108.54 reads/s, 0.05 creates/s, 13.73 writes/s
Buffer pool hit rate 955 / 1000
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
4 read views open inside InnoDB
Main thread process no. 4821, id 140346901477120, state: sleeping
Number of rows inserted 4942516, updated 5005110, deleted 9622643, read 83360981565
0.59 inserts/s, 10.09 updates/s, 0.00 deletes/s, 542952.55 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
  |
+--------+------+--------+
1 row in set, 1 warning (0.01 sec)


I will answer the rest of the questions in a separate note.

-Paul Edmon-


Comment 4 Paul Edmon 2015-04-01 02:54:38 MDT
We do have another query going:

mysql> show processlist;
+---------+-------------+--------------------------------------+-------+-------------+---------+-----------------------------------------------------------------+------------------+
| Id      | User        | Host                                 | db    | Command     | Time    | State                                                           | Info             |
+---------+-------------+--------------------------------------+-------+-------------+---------+-----------------------------------------------------------------+------------------+
|       7 | replication | db-internal.rc.fas.harvard.edu:51846 | NULL  | Binlog Dump | 1645344 | Has sent all binlog to slave; waiting for binlog to be updated  | NULL |
| 1710025 | slurm       | localhost                            | slurm | Sleep       |     626 |                                                                 | NULL |
| 1710027 | slurm       | localhost                            | slurm | Query       |      59 | Copying to tmp table                                            | select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_elig |
| 1710029 | slurm       | localhost                            | slurm | Sleep       |       1 |                                                                 | NULL |
| 1710109 | slurm       | localhost                            | slurm | Sleep       |     564 |                                                                 | NULL |
| 1710474 | root        | localhost                            | NULL  | Query       |       0 | NULL                                                            | show processlist |
| 1710502 | slurm       | localhost                            | slurm | Sleep       |      65 |                                                                 | NULL |
+---------+-------------+--------------------------------------+-------+-------------+---------+-----------------------------------------------------------------+------------------+
7 rows in set (0.00 sec)

But it wasn't a mysqldump and shouldn't have caused a delay in the dump.

Our my.cnf looks like:

[root@holy-slurm01 etc]# cat my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

!includedir /etc/mysql.d
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

So it looks like we are using the defaults.  I remember asking Danny about this, and he recommended against changing it unless we had to.  Do you think we should increase the timeout?

For reference our version of mysql is:

[root@holy-slurm01 etc]# rpm -qa | grep mysql
mysql-5.1.69-1.el6_4.x86_64
mysql-devel-5.1.69-1.el6_4.x86_64
mysql-libs-5.1.69-1.el6_4.x86_64
nagios-plugins-mysql-1.4.16-5.el6.x86_64
mysql-server-5.1.69-1.el6_4.x86_64


-Paul Edmon-

Comment 5 Danny Auble 2015-04-01 03:09:54 MDT
No, don't remove anything from the slurmdbd.conf unless we can't fix the issue readily.

Thanks for the information, Paul.  It looks like that select is taking a long time, which I suppose can be expected.

Could you try increasing the timeout in my.cnf?  I am not sure what 'Binlog Dump' does.  It looks harmless, but perhaps it is causing something that isn't obvious.

You will also want to set innodb_buffer_pool_size to at least 64M.  This might also fix the issue, since we are dealing with a lot of data.

From the docs...

Note that at least the first time running the slurmdbd with MySQL you need to make sure your my.cnf file has innodb_buffer_pool_size equal to at least 64 megabytes. You can accomplish this by adding the line:
innodb_buffer_pool_size=64M
under the [mysqld] reference in the my.cnf file and restarting the mysqld. This is needed when converting large tables over to the new database schema.

While this doesn't spell out this scenario it could be relevant.

Please add these two options to the my.cnf and restart the daemons and report back.
Comment 6 Paul Edmon 2015-04-01 03:23:14 MDT
[root@holy-slurm01 etc]# cat my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
innodb_lock_wait_timeout=900
innodb_buffer_pool_size=64M

!includedir /etc/mysql.d
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

This is the modified my.cnf just for the record.

-Paul Edmon-

Comment 7 Paul Edmon 2015-04-01 04:03:05 MDT
So we had the memory leak again noted in

bugs.schedmd.com/show_bug.cgi?id=1534

I don't know whether the change in my.cnf caused the issue or just exacerbated it.  I've restarted slurmdbd to see if it happens again.

-Paul Edmon-

Comment 8 Paul Edmon 2015-04-01 04:38:29 MDT
Yeah, that definitely exacerbated the memory leak in 
(http://bugs.schedmd.com/show_bug.cgi?id=1534):

top - 12:35:40 up 19 days,  2:49,  5 users,  load average: 2.43, 3.14, 3.12
Tasks: 1568 total,   2 running, 1566 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us,  1.7%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi, 0.0%si,  0.0%st
Mem:  264498708k total, 189166220k used, 75332488k free,   662488k buffers
Swap:  4194296k total,        0k used,  4194296k free, 162701464k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
34477 root      20   0  264m 225m 1672 R 99.8  0.1  79:55.18 syslog-ng
28465 slurm     20   0 18.2g  17g 2552 S 65.0  7.0   1:42.46 slurmdbd

slurmdbd is normally never that big, and it's still growing.  It's also spewing:

Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56581 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56581 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr  1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56584 > -65536)

Do you think this is due to us running an older version of mysql (we are on 5.1.69)?  What do you recommend we do?  Change the my.cnf back to what we had?

I've had to restart slurmdbd again to get it to reset itself.

-Paul Edmon-

Comment 9 Brian Christiansen 2015-04-01 04:56:12 MDT
Created attachment 1792 [details]
don't unload plugin patch

Will you apply this patch and then run the slurmdbd under valgrind again? The patch just prevents the plugin from being unloaded so that we can see where the memory is going. 

Will you run it like this?

valgrind --tool=memcheck --xml=yes --xml-file=slurmdbd.mc --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dvv
Comment 10 Paul Edmon 2015-04-01 05:05:06 MDT
Sure.  I will do so.

-Paul Edmon-

Comment 11 Brian Christiansen 2015-04-01 05:07:22 MDT
Thanks. The valgrind options will spit out a "slurmdbd..0.mc" file. Will you attach that after it's done?
Comment 12 Paul Edmon 2015-04-01 05:15:08 MDT
Will do.  I'm rebuilding the rpms now.

-Paul Edmon-

Comment 13 Paul Edmon 2015-04-01 05:24:43 MDT
Created attachment 1793 [details]
slurmdbd.mc

Okay, so I've started valgrind on slurmdbd.  It's spit out the initial mc file, but it's still working; here's the initial data:
[root@holy-slurm01 valgrind]# valgrind --tool=memcheck --xml=yes --xml-file=slurmdbd.mc --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dvv
slurmdbd: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: slurmdbd version 14.11.4 started
slurmdbd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:8
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:6
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:56875 IP:10.242.104.140 CONN:10
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:6
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:59719 IP:10.242.86.133 CONN:10
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:100663 IP:10.242.104.137 CONN:6
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:59719 IP:10.242.86.133 CONN:10
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:56875 IP:10.242.104.140 CONN:6
slurmdbd: debug:  DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:557658 IP:10.242.128.22 CONN:12

I will let you know when it's done.

-Paul Edmon-

Comment 14 Brian Christiansen 2015-04-01 09:06:57 MDT
One more thing to consider for mysql: setting innodb_buffer_pool_size to 64M is the minimum we've seen needed before. The documentation for the parameter says that "On a dedicated database server, you may set this to up to 80% of the machine physical memory size." Mine was set at 256M on an 8GB machine. You may consider increasing the value.

http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size
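
As a sketch only (nothing in this ticket fixes a value, and the right number depends on how much of the host's memory you want to dedicate to mysqld), a my.cnf along these lines combines the two settings discussed above:

[mysqld]
innodb_lock_wait_timeout=900
innodb_buffer_pool_size=4G    # illustrative value for a large shared host; 64M is the documented minimum for slurmdbd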
Comment 15 Paul Edmon 2015-04-01 13:52:22 MDT
Okay, good to know.  I haven't seen slurmdbd leak again; it could be that slowing it down with valgrind is suppressing it somewhat.  It also hasn't crashed.  That said, we can't keep running it with valgrind forever.  I will check its state again in the morning.

For the record our DB is on the same machine as the master (we found 
that it was best for performance).  We have 256GB available for memory, 
so we have plenty of room if we want to up the buffer.

-Paul Edmon-

Comment 16 Paul Edmon 2015-04-02 02:44:28 MDT
Created attachment 1796 [details]
slurmdbd.mc

Update: it hasn't leaked memory, but the memory footprint of slurmdbd 
has more than doubled (it used to be ~300 MB and is now ~700 MB).  I've 
attached the current valgrind results.  slurmdbd is still running under 
valgrind; let me know if you want me to change that.

On a side note we were planning on upgrading from 14.11.4 to 14.11.5 on 
Monday.  Do you think we still should or should we wait?

-Paul Edmon-

Comment 17 Brian Christiansen 2015-04-02 05:59:16 MDT
I don't see any harm in upgrading on Monday. 14.11.6 will probably be released in 2-3 weeks.

Maybe increasing the buffer pool size will give it the room to handle the large numbers of records.

Could you try running the slurmdbd under valgrind over the weekend?
Comment 18 Paul Edmon 2015-04-02 06:09:02 MDT
Sure can do.  We will plan on doing the upgrade then.

-Paul Edmon-

Comment 19 Brian Christiansen 2015-04-02 06:09:44 MDT
Sounds good.
Comment 20 Paul Edmon 2015-04-02 07:00:05 MDT
So I talked with my manager and we are not going to do a Slurm upgrade 
until this database issue is resolved.  So I'm game for anything else we 
want to try in terms of fixing this.  I'm out tomorrow but will be back on 
Monday.

-Paul Edmon-

Comment 21 Paul Edmon 2015-04-02 07:32:16 MDT
Created attachment 1798 [details]
slurmdbd-0.mc

So after talking about it here, we thought that maybe issuing the archive 
dump caused the memory leak, since we did that shortly after we upgraded 
to 14.11.  So I ran it again.

[root@holy-slurm01 valgrind]# sacctmgr -i archive dump
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
  Problem dumping archive: Unspecified error

It failed, and while it was running slurmctld slowed to a crawl; slurmdbd 
was still running at 100% load after the purge had failed.  I 
had to kill slurmdbd to get it back to normal.  I've included the last 
valgrind data.  I'm thinking that it's possible that our DB is just too 
big to purge.  We may have to get more creative to purge this as it is 
36 million rows and 40 GB.  Have you ever had anyone purge something 
this big?

-Paul Edmon-

Comment 22 Brian Christiansen 2015-04-02 10:02:31 MDT
Created attachment 1799 [details]
debug patch 2

We're not aware of anyone purging records of that size. It's possible that it's been done, but they haven't told us. 

Questions:

What purge failed? The one initiated with sacctmgr or the one periodically being run by the slurmdbd? I'm assuming the sacctmgr one.

Do you see any errors in the slurmdbd log?

How long did you wait before killing the slurmdbd? Did you kill the slurmdbd or valgrind? I ask because the valgrind output was the same size as the previous one.

Can you tell if the slurmdbd is doing its own archive -- without issuing the sacctmgr command (e.g. do you see archive files being created, or log statements)? I've attached a patch that will print out logs at "info" when the rollup and archiving are happening.

ex.
slurmdbd: running rollup at Thu Apr 02 14:51:21 2015
slurmdbd: archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: done archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: rollup finished at Thu Apr 02 14:51:21 2015



Let's try this:
1. apply the debug patch.
2. run without valgrind; it's probably slowing it down too much.
3. increase innodb_buffer_pool_size to something large like 80GB.
4. set innodb_lock_wait_timeout to 60 minutes (a minimal MySQL sketch for steps 3-5 follows step 6).
5. restart mysql and verify that the settings are in place.
ex.
MariaDB [(none)]> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 300   |
+--------------------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
1 row in set (0.00 sec)
6. restart slurmdbd
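
As mentioned in step 4, here is a minimal sketch of the MySQL side of steps 3-5. This is only an illustration under assumptions, not part of any patch: innodb_lock_wait_timeout is dynamic and can be raised on the fly, while innodb_buffer_pool_size is only read at startup on MySQL/MariaDB of this vintage, so it has to go in my.cnf and be picked up by the restart in step 5.

-- Raise the lock wait timeout without a restart (value is in seconds):
SET GLOBAL innodb_lock_wait_timeout = 3600;

-- innodb_buffer_pool_size must be set under [mysqld] in my.cnf, e.g.
--   innodb_buffer_pool_size = 80G
-- then restart mysql and verify both values took effect:
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';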

Then be patient and see if it can get through all of the records. Maybe you could do this during a system maintenance time so that users don't see the slowness.

How does that sound?
Comment 23 Brian Christiansen 2015-04-02 10:10:21 MDT
Actually, the idea about doing it during a system maintenance window is a moot point because the dbd should start rolling up at startup. Also, if you can, try not to kill the slurmdbd. The rollup is being done in a transaction, and killing it will just throw away everything that was done, which will then have to be redone.
Comment 24 Paul Edmon 2015-04-06 03:36:48 MDT
Created attachment 1801 [details]
slurmdbd-1.mc

Yeah, sorry about that; I was out on Friday.  I've attached the latest 
valgrind results from running for the past 3 days.  For the patch you 
just sent, should I remove the previous patch?  I will get the changes 
you suggested put in later today.

As to your questions: the purges that failed were both the automatic purge 
that initiated this ticket and the forced purge I ran via 
sacctmgr.  I waited about 15 minutes before killing the dbd, as the ctld was 
inoperable and I had to get things back to a stable state.  I did a 
kill -9 of the pid for slurmdbd.

It does create the archives for the events and suspends.  However, it fails 
when it gets to jobs, I think due to the sheer size of our job table.

Our maintenance window is over now, so anything will have to be done on 
the live system.

-Paul Edmon-

Comment 25 Brian Christiansen 2015-04-06 06:28:14 MDT
No problem. You can apply the patch on top of the other. I also committed a patch that purges 50,000 records at a time so that the locks can be periodically released. You can apply this as well.

https://github.com/SchedMD/slurm/commit/2f477b42d36b428f048952b053d59aa5f11c69da.patch
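
To illustrate the idea behind that commit (this is only a sketch of the approach, not the patch code, which does the chunking inside the slurmdbd's mysql plugin; the table name below is this cluster's and the time column is an assumption), the purge is broken into bounded deletes so InnoDB row locks are released between batches:

-- Hypothetical chunked purge: delete at most 50,000 matching rows per
-- statement and repeat until none remain, instead of one giant DELETE
-- that holds row locks for the entire run.
DELETE FROM odyssey_job_table
 WHERE time_submit <= UNIX_TIMESTAMP('2014-09-30 23:59:59')
 LIMIT 50000;
-- repeat until the statement reports 0 rows affected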

Then when you restart, look for the log statements.

ex.
slurmdbd: running rollup at Thu Apr 02 14:51:21 2015
slurmdbd: archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: done archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: rollup finished at Thu Apr 02 14:51:21 2015
Comment 26 Paul Edmon 2015-04-06 12:42:34 MDT
Okay, I will do so.  Currently slurmdbd is running stably with the 
patch.  I'm not running it in interactive mode but as a daemon.  What 
debug level do you want me to have it log at?

-Paul Edmon-

Comment 27 Danny Auble 2015-04-06 12:56:11 MDT
If it is running fine just leave it and report when the rollup finishes.  If for some reason it fails then introduce the new patch. 

Comment 28 Paul Edmon 2015-04-07 01:57:50 MDT
So it did do the rollup and it took about 50 minutes, but it did not 
archive.  Will it automatically initiate that?  Or should I try to force 
it to archive?

I haven't applied the patch you sent yet.

-Paul Edmon-

Comment 29 Brian Christiansen 2015-04-07 02:51:47 MDT
Will you send your slurmdbd.conf?
Comment 30 Paul Edmon 2015-04-07 03:12:05 MDT
# Archive info
ArchiveJobs=yes
ArchiveDir=/slurm/archive
ArchiveSteps=yes
ArchiveResvs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
#ArchiveScript=
PurgeEventAfter=6month
PurgeJobAfter=6month
PurgeResvAfter=6month
PurgeStepAfter=6month
PurgeSuspendAfter=6month

# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2

# slurmDBD info
CommitDelay=1
DbdHost=localhost
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=info
#DefaultQOS=normal,standby
PidFile=/var/slurmd/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes

# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=[censored]
StorageLoc=slurm

Comment 31 Brian Christiansen 2015-04-07 06:31:50 MDT
More questions:
Do you see that archive files were created? How big are they?
Do you see that records were purged from the database? (e.g. is the db smaller, does
sacct return jobs from 6 months ago -- sacct -E 2014-09-01)

I'm looking at some other possibilities that might be a problem with large databases -- particularly when archiving. I would just let it keep running as is in the meantime.
Comment 32 Paul Edmon 2015-04-07 06:36:49 MDT
We've actually never had a successful purge.  The errors at the start of 
this ticket are what happened when the purge tried to run.  Nothing has 
actually been removed from the database, as it has never dumped any 
files but these:

[root@holy-slurm01 ~]# ls -ltr /slurm/archive
total 506224
-rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users    253639 Mar 10 09:17 odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users 176829079 Apr  1 13:01 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59.old
-rw------- 1 slurm slurm_users    298515 Apr  1 13:01 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59.old
-rw------- 1 slurm slurm_users 176829079 Apr  6 14:33 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    298515 Apr  6 14:33 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59

As I noted at the beginning, we've never purged this database, so it is 
gigantic.  It should be manageable once we get it down to a 6 month 
size.  Right now it still has everything in it.

-Paul Edmon-

Comment 33 Brian Christiansen 2015-04-07 11:00:42 MDT
Ok. Thanks for your patience and help. My hypothesis is that the slurmdbd is failing to archive the steps, which would then short-circuit the rest of the archiving and purging. I'm trying to confirm this.

Do you see any errors in the slurmdbd log file? Specifically, 
"Can't save archive, create file"
or
"Error writing file "

Can you upload the slurmdbd log file for when it tried to archive?


Will you show the last run times from the database? 
e.g.
$mysql slurm
mysql> select * from <clustername>_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1428012000 |   1427958000 |     1427871600 |
+---------------+--------------+----------------+

If I can't see anything from the logs I'll get you another patch with more logging.

Thanks,
Brian
Comment 34 Paul Edmon 2015-04-08 02:54:18 MDT
So I tried to force a dump and got this error:

[root@holy-slurm01 ~]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

In the logs I got:

Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59825 > -65536)
Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59819 > -65536)
Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59824 > -65536)
Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59825 > -65536)
Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59816 > -65536)
Apr  8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59821 > -65536)

That's the only error or traffic.  I was watching mysql and slurmdbd at 
the time.  MySQL did the query and then sent the result to slurmdbd.  Slurmdbd 
was in the process of swallowing the data from that query, got to about 
14 GB of memory, and then failed with this error.  I had to 
stop it to get it back to normal.  I will also note that slurmctld was 
inoperable while I was forcing it to dump, so the entire system ground to 
a halt.

I think the query itself is just too big for slurmdbd to handle.  Clearly 
it is suffering from a buffer overrun, probably because of an integer 
overflow: the negative buffer sizes in the pack32/packmem errors look like 
a 32-bit size wrapping around.  All the size counters in Slurm should 
probably be 64 bit if they aren't.  So not a memory leak, but rather a 
buffer overrun.

Anyway, the patch that only spits out 50,000 records at a time may not be a 
bad idea, as it may allow it to output things in more manageable portions.

-Paul Edmon-

Comment 35 Brian Christiansen 2015-04-08 04:05:39 MDT
Ok. Looks like I need to rewrite the archiving code to write out chunks at a time like I did for purging. For now, I recommend turning off purging and archiving until I get you the new patch.
Comment 36 Paul Edmon 2015-04-08 04:12:16 MDT
Sure will do.

-Paul Edmon-

Comment 37 Brian Christiansen 2015-04-15 10:31:28 MDT
Created attachment 1819 [details]
archive patch

I've made the changes in 15.08. 
https://github.com/SchedMD/slurm/commit/343d55f85dfbf8eeb4a15d9f110b1ec70819c9a1
https://github.com/SchedMD/slurm/commit/73202646e66f597dab89eca42a499c03bd257919

I've attached a patch for 14.11 that consists of the two commits. What the patch does is, if there are records older than 60 days, it will catch up by archiving and purging a month at a time until it is caught up. So you'll see a lot of archive files being generated. This should be a good indication of its progress.

Instead of using "sacctmgr archive dump" to trigger the archive, just put the time stamp in <cluster>__last_ran_table.montly in the db to 1 second in the past. Doing it this way will prevent the sacctmgr call and the slurmdbd's rollup from waiting on each other for locks on the tables.

Let me know if you have any questions and how it goes.

Thanks,
Brian

ex.
mysql> select monthly_rollup from knc_last_ran_table;
+----------------+
| monthly_rollup |
+----------------+
|     1427871600 |
+----------------+
1 row in set (0.00 sec)

mysql> update knc_last_ran_table set monthly_rollup=1427871599;
Query OK, 0 rows affected (0.05 sec)
Rows matched: 1  Changed: 0  Warnings: 0

mysql> select monthly_rollup from knc_last_ran_table;
+----------------+
| monthly_rollup |
+----------------+
|     1427871599 |
+----------------+
1 row in set (0.00 sec)

mysql> select FROM_UNIXTIME(monthly_rollup) from knc_last_ran_table;
+-------------------------------+
| FROM_UNIXTIME(monthly_rollup) |
+-------------------------------+
| 2015-03-31 23:59:59           |
+-------------------------------+
1 row in set (0.00 sec)
Comment 38 Paul Edmon 2015-04-20 06:03:17 MDT
Thanks for the patch.  I will apply it and see what happens.

-Paul Edmon-

Comment 39 Paul Edmon 2015-04-20 06:08:50 MDT
I'm getting the following error when I use rpmbuild:

+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd slurm-14.11.4
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (1576.patch):'
Patch #0 (1576.patch):
+ /bin/cat /root/rpmbuild/SOURCES/1576.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/slurmdbd/slurmdbd.c
+ echo 'Patch #1 (1576.patch2):'
Patch #1 (1576.patch2):
+ /bin/cat /root/rpmbuild/SOURCES/1576.patch2
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
patching file src/slurmdbd/slurmdbd.c
+ echo 'Patch #2 (archive_1411.patch):'
Patch #2 (archive_1411.patch):
+ /bin/cat /root/rpmbuild/SOURCES/archive_1411.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
Hunk #1 FAILED at 50.
Hunk #2 succeeded at 414 (offset -3 lines).
Hunk #3 succeeded at 1449 (offset -3 lines).
Hunk #4 succeeded at 1464 (offset -3 lines).
Hunk #5 succeeded at 1480 (offset -3 lines).
Hunk #6 succeeded at 1574 (offset -3 lines).
Hunk #7 succeeded at 1589 (offset -3 lines).
Hunk #8 succeeded at 1632 (offset -3 lines).
Hunk #9 succeeded at 1710 (offset -3 lines).
Hunk #10 succeeded at 1725 (offset -3 lines).
Hunk #11 succeeded at 1742 (offset -3 lines).
Hunk #12 succeeded at 1793 (offset -3 lines).
Hunk #13 succeeded at 1808 (offset -3 lines).
Hunk #14 succeeded at 1862 (offset -3 lines).
Hunk #15 succeeded at 1951 (offset -3 lines).
Hunk #16 succeeded at 2059 (offset -3 lines).
Hunk #17 succeeded at 2074 (offset -3 lines).
Hunk #18 FAILED at 2121.
2 out of 18 hunks FAILED -- saving rejects to file 
src/plugins/accounting_storage/mysql/as_mysql_archive.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.tkM0lZ (%prep)


RPM build errors:
     Bad exit status from /var/tmp/rpm-tmp.tkM0lZ (%prep)
[root@holy-slurm01 slurm-1576]#

-Paul Edmon-

Comment 40 Brian Christiansen 2015-04-20 06:20:03 MDT
1576.patch2 and the archive_1411.patch have different code paths now. Will you remove 1576.patch and 1576.patch2? You're ok running without them now.
Comment 41 Paul Edmon 2015-04-20 06:22:13 MDT
Still choking:

+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd slurm-14.11.4
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (archive_1411.patch):'
Patch #0 (archive_1411.patch):
+ /bin/cat /root/rpmbuild/SOURCES/archive_1411.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
Hunk #1 FAILED at 50.
Hunk #2 succeeded at 414 (offset -3 lines).
Hunk #3 succeeded at 1449 (offset -3 lines).
Hunk #4 succeeded at 1464 (offset -3 lines).
Hunk #5 succeeded at 1480 (offset -3 lines).
Hunk #6 succeeded at 1574 (offset -3 lines).
Hunk #7 succeeded at 1589 (offset -3 lines).
Hunk #8 succeeded at 1632 (offset -3 lines).
Hunk #9 succeeded at 1710 (offset -3 lines).
Hunk #10 succeeded at 1725 (offset -3 lines).
Hunk #11 succeeded at 1742 (offset -3 lines).
Hunk #12 succeeded at 1793 (offset -3 lines).
Hunk #13 succeeded at 1808 (offset -3 lines).
Hunk #14 succeeded at 1862 (offset -3 lines).
Hunk #15 succeeded at 1951 (offset -3 lines).
Hunk #16 succeeded at 2059 (offset -3 lines).
Hunk #17 succeeded at 2074 (offset -3 lines).
Hunk #18 FAILED at 2121.
2 out of 18 hunks FAILED -- saving rejects to file 
src/plugins/accounting_storage/mysql/as_mysql_archive.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.GSgL1E (%prep)


RPM build errors:
     Bad exit status from /var/tmp/rpm-tmp.GSgL1E (%prep)

-Paul Edmon-

Comment 42 Brian Christiansen 2015-04-20 06:49:35 MDT
Created attachment 1833 [details]
purge patch

Ok, figured it out. It's expecting the previous purging patch. Will you apply this one before the archive_1411.patch?
Comment 43 Paul Edmon 2015-04-20 06:53:07 MDT
Ah, yeah that would make sense.  I never applied that patch.  Looks to 
be building fine now.

-Paul Edmon-

Comment 44 Paul Edmon 2015-04-20 08:07:18 MDT
So it looks like it did start the archive:

[root@holy-slurm01 x86_64]# ls -latr /slurm/archive
total 364084
drwxr-xr-x 8 root  root             4096 Mar  5 10:32 ..
-rw------- 1 slurm slurm_users 176829079 Apr 20 15:13 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    298515 Apr 20 15:13 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    126976 Apr 20 15:24 odyssey_event_archive_2013-08-01T00:00:00_2013-09-01T00:00:00
-rw------- 1 slurm slurm_users    635274 Apr 20 15:25 odyssey_event_archive_2013-09-01T00:00:00_2013-10-01T00:00:00
-rw------- 1 slurm slurm_users    645999 Apr 20 15:25 odyssey_event_archive_2013-10-01T00:00:00_2013-11-01T00:00:00
-rw------- 1 slurm slurm_users    487534 Apr 20 15:25 odyssey_event_archive_2013-11-01T00:00:00_2013-12-01T00:00:00
-rw------- 1 slurm slurm_users    977130 Apr 20 15:25 odyssey_event_archive_2013-12-01T00:00:00_2014-01-01T00:00:00
-rw------- 1 slurm slurm_users   1511138 Apr 20 15:25 odyssey_event_archive_2014-01-01T00:00:00_2014-02-01T00:00:00
-rw------- 1 slurm slurm_users   7746166 Apr 20 15:25 odyssey_event_archive_2014-02-01T00:00:00_2014-03-01T00:00:00
-rw------- 1 slurm slurm_users  41228052 Apr 20 15:26 odyssey_event_archive_2014-03-01T00:00:00_2014-04-01T00:00:00
-rw------- 1 slurm slurm_users  22925154 Apr 20 15:26 odyssey_event_archive_2014-04-01T00:00:00_2014-05-01T00:00:00
-rw------- 1 slurm slurm_users  46937755 Apr 20 15:26 odyssey_event_archive_2014-05-01T00:00:00_2014-06-01T00:00:00
-rw------- 1 slurm slurm_users  24678369 Apr 20 15:26 odyssey_event_archive_2014-06-01T00:00:00_2014-07-01T00:00:00
-rw------- 1 slurm slurm_users   5824873 Apr 20 15:26 odyssey_event_archive_2014-07-01T00:00:00_2014-08-01T00:00:00
-rw------- 1 slurm slurm_users  10123392 Apr 20 15:27 odyssey_event_archive_2014-08-01T00:00:00_2014-09-01T00:00:00
-rw------- 1 slurm slurm_users  12981631 Apr 20 15:27 odyssey_event_archive_2014-09-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users     68775 Apr 20 15:27 odyssey_suspend_archive_2013-08-01T00:00:00_2013-09-01T00:00:00
-rw------- 1 slurm slurm_users     22629 Apr 20 15:27 odyssey_suspend_archive_2013-09-01T00:00:00_2013-10-01T00:00:00
-rw------- 1 slurm slurm_users     67623 Apr 20 15:27 odyssey_suspend_archive_2013-10-01T00:00:00_2013-11-01T00:00:00
-rw------- 1 slurm slurm_users       895 Apr 20 15:27 odyssey_suspend_archive_2013-11-01T00:00:00_2013-12-01T00:00:00
-rw------- 1 slurm slurm_users     13186 Apr 20 15:27 odyssey_suspend_archive_2014-01-01T00:00:00_2014-02-01T00:00:00
-rw------- 1 slurm slurm_users     67615 Apr 20 15:27 odyssey_suspend_archive_2014-02-01T00:00:00_2014-03-01T00:00:00
-rw------- 1 slurm slurm_users      9028 Apr 20 15:27 odyssey_suspend_archive_2014-04-01T00:00:00_2014-05-01T00:00:00
-rw------- 1 slurm slurm_users       964 Apr 20 15:27 odyssey_suspend_archive_2014-06-01T00:00:00_2014-07-01T00:00:00
-rw------- 1 slurm slurm_users      1848 Apr 20 15:27 odyssey_suspend_archive_2014-07-01T00:00:00_2014-08-01T00:00:00
-rw------- 1 slurm slurm_users      1328 Apr 20 15:27 odyssey_suspend_archive_2014-08-01T00:00:00_2014-09-01T00:00:00
-rw------- 1 slurm slurm_users     44904 Apr 20 15:27 odyssey_suspend_archive_2014-09-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users  18489762 Apr 20 15:28 odyssey_step_archive_1969-12-01T00:00:00_1970-01-01T00:00:00
drwxr-xr-x 2 slurm slurm_users      4096 Apr 20 15:28 .

However, the ctld locked up again while it was doing the archive.  We 
can't have this happen as we have a very active cluster.  I let it go 
for about 20 minutes hoping it would finish or return, but no go.  I had 
to eventually kill slurmdbd and restart it with archiving turned off.  
As for the order of operations:

Patched slurm
Installed slurmdbd and slurm-sql
Restarted slurmdbd
Updated slurmdbd.conf to permit purging
Reconfigured to pull in the new conf
Updated mysql rollup
Ran a sacctmgr reconfigure

The last command is what caused the lockup.  Is there any way to do this 
without locking up the ctld?  Even an active forced purge shouldn't do 
this unless it is looking at active data.

Anyway, we can't go ahead with the purge if it does this.  Or at least 
we can't do it without an organized downtime.  Do you happen to know 
how long the purge would take?  Or at least a ballpark (i.e. this much 
time per million rows)?  Is there a way to see where it is in the purge 
as it rolls?  I have no idea where it is when it runs, and I can't 
give my bosses an accurate estimate of how long it will take without 
a progress bar or some diagnostic.

-Paul Edmon-

Comment 45 Brian Christiansen 2015-04-20 11:44:52 MDT
For me, archiving and purging 86520 records (1 month's worth of records for me) took -- in microseconds:

event:   ~900000   (~6M file)
suspend: ~2200000  (~4M file)
step:    ~7000000  (~26M file)
job:     ~19300000 (~20M file)
resv:    ~2400000  (~6M file)

The code archives and purges one month at a time and goes in the order: event, suspend, step, job, resv. So from looking at your archive files, it got through events and suspends and stopped at steps.

What was the last reconfigure for?

What does the following query give you?
select count(*) from odyssey_step_table where time_start <= UNIX_TIMESTAMP('2013-08-01 00:00:00');

I'm just curious because the step archive started at 1969 and the others started at 2013-08-01.

I'm trying to reproduce the lock. I'll let you know what I find.
Comment 46 Paul Edmon 2015-04-21 01:29:46 MDT
So 86,000 is roughly 2 days' worth for us.  Given how many jobs we have, 
it would take roughly 2 hours to do the purge, as we would be purging 
roughly 35 million jobs.

It never got to the steps before, so I think we are heading in the right 
direction.  Though it spitting out 1970 is confusing.  I'm guessing it 
was set to a unix time of 0, which means it was uninitialized?  Or perhaps 
corrupt?

The last reconfigure was to force it to run the archive.  When I made the 
change in mysql, it didn't actually do anything until after I ran the 
reconfigure.  Was that incorrect?  Do I just need to be more patient?

mysql> select count(*) from odyssey_step_table where time_start <=
     -> UNIX_TIMESTAMP('2013-08-01 00:00:00')
     -> ;
+----------+
| count(*) |
+----------+
|    49065 |
+----------+
1 row in set (42.41 sec)

-Paul Edmon-

Comment 47 Brian Christiansen 2015-04-21 09:03:19 MDT
Let's modify the game plan a little bit. Let's try "walking" the purges up until we get caught up. So instead of purging everything older than 6 months, let's say purge everything older than 20 months. Then once that is done, we'll change it to 19 months and so on. This will reduce the amount of time and work the dbd will have to do in one sitting.

I would also recommend doing jobs, steps and reservations separately. We saw that events and suspends went pretty quickly, so I think you could leave them at 6 months and do them in one fell swoop. I think that reservations will be quick too, but just to be safe, I would do them separately.

So, for example. The first time running you would have:
PurgeEventAfter=6month
PurgeSuspendAfter=6month
#PurgeResvAfter=6month
#PurgeJobAfter=20month
#PurgeStepAfter=20month

Once that is finished, then turn on reservations:
PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
#PurgeJobAfter=20month
#PurgeStepAfter=20month

Once reservations are finished then turn on jobs:
PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
PurgeJobAfter=20month
#PurgeStepAfter=20month

Then decrement the month count on Jobs:
PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
PurgeJobAfter=19month
#PurgeStepAfter=20month

and so on ...

The rollup will happen without running "reconfigure". The rollup is triggered every hour or when you restart the slurmdbd. You could stop the dbd, make the config changes, adjust the month's last rollup time, restart, and then repeat for each config change.

Let me know if you have any questions.
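
Between rounds, a quick check from mysql can show whether the previous round finished and roughly how much is left before decrementing the month count. This is only a sketch using this cluster's table names; the cutoff date is an example:

-- How far the monthly rollup/archive has caught up:
SELECT FROM_UNIXTIME(monthly_rollup) FROM odyssey_last_ran_table;

-- How many job records are still older than the current purge cutoff:
SELECT COUNT(*) FROM odyssey_job_table
 WHERE time_submit <= UNIX_TIMESTAMP('2013-12-01 00:00:00');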
Comment 48 Brian Christiansen 2015-04-21 09:05:31 MDT
For the time_start 0 steps what do the following queries give you?

select count(*) from odyssey_step_table where time_start = 0;

select t1.time_submit from odyssey_job_table t1, odyssey_step_table t2 where t1.job_db_inx = t2.job_db_inx and t2.time_start = 0 LIMIT 1;
Comment 49 Brian Christiansen 2015-04-21 09:40:37 MDT
The most likely situation for the time_start=0 scenario is that when the slurmdbd is down, the slurmctld queues up the records to send to the dbd when it comes back up. The slurmctld will eventually start throwing away step start records. So what could have happened is that a start record was thrown away, but the step end record was sent to the dbd when it came back up.
Comment 50 Paul Edmon 2015-04-21 13:55:13 MDT
Ah, that would make sense.

I like the plan of a walking purge.  I had been considering that myself 
as a possible solution.  I will work on that tomorrow and get you the 
information that you requested.

I was not aware that the rollup/dump behavior happened on the hour.  It 
would be good to document this somewhere.  It would also be good if the 
slurmdbd could be a bit more talkative when it is doing a purge, dump or 
rollup.  We see it doing things, but we aren't sure what it is up to, 
how long it will take or where it is.  Any info (even basic info) would 
be more comforting than none.  Or perhaps we don't have the verbosity on 
the logging turned up far enough.

-Paul Edmon-

Comment 51 Paul Edmon 2015-04-22 01:40:32 MDT
mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    49393 |
+----------+
1 row in set (20.68 sec)

mysql> select t1.time_submit from odyssey_job_table t1, odyssey_step_table t2 where
    -> t1.job_db_inx = t2.job_db_inx and t2.time_start = 0 LIMIT 1;
+-------------+
| time_submit |
+-------------+
|  1377014448 |
+-------------+
1 row in set (0.17 sec)

I'm going to try to do the 20 month purge now.  If successful I will 
walk down month by month till I hit 6 months.

-Paul Edmon-

Comment 52 Paul Edmon 2015-04-22 02:21:30 MDT
Yeah, no go.  It locked up again, even when I let it just run at the 
hour mark only purging things older than 20 months.  The mysql process list 
at the time was:

mysql> show processlist;
| Id     | User        | Host                                 | db    | Command     | Time    | State                                                          | Info                                                                                                  |
|     12 | replication | db-internal.rc.fas.harvard.edu:49224 | NULL  | Binlog Dump | 1203386 | Has sent all binlog to slave; waiting for binlog to be updated | NULL                                                                                                  |
| 289759 | slurm       | localhost                            | slurm | Sleep       |    1077 |                                                                | NULL                                                                                                  |
| 289761 | slurm       | localhost                            | slurm | Query       |     586 | update                                                         | insert into "odyssey_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nod |
| 289805 | slurm       | localhost                            | slurm | Query       |      14 | Sorting result                                                 | select job_db_inx, id_step, time_start, time_end, time_suspended, step_name, nodelist, node_inx, sta |
| 289861 | root        | localhost                            | NULL  | Query       |       0 | NULL                                                           | show processlist                                                                                      |
5 rows in set (0.00 sec)

I've restarted the slurmdbd without the purge.

It seems to get hung up on the job table.

-Paul Edmon-

Comment 53 Brian Christiansen 2015-04-22 03:50:12 MDT
Shoot. Sorry. I'll take another look.
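
In the meantime, if it wedges again, the InnoDB lock tables can show which transaction is blocking which. This is only a sketch; it assumes the information_schema InnoDB tables present in MySQL 5.5+/MariaDB, and the column names vary slightly between versions:

-- Which transaction is waiting on which, and what each is running:
SELECT r.trx_id    AS waiting_trx,
       r.trx_query AS waiting_query,
       b.trx_id    AS blocking_trx,
       b.trx_query AS blocking_query
  FROM information_schema.innodb_lock_waits w
  JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
  JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;

-- Or dump the full lock and transaction state:
SHOW ENGINE INNODB STATUS\G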
Comment 54 Brian Christiansen 2015-04-22 08:55:16 MDT
Created attachment 1837 [details]
deadlock patch

Ok. We actually found a lock yesterday while trying to reproduce the deadlock. I mistakenly thought it didn't apply to you. In testing today, when I don't have the patch in I can get the controller to lock while archiving/purging is happening. 

Will you apply this patch (attached) and try again?

The commit:
https://github.com/SchedMD/slurm/commit/59b9c909b61e4543909255bf9191ac29a7e6af44
Comment 55 Paul Edmon 2015-04-22 14:06:40 MDT
Sure will do tomorrow.  Thanks.

-Paul Edmon-

Comment 56 Paul Edmon 2015-04-23 02:04:16 MDT
Patch applied.  Looks like it is running the archive without 
deadlocking now.  I'll keep you updated.

-Paul Edmon-

Comment 57 Paul Edmon 2015-04-23 02:38:53 MDT
Yeah, it still deadlocks, even after the patch.  Watching mysql I see:

mysql> show processlist;
| Id     | User        | Host                                 | db    | Command     | Time    | State                                                          | Info                                                                                                  |
|     12 | replication | db-internal.rc.fas.harvard.edu:49224 | NULL  | Binlog Dump | 1290759 | Has sent all binlog to slave; waiting for binlog to be updated | NULL                                                                                                  |
| 300314 | slurm       | localhost                            | slurm | Sleep       |     624 |                                                                | NULL                                                                                                  |
| 300315 | slurm       | localhost                            | slurm | Query       |      22 | Sorting result                                                 | select job_db_inx, id_step, time_start, time_end, time_suspended, step_name, nodelist, node_inx, sta |
| 300321 | slurm       | localhost                            | slurm | Query       |     357 | update                                                         | insert into "odyssey_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nod |
| 300397 | root        | localhost                            | NULL  | Query       |       0 | NULL                                                           | show processlist                                                                                      |
5 rows in set (0.00 sec)

mysql> quit
Bye

It actually chugs along fine without deadlocking for a bit, but once it 
hits around this query it stalls out.  It seems to happen when a request 
to modify the DB comes in.  I will also note that the select job_db query 
keeps looping.  Namely, it will run for about 30 seconds and then start 
again.  I can't tell if it's the same query happening over and over, though.

Anyway, the deadlock in the ctld is still a problem even after the 
patch.  Let me know if you want me to try anything on my end.

-Paul Edmon-

Comment 58 Brian Christiansen 2015-04-23 03:45:56 MDT
OK. I'll keep looking. Thanks for your help.

How long would you say it took before deadlocking? How many archive files got created?
Comment 59 Paul Edmon 2015-04-23 03:48:42 MDT
So it made the 1970 step file and that's it.  It got as far as creating 
the step files, and I saw it doing a delete on the step parts as well in 
mysql.  It seems to be when it hits the job_db query that it deadlocks.  
So a few minutes or so.

-Paul Edmon-

Comment 60 Brian Christiansen 2015-04-23 05:29:26 MDT
Did you happen to restart the slurmctld before running the last test? The last patch was for the controller. Sorry I didn't point that out if you didn't.
Comment 61 Paul Edmon 2015-04-23 05:38:08 MDT
I did as I figured that that was what the patch was for.

-Paul Edmon-

Comment 62 Brian Christiansen 2015-04-23 05:44:44 MDT
ok. Will you run the test again and when the controller locks up will you get gdb backtraces from the controller and the dbd?

gdb attach <slurmctld_pid>
thread apply all bt
quit

gdb attach <slurmdbd_pid>
thread apply all bt
quit
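
If it's easier, the same backtraces can be captured non-interactively and saved to files, something like this (a sketch; the PIDs and output file names are placeholders):

gdb -p <slurmctld_pid> -batch -ex "thread apply all bt" > slurmctld_bt.txt
gdb -p <slurmdbd_pid> -batch -ex "thread apply all bt" > slurmdbd_bt.txt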

Thanks,
Brian
Comment 63 Paul Edmon 2015-04-23 05:48:48 MDT
Sure.  I'm also going to change my compiler flags to remove optimization 
so that things don't get optimized out.

-Paul Edmon-

Comment 64 Brian Christiansen 2015-04-23 05:51:51 MDT
Good idea. Thanks.
Comment 65 Paul Edmon 2015-04-23 07:26:51 MDT
This time slurmctld didn't crash.  However slurmdbd did:

Apr 23 14:01:27 holy-slurm01 slurmdbd[2511]: slurmdbd version 14.11.4 started
Apr 23 14:01:27 holy-slurm01 slurmctld[55664]: Registering slurmctld at port 6820 with slurmdbd.
Apr 23 14:03:06 holy-slurm01 slurmctld[55664]: slurmdbd: agent queue size 50
Apr 23 14:34:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1472
Apr 23 14:34:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
Apr 23 14:49:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1407
Apr 23 15:04:32 holy-slurm01 slurmdbd[2511]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 23 15:04:32 holy-slurm01 slurmdbd[2511]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1472
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: slurmdbd: reopening connection
Apr 23 15:04:37 holy-slurm01 slurmctld[55664]: error: slurmdbd: Sending message type 1407: 11: Connection refused
Apr 23 15:04:54 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 23 15:05:00 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

I'm going to turn off the purge again.  Let me know if you want any 
other info.

-Paul Edmon-

Comment 66 Brian Christiansen 2015-04-23 07:40:22 MDT
How long did it go for?
Did you see any new archive files being created?
What is your current innodb_lock_wait_timeout set to?

> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 3600  |
+--------------------------+-------+
1 row in set (0.00 sec)


Are you just purging jobs? Just confirming. I would also suggest doing steps first so that it doesn't leave dangling steps (changing what I said earlier).
Comment 67 Paul Edmon 2015-04-23 07:46:05 MDT
Given the log it went for about a half hour before throwing the first 
error (14:34) and then the fatal error came at 15:04 which is just over 
an hour in.  The only file it produced was:

-rw------- 1 slurm slurm_users  18934293 Apr 23 14:02 odyssey_step_archive_1969-12-01T00:00:00_1970-01-01T00:00:00

The rest in the directory are from previous runs.

mysql> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 3600  |
+--------------------------+-------+
1 row in set (0.00 sec)

mysql>

Right now I am running an archive and purge with the following settings:

ArchiveJobs=yes
ArchiveDir=/slurm/archive
ArchiveSteps=yes
ArchiveResvs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
#ArchiveScript=
PurgeEventAfter=20month
PurgeJobAfter=20month
PurgeResvAfter=20month
PurgeStepAfter=20month
PurgeSuspendAfter=20month

I can turn off purging for everything else if we just want to dump one 
at a time. That may be wise at this point.

-Paul Edmon-

Comment 68 Brian Christiansen 2015-04-23 08:38:44 MDT
Ya, I think purging one at a time is a good idea. Events, Suspends, and Resvs will probably go quick.
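
If it helps, purging one record type at a time just means leaving a single Purge* setting active in slurmdbd.conf per pass -- for example, starting with steps (a sketch based on the settings already posted above; the 19month value is only an example):

ArchiveDir=/slurm/archive
ArchiveSteps=yes
PurgeStepAfter=19month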
Comment 69 Brian Christiansen 2015-04-23 08:54:29 MDT
Hold off on purging jobs and steps for a minute -- in case you were about to. I found an issue with the time_start = 0 jobs. The code is trying to walk up from 1969 and doing selects for each month which is taking a long time. I'll get back to you with a solution. brb.
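
Roughly speaking, each step of that walk issues a select bounded to a one-month window, along these lines (an illustration only -- the column list is taken from the query visible in the processlist above, the table name is this site's, and the exact boundary column and form of the real query are internal to slurmdbd):

select job_db_inx, id_step, time_start, time_end, time_suspended, step_name, nodelist, node_inx
from odyssey_step_table
where time_start >= UNIX_TIMESTAMP('2013-08-01 00:00:00')
  and time_start <= UNIX_TIMESTAMP('2013-09-01 00:00:00') - 1;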
Comment 70 Brian Christiansen 2015-04-23 09:51:09 MDT
OK. Will you run these commands in mysql? They set the start time of the steps with time_start=0 to the start time of the corresponding job. I did it in a 
transaction just to make sure that all of the time_start=0 steps get updated.

First see if there are any jobs that have a time_start of 0 as well. If there are, then update the job_table first.

Let me know if you have any questions.

> start transaction;
Query OK, 0 rows affected (0.00 sec)

> select count(*) from compy_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.12 sec)

> UPDATE compy_job_table SET time_start = time_end WHERE time_start = 0;
Query OK, 30 rows affected (4.01 sec)
Rows matched: 30  Changed: 30  Warnings: 0

MariaDB [slurm_acct_test3]> select count(*) from compy_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.18 sec)


Then update the steps:

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|       31 |
+----------+
1 row in set (1.47 sec)

> UPDATE compy_step_table st, compy_job_table jt SET st.time_start = jt.time_start WHERE st.time_start = 0 AND st.job_db_inx = jt.job_db_inx;
Query OK, 31 rows affected (3.65 sec)
Rows matched: 31  Changed: 31  Warnings: 0

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (3.23 sec)

> commit;
Query OK, 0 rows affected (0.00 sec)
Comment 71 Brian Christiansen 2015-04-23 11:05:14 MDT
To log and see progress, you can add the DB_USAGE debug flag and set the debug level to "debug"; you'll then see messages like these:

[2015-04-23T15:04:38.671] debug:  Purging job entries before 1422777600 for compy
[2015-04-23T15:04:58.486] debug:  Purging job entries before 1425196799 for compy
[2015-04-23T15:05:17.311] debug:  Purging resv entries before 1409554800 for compy
[2015-04-23T15:05:18.136] debug:  Purging resv entries before 1412146800 for compy
[2015-04-23T15:05:19.472] debug:  Purging resv entries before 1414825200 for compy

slurmdbd.conf:
DebugFlags=DB_USAGE
DebugLevel=debug
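
Independent of those log messages, a rough way to watch the purge advance from the MySQL side is to check the oldest record left in a table, e.g. (a sketch; the table name is this site's, and time_end is only a convenient proxy for record age):

select from_unixtime(min(time_end)) as oldest_job from odyssey_job_table where time_end > 0;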
Comment 72 Paul Edmon 2015-04-24 02:06:53 MDT
I had a hunch that it was walking month by month from 1970 to today.  I 
was actually going to try to confirm that today until I saw your note.  
No wonder it was taking so long.

Here are the results of the commands you gave me:

mysql> use slurm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> start transaction;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from odyssey_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|  4110229 |
+----------+
1 row in set (8.63 sec)

mysql> UPDATE odyssey_job_table SET time_start = time_end WHERE time_start = 0;
Query OK, 4097744 rows affected (4 min 9.55 sec)
Rows matched: 4110255  Changed: 4097744  Warnings: 0

mysql> select count(*) from odyssey_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    12485 |
+----------+
1 row in set (8.60 sec)

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    50181 |
+----------+
1 row in set (18.75 sec)

mysql> UPDATE odyssey_step_table st, odyssey_job_table jt SET st.time_start = jt.time_start WHERE st.time_start = 0 AND st.job_db_inx = jt.job_db_inx;
Query OK, 50043 rows affected (6 min 20.14 sec)
Rows matched: 50181  Changed: 50043  Warnings: 0

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      138 |
+----------+
1 row in set (19.23 sec)

mysql> commit;
Query OK, 0 rows affected (0.20 sec)

mysql> quit
Bye

So it looks like there are still quite a few entries that have 0 
time.  These commands didn't get them all in our environment. Clearly 
entries with 0 time are bogus; is there a way to make the purge just 
ignore them when calculating where to start doing monthly steps?  
Namely, automatically take out all the ones with 0 time, then start the 
monthly purge from the next non-zero entry.

I suppose a way I could get around it for now would be to set the purge 
time to 544 months, have it purge all the 0 records, and then set it to 
something more sane.

Another option would be, for jobs that slurm doesn't know the start date 
for but does have an entry, to set the start date to the time that the 
entry is made.

Either that, or when the database is down, increase the buffer that 
holds transactions until it comes back.

I turned on the additional database debugging flags you suggested.  I 
would rather have too much data than too little.

Let me know how you want to proceed.

-Paul Edmon-

Comment 73 Brian Christiansen 2015-04-24 02:59:56 MDT
I'll get you a patch that skips the periods in between that don't have any records. I'm testing it this morning.
Comment 74 Brian Christiansen 2015-04-24 06:16:31 MDT
Created attachment 1840 [details]
archive patch 2

I've attached the additional patch that skips the months without records.
The commit:
https://github.com/SchedMD/slurm/commit/8c1a151277302f92e38a25264136c0170b3677bc

So jobs with a time_start of 0 are ok. This means that the jobs haven't started yet. With the steps, I'm guessing that they are dangling steps (which shouldn't happen) since they didn't get updated with the queries. Maybe they got added when accounting was just turned on?

You can verify that they are dangling or not by running these queries:

> select count(*) from compy_step_table st left join compy_job_table jt on st.job_db_inx = jt.job_db_inx where jt.job_db_inx is null;
+----------+
| count(*) |
+----------+
|      549 |
+----------+
1 row in set (0.16 sec)

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      549 |
+----------+
1 row in set (0.05 sec)

If they are dangling, they could simply be deleted, or they'll just get purged out.

Let me know if you have any questions.

Thanks,
Brian
Comment 75 Paul Edmon 2015-04-24 06:30:57 MDT
Thanks for the patch.  I will try it out.

Here are the query results:

mysql> select count(*) from odyssey_step_table st left join 
odyssey_job_table jt on st.job_db_inx = jt.job_db_inx where 
jt.job_db_inx is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (1 min 2.04 sec)

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      174 |
+----------+
1 row in set (21.40 sec)

mysql>

There have been periods in the past when we had the ctld running without 
the dbd, particularly when we have done major version updates.  In those 
cases the ctld could be running for an hour or two without the dbd.

-Paul Edmon-

Comment 76 Brian Christiansen 2015-04-24 07:18:23 MDT
Interesting. Looks like there aren't any dangling steps. Will you run this query?

select jt.*, ":", st.* from compy_job_table jt, compy_step_table st where st.job_db_inx = jt.job_db_inx and st.time_start = 0 LIMIT 1 \G;
Comment 77 Paul Edmon 2015-04-24 07:33:23 MDT
So I successfully ran a partial purge.  It purged resv, suspend, and 
event down to 6 months, and steps down to 19 months.  I'm now trying 12 
months of steps.  If successful I will keep walking it down to 6 months 
and then try to do the jobs.  Here is the query:

mysql> select jt.*, ":", st.* from odyssey_job_table jt, 
odyssey_step_table st where st.job_db_inx = jt.job_db_inx and 
st.time_start = 0 LIMIT 1 \G;
*************************** 1. row ***************************
          job_db_inx: 42138008
            mod_time: 0
             deleted: 0
             account: davis_lab
      array_task_str: NULL
     array_max_tasks: 0
  array_task_pending: 0
            cpus_req: 1
          cpus_alloc: 1
          derived_ec: 0
          derived_es: NULL
           exit_code: 0
            job_name: shift.csh
            id_assoc: 4146
        id_array_job: 0
       id_array_task: 4294967294
            id_block: NULL
              id_job: 38784352
              id_qos: 1
             id_resv: 0
            id_wckey: 0
             id_user: 59769
            id_group: 10209
         kill_requid: -1
             mem_req: 16000
            nodelist: holy2a11207
         nodes_alloc: 1
            node_inx: 318
           partition: serial_requeue
            priority: 15696570
               state: 1024
           timelimit: 1440
         time_submit: 1429881377
       time_eligible: 1429881387
          time_start: 1429883645
            time_end: 1429902954
      time_suspended: 0
            gres_req:
          gres_alloc:
           gres_used:
               wckey:
         track_steps: 0
                   :: :
          job_db_inx: 42138008
             deleted: 0
          cpus_alloc: 1
           exit_code: 15
             id_step: -2
         kill_requid: -1
            nodelist: holy2a11207
         nodes_alloc: 1
            node_inx: 318
               state: 4
           step_name: batch
            task_cnt: 1
           task_dist: 0
          time_start: 0
            time_end: 1429903409
      time_suspended: 0
            user_sec: 8027
           user_usec: 5999
             sys_sec: 5834
            sys_usec: 2999
           max_pages: 31
      max_pages_task: 0
      max_pages_node: 0
           ave_pages: 31
             max_rss: 5646544
        max_rss_task: 0
        max_rss_node: 0
             ave_rss: 3235840
           max_vsize: 6639576
      max_vsize_task: 0
      max_vsize_node: 0
           ave_vsize: 3706388
             min_cpu: 13862
        min_cpu_task: 0
        min_cpu_node: 0
             ave_cpu: 13862
         act_cpufreq: 130330
     consumed_energy: 0
         req_cpufreq: 0
       max_disk_read: 324573.793987
  max_disk_read_task: 0
  max_disk_read_node: 0
       ave_disk_read: 324573.793987
      max_disk_write: 62103.008348
max_disk_write_task: 0
max_disk_write_node: 0
      ave_disk_write: 62103.008348
1 row in set (2 min 44.12 sec)

ERROR:
No query specified

-Paul Edmon-

Comment 78 Brian Christiansen 2015-04-27 04:14:46 MDT
Hey Paul, how did the step archiving and purging go?
Comment 79 Paul Edmon 2015-04-27 04:19:51 MDT
So I'm still working through the steps.  I'm at 18 months.  The 
slurmctld still stalls out occasionally.  I think it has to do with 
someone using an scontrol command to modify a job while the dump is 
going.  It seems to pause when the ctld prints the "updating accounting" 
statement.  Given that it still stalls, I'm going to try to do some more 
tonight, off hours, so that it will be less impactful to users.  
Hopefully I will be able to get the steps down to 6 months tonight.  
Then I can start trying the jobs section and see what happens.

Regardless, this is more progress than we've made so far.  Hopefully I 
can finish whittling this down to what we wanted.  Each month of steps 
seems to take about 5-10 minutes to archive and purge.

-Paul Edmon-

Comment 80 Brian Christiansen 2015-04-27 05:03:54 MDT
Created attachment 1843 [details]
archive patch 3

Glad we're making progress. Your suspicions about the update causing the deadlock are probably correct. We made several other possible deadlock fixes in 15.08 last week, based off the scenario of the other one, and this was one of them. It would happen if you submit a job and then quickly try to modify it while the db is busy archiving/purging. I've attached a patch that will fix this, if you want to put it in.

Thanks,
Brian
Comment 81 Paul Edmon 2015-04-27 05:35:34 MDT
Yup, that's exactly the scenario.  We have a probe job that we submit 
every 5 minutes and then modify to the top priority.  Thanks for the 
patch.  I will try it out.

-Paul Edmon-

Comment 82 Paul Edmon 2015-04-27 08:56:44 MDT
The new patch is looking good.  I've just about purged down to 6 months 
of step data.  Assuming that completes okay, I will start on the job data.

Hopefully we have this sorted at this point.  I won't declare victory 
until we are done, though, as the job data is the bulk of the data.

-Paul Edmon-

Comment 83 Paul Edmon 2015-04-28 02:17:13 MDT
So, an update.  The step purge was successful and I did an initial purge 
down to 18 months for jobs.  Interestingly, the job dump was different 
from the other purges.  Instead of doing it per month, it did a purge of 
two months:

-rw------- 1 slurm slurm_users  621864632 Apr 28 09:49 odyssey_job_archive_2013-08-01T00:00:00_2013-09-30T23:59:59

That's fine, but it is different from the other archives, which are per 
month.  I will keep purging down to 6 months and hopefully that will be 
that.

-Paul Edmon-

Comment 84 Paul Edmon 2015-04-28 02:40:06 MDT
Never mind, it is now dumping per month.  Interestingly, though, that 
first archive covered two months.

-Paul Edmon-


Comment 85 Brian Christiansen 2015-04-28 02:57:39 MDT
I can understand that. The code only walks up by month if the oldest record is more than 60 days older than the purge point. So the oldest record probably wasn't 60 days older than the 20-month -- or whatever number you had -- boundary, so it just did all it could in one go.
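
The comparison being described can be approximated from the MySQL side, e.g. (a sketch only -- the real check happens inside slurmdbd, the table name is this site's, and time_end is just a stand-in for record age):

select from_unixtime(min(time_end)) as oldest_record from odyssey_job_table where time_end > 0;
select from_unixtime(unix_timestamp(now() - interval 20 month) - 60*86400) as month_walk_threshold;

If oldest_record is older than month_walk_threshold, the month-by-month walk kicks in; otherwise the purge just does everything older than the boundary in one go, as happened here.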
Comment 86 Paul Edmon 2015-04-28 02:58:34 MDT
Ah, that makes sense.  I will let you know when the purges are done.  I 
think at that point we can close this out.

-Paul Edmon-

Comment 87 Brian Christiansen 2015-04-28 03:07:33 MDT
Ok. Let me know when you're done and then I'll close it out. Thanks for your help!
Comment 88 Paul Edmon 2015-04-28 07:30:35 MDT
Okay, just finished.  Got us down to 6 months.  Thanks for all your 
help.  This was a long one.

-Paul Edmon-

Comment 89 Brian Christiansen 2015-04-28 07:36:53 MDT
Awesome! You can hold on to the patches until 15.08 or not worry about them since the original code will just be archiving one month at a time -- now that you are caught up.

Thanks again for your help and patience.
Comment 90 Paul Edmon 2015-04-28 07:37:47 MDT
I'll keep them anyway, if only out of paranoia.  Thanks.

-Paul Edmon-
