Last night we had a scheduled purge of our 40 GB, two-year-old job database. Just prior to this it appears that slurmdbd suffered the memory leak noted in bugs.schedmd.com/show_bug.cgi?id=1534. In spite of being in this state it did successfully begin the purge. It successfully dumped the data in events and suspend, but when it got to jobs it threw the following errors:

Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61116 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61103 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61116 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61109 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: pack32: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: packmem: Buffer size limit exceeded (-61117 > -65536)
Apr 1 00:04:37 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr 1 00:04:37 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr 1 00:04:42 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr 1 00:04:42 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:05:22 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:05:44 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:05:56 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:06:28 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:06:37 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:06:43 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:06:58 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:07:38 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:07:53 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:08:15 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:08:32 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:08:55 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:09:11 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:10:07 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:10:18 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:13:29 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 00:14:27 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

At that point slurmdbd crashed. It crashed again immediately when our Puppet software auto-restarted it:

Apr 1 01:23:22 holy-slurm01 puppet-agent[12548]: (/Stage[main]/Slurm::Master/Service[slurmdbd]/ensure) ensure changed 'stopped' to 'running'
Apr 1 01:24:07 holy-slurm01 slurmdbd[14045]: slurmdbd version 14.11.4 started
Apr 1 01:24:07 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.
Apr 1 01:24:13 holy-slurm01 slurmctld[55137]: slurmdbd: agent queue size 11800
Apr 1 01:24:20 holy-slurm01 slurmctld[55137]: slurmdbd: agent queue size 2850
Apr 1 01:32:19 holy-slurm01 slurmdbd[14045]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 01:32:20 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr 1 01:32:20 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr 1 01:32:25 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr 1 01:32:25 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 01:32:41 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 01:32:49 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

[root@holy-slurm01 log]# grep fatal messages
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 03:46:31 holy-slurm01 slurmdbd[51927]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 05:52:05 holy-slurm01 slurmdbd[63753]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 07:52:29 holy-slurm01 slurmdbd[7120]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 09:55:48 holy-slurm01 slurmdbd[58235]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

[root@holy-slurm01 log]# grep fatal mysql
grep: mysql: No such file or directory

[root@holy-slurm01 log]# grep mysql messages
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 00:04:30 holy-slurm01 slurmdbd[41341]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 01:32:19 holy-slurm01 slurmdbd[14045]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 01:32:19 holy-slurm01 slurmdbd[14045]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 03:46:31 holy-slurm01 slurmdbd[51927]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 03:46:31 holy-slurm01 slurmdbd[51927]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 05:52:05 holy-slurm01 slurmdbd[63753]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 05:52:05 holy-slurm01 slurmdbd[63753]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 07:52:29 holy-slurm01 slurmdbd[7120]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 07:52:29 holy-slurm01 slurmdbd[7120]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 09:55:48 holy-slurm01 slurmdbd[58235]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 09:55:48 holy-slurm01 slurmdbd[58235]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I have the full logs, but they are massive given the other errors related to the memory leak: messages is 5.8G, and grepping out only the slurmdbd errors still gives 5.5G. These are larger than I can attach.

I suspect that purging a database this size is a no-go. I had attempted before to force the purge from the command line, but it timed out; I was told that letting it run automatically on the 1st of the month would avoid this problem. There is either a bug here, or the timeouts are not sufficient to purge a 36-million-entry database. Things are back up and stable, but our original goal of shrinking our job database has not been met. It would be good if we could safely do this.
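A quick way to gauge what a purge of this size is up against is to ask information_schema for the table sizes first. The following is only a sketch: it assumes the accounting database is named 'slurm' and the table prefix is 'odyssey' (both visible in the processlist and monitor output later in this thread), and table_rows is only an estimate for InnoDB tables:

mysql> SELECT table_name, table_rows,
    ->        ROUND(data_length/1024/1024/1024, 1) AS data_gb
    ->   FROM information_schema.tables
    ->  WHERE table_schema = 'slurm'
    ->    AND table_name IN ('odyssey_job_table', 'odyssey_step_table');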
Actually, scratch that. slurmdbd is still crashing. Every time it restarts it hits the same error chain described above. To reiterate:

Apr 1 10:01:41 holy-slurm01 slurmdbd[50263]: slurmdbd version 14.11.4 started
Apr 1 10:01:42 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.
Apr 1 10:36:39 holy-slurm01 slurmdbd[50263]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 1 10:36:39 holy-slurm01 slurmdbd[50263]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 1 10:36:40 holy-slurm01 slurmctld[55137]: error: slurmdbd: Getting response to message type 1407
Apr 1 10:36:40 holy-slurm01 slurmctld[55137]: slurmdbd: reopening connection
Apr 1 10:36:46 holy-slurm01 slurmctld[55137]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Apr 1 10:36:46 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:37:02 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:37:29 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:19 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:32 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:39 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:44 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:51 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:38:59 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:39:06 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:39:23 holy-slurm01 slurmctld[55137]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 1 10:39:35 holy-slurm01 slurmdbd[46066]: chdir to /var/tmp
Apr 1 10:39:35 holy-slurm01 slurmdbd[46066]: slurmdbd version 14.11.4 started
Apr 1 10:39:36 holy-slurm01 slurmctld[55137]: Registering slurmctld at port 6820 with slurmdbd.

The dbd is crashing each time. Essentially we can't keep a stable slurmdbd right now; I can keep restarting it, but it keeps crashing.
Sorry you are going through this, Paul. I wouldn't expect this kind of behavior.

I am guessing that when you say "Things are back up and stable" you have taken out the purge options from the slurmdbd.conf. Would it be possible to try again?

What does 'SHOW INNODB STATUS' give you when run in a mysql prompt during the rollup?

Is it possible you have another process, like mysqldump, running in the background at the same time? That is usually the cause of problems like this; if so, we can do something to make things better. Bug 1304 sounds very similar to this.

One way to get around the ER_LOCK_WAIT_TIMEOUT is to increase innodb_lock_wait_timeout in your my.cnf. I remember the default used to be a 15 minute timeout, but according to the docs MySQL now defaults to 50 seconds instead (http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout). If that is the case, and other things are working on the database at the same time, I can easily see how 50 seconds wouldn't be long enough.

Could you try increasing this to 900 or greater and restarting the mysqld and the dbd?
I haven't changed the DB yet, so we are still running with the settings that automatically initiated the purge. I can remove those if you think it will keep it from crashing. Here is the show innodb status:

[root@holy-slurm01 log]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1710474
Server version: 5.1.69-log Source distribution

mysql> show innodb status;
=====================================
150401 10:48:05 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 22 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 421352606, signal count 411903012
Mutex spin waits 0, rounds 11433585640, OS waits 38695224
RW-shared spins 641564306, OS waits 312022856; RW-excl spins 72794209, OS waits 49283733
------------
TRANSACTIONS
------------
Trx id counter 0 371325858
Purge done for trx's n:o < 0 371325133 undo n:o < 0 0
History list length 132
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 4821, OS thread id 140336043128576
MySQL thread id 1710474, query id 235838008 localhost root
show innodb status
---TRANSACTION 0 371325852, not started, process no 4821, OS thread id 140348404791040
MySQL thread id 1710029, query id 235837350 localhost slurm
---TRANSACTION 0 371325128, not started, process no 4821, OS thread id 140346243675904
MySQL thread id 1710025, query id 235823044 localhost slurm
---TRANSACTION 0 371325815, ACTIVE 53 sec, process no 4821, OS thread id 140348415973120
mysql tables in use 1, locked 0
MySQL thread id 1710451, query id 235838009 localhost slurm
select t1.id_step, t1.time_start, t1.time_end, t1.time_suspended, t1.step_name, t1.nodelist, t1.node_inx, t1.state, t1.kill_requid, t1.exit_code, t1.nodes_alloc, t1.cpus_alloc, t1.task_cnt, t1.task_dist, t1.user_sec, t1.user_usec, t1.sys_sec, t1.sys_usec, t1.max_disk_read, t1.max_disk_read_task, t1.max_disk_read_node, t1.ave_disk_read, t1.max_disk_write, t1.max_disk_write_task, t1.max_disk_write_node, t1.ave_disk_write, t1.max_vsize, t1.max_vsize_task, t1.max_vsize_node, t1.ave_vsize, t1.max_rss, t1.max_rss_task, t1.max_rss_node, t1.ave_rss, t1.max_pages, t1.max_pages_task, t1.max_pages_node,
Trx read view will not see trx with id >= 0 371325816, sees < 0 371325130
---TRANSACTION 0 371325257, ACTIVE 453 sec, process no 4821, OS thread id 140348424492800
MySQL thread id 1710109, query id 235832975 localhost slurm
Trx read view will not see trx with id >= 0 371325258, sees < 0 371325130
---TRANSACTION 0 371325130, ACTIVE 510 sec, process no 4821, OS thread id 140348436739840 fetching rows, thread declared inside InnoDB 58
mysql tables in use 2, locked 0
267 lock struct(s), heap size 30704, 1570 row lock(s), undo log entries 1570
MySQL thread id 1710027, query id 235837349 localhost slurm Copying to tmp table
select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_alloc, job.cpus_req, job.id_resv, SUM(step.consumed_energy) from "odyssey_job_table" as job left outer join "odyssey_step_table" as step on job.job_db_inx=step.job_db_inx and (step.id_step>=0) where (job.time_eligible < 1427850000 && (job.time_end >= 1427846400 || job.time_end = 0)) group by job.job_db_inx order by job.id_assoc, job.time_eligible
Trx read view will not see trx with id >= 0 371325131, sees < 0 371325128
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0, ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1087776614 OS file reads, 23847621 OS file writes, 14255701 OS fsyncs
4181.90 reads/s, 55273 avg bytes/read, 12.45 writes/s, 1.95 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 398, seg size 400,
5807829 inserts, 5807829 merged recs, 957399 merges
Hash table size 17393, node heap has 1 buffer(s)
15750.51 hash searches/s, 29272.12 non-hash searches/s
---
LOG
---
Log sequence number 144 2401285555
Log flushed up to   144 2401285555
Last checkpoint at  144 2401285555
0 pending log writes, 0 pending chkp writes
12545380 log i/o's done, 0.86 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 21146200; in additional pool allocated 1047552
Dictionary memory allocated 176120
Buffer pool size   512
Free buffers       0
Database pages     510
Modified db pages  0
Pending reads 1
Pending writes: LRU 0, flush list 0, single page 0
Pages read 2993816805, created 1381330, written 13753708
14108.54 reads/s, 0.05 creates/s, 13.73 writes/s
Buffer pool hit rate 955 / 1000
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
4 read views open inside InnoDB
Main thread process no. 4821, id 140346901477120, state: sleeping
Number of rows inserted 4942516, updated 5005110, deleted 9622643, read 83360981565
0.59 inserts/s, 10.09 updates/s, 0.00 deletes/s, 542952.55 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
1 row in set, 1 warning (0.01 sec)

I will answer the rest of the questions in a separate note.

-Paul Edmon-
We do have another query going:

mysql> show processlist;
| Id      | User        | Host                                 | db    | Command     | Time    | State                                                          | Info             |
| 7       | replication | db-internal.rc.fas.harvard.edu:51846 | NULL  | Binlog Dump | 1645344 | Has sent all binlog to slave; waiting for binlog to be updated | NULL             |
| 1710025 | slurm       | localhost                            | slurm | Sleep       | 626     |                                                                | NULL             |
| 1710027 | slurm       | localhost                            | slurm | Query       | 59      | Copying to tmp table                                           | select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_elig |
| 1710029 | slurm       | localhost                            | slurm | Sleep       | 1       |                                                                | NULL             |
| 1710109 | slurm       | localhost                            | slurm | Sleep       | 564     |                                                                | NULL             |
| 1710474 | root        | localhost                            | NULL  | Query       | 0       | NULL                                                           | show processlist |
| 1710502 | slurm       | localhost                            | slurm | Sleep       | 65      |                                                                | NULL             |
7 rows in set (0.00 sec)

But it wasn't a dump and shouldn't have caused a delay in dumping. Our my.cnf looks like:

[root@holy-slurm01 etc]# cat my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
!includedir /etc/mysql.d

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

So it looks like we are using the defaults. I remember asking Danny about this, and he recommended against changing it if we didn't have to. Do you think we should increase the timeout?

For reference, our version of mysql is:

[root@holy-slurm01 etc]# rpm -qa | grep mysql
mysql-5.1.69-1.el6_4.x86_64
mysql-devel-5.1.69-1.el6_4.x86_64
mysql-libs-5.1.69-1.el6_4.x86_64
nagios-plugins-mysql-1.4.16-5.el6.x86_64
mysql-server-5.1.69-1.el6_4.x86_64

-Paul Edmon-
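For reference, the values the running server is actually using can be confirmed from a mysql prompt; on a stock 5.1 build the expected defaults would be a 50-second lock wait timeout and an 8 MB buffer pool (this is a sketch, not output from the system above):

mysql> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
mysql> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

On MySQL 5.1's built-in InnoDB these variables may not be settable at runtime, so the safe route is to change my.cnf and restart mysqld, which is what was done here.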
No, don't remove anything from the slurmdbd.conf unless we can't fix the issue readily.

Thanks for the information, Paul. It looks like that select is taking a long time, which I suppose is to be expected.

Could you try increasing the timeout in my.cnf? I am not sure what 'Binlog Dump' does. It looks harmless, but perhaps it is causing something that isn't obvious.

You will also want at least innodb_buffer_pool_size=64M. This might also fix the issue, since we are dealing with large amounts of data. From the docs...

Note that at least the first time running the slurmdbd with MySQL you need to make sure your my.cnf file has innodb_buffer_pool_size equal to at least 64 megabytes. You can accomplish this by adding the line:
innodb_buffer_pool_size=64M
under the [mysqld] reference in the my.cnf file and restarting the mysqld. This is needed when converting large tables over to the new database schema.

While this doesn't spell out this scenario, it could be relevant.

Please add these two options to the my.cnf, restart the daemons, and report back.
This is the modified my.cnf, just for the record:

[root@holy-slurm01 etc]# cat my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
innodb_lock_wait_timeout=900
innodb_buffer_pool_size=64M
!includedir /etc/mysql.d

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

-Paul Edmon-
So we had the memory leak again, as noted in bugs.schedmd.com/show_bug.cgi?id=1534. I don't know whether the my.cnf change caused the issue or just exacerbated an existing problem. I've restarted slurmdbd to see if it happens again.

-Paul Edmon-
Yeah, that definitely exacerbated the memory leak in http://bugs.schedmd.com/show_bug.cgi?id=1534:

top - 12:35:40 up 19 days, 2:49, 5 users, load average: 2.43, 3.14, 3.12
Tasks: 1568 total, 2 running, 1566 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 1.7%sy, 0.0%ni, 97.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264498708k total, 189166220k used, 75332488k free, 662488k buffers
Swap: 4194296k total, 0k used, 4194296k free, 162701464k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
34477 root  20  0  264m 225m 1672 R 99.8  0.1 79:55.18 syslog-ng
28465 slurm 20  0 18.2g  17g 2552 S 65.0  7.0  1:42.46 slurmdbd

slurmdbd is never that big, and it's growing. It's also spewing:

Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56581 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56581 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56588 > -65536)
Apr 1 12:35:36 holy-slurm01 slurmdbd[28465]: error: packmem: Buffer size limit exceeded (-56584 > -65536)

Do you think this is due to us running an older version of mysql (we are on 5.1.69)? What do you recommend we do? Change the my.cnf back to what we had? I've had to restart slurmdbd again to get it to reset itself.

-Paul Edmon-
Created attachment 1792 [details]
don't unload plugin patch

Will you apply this patch and then run the slurmdbd under valgrind again? The patch just prevents the plugin from being unloaded so that we can see where the memory is going.

Will you run it like this?

valgrind --tool=memcheck --xml=yes --xml-file=slurmdbd.mc --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dvv
Sure. I will do so.

-Paul Edmon-
Thanks. The valgrind options will spit out a "slurmdbd..0.mc" file. Will you attach that after it's done?
Will do. I'm rebuilding the rpms now.

-Paul Edmon-
Created attachment 1793 [details]
slurmdbd.mc

Okay, I've started valgrind on slurmdbd. It has spit out the initial .mc file but is still working; here's the initial data:

[root@holy-slurm01 valgrind]# valgrind --tool=memcheck --xml=yes --xml-file=slurmdbd.mc --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dvv
slurmdbd: debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: slurmdbd version 14.11.4 started
slurmdbd: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:8
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:6
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:56875 IP:10.242.104.140 CONN:10
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:57812 IP:10.31.20.139 CONN:6
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:59719 IP:10.242.86.133 CONN:10
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:100663 IP:10.242.104.137 CONN:6
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:59719 IP:10.242.86.133 CONN:10
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:56875 IP:10.242.104.140 CONN:6
slurmdbd: debug: DBD_INIT: CLUSTER:odyssey VERSION:7168 UID:557658 IP:10.242.128.22 CONN:12

I will let you know when it's done.

-Paul Edmon-
This is one more thing to consider for mysql. Setting innodb_buffer_pool_size to 64M is the minimum we've seen needed before. The documentation for the parameter says that "On a dedicated database server, you may set this to up to 80% of the machine physical memory size." Mine was set at 256M on an 8GB machine. You may consider increasing the value.

http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size
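Since innodb_buffer_pool_size is reported in bytes, a one-liner like the following shows the running value in megabytes (a sketch using the @@ system-variable syntax, which MySQL 5.1 supports):

mysql> SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

The 64M setting above would report as 64; the stock 5.1 default of 8388608 bytes would show up as 8.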
Okay, good to know.

I haven't seen slurmdbd leak again; it could be that slowing it down with valgrind is suppressing it somewhat. It also hasn't crashed. That said, we can't keep running it under valgrind forever. I will check its state again in the morning.

For the record, our DB is on the same machine as the master (we found that was best for performance). The machine has 256GB of memory, so we have plenty of room if we want to up the buffer.

-Paul Edmon-
Created attachment 1796 [details]
slurmdbd.mc

Update: it hasn't leaked, but the memory footprint of slurmdbd has doubled (it used to be 300 MB; it is now 700 MB). I've attached the current valgrind results. slurmdbd is still running under valgrind; let me know if you want me to change that.

On a side note, we were planning on upgrading from 14.11.4 to 14.11.5 on Monday. Do you think we still should, or should we wait?

-Paul Edmon-
I don't see any harm in upgrading on Monday. 14.11.6 will probably be released in 2-3 weeks.

Maybe increasing the buffer pool size will give it the room to handle the large number of records.

Could you try running the slurmdbd under valgrind over the weekend?
Sure, can do. We will plan on doing the upgrade then.

-Paul Edmon-
Sounds good.
So I talked with my manager, and we are not going to do a Slurm upgrade until this database issue is resolved. So I'm game for anything else we want to try in terms of fixing this. I'm out tomorrow but will be back on Monday.

-Paul Edmon-
Created attachment 1798 [details] slurmdbd-0.mc

So after talking about it here, we thought that maybe issuing the archive dump caused the memory leak, as we did that shortly after we upgraded to 14.11. So I ran it again:

[root@holy-slurm01 valgrind]# sacctmgr -i archive dump
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sview: error: slurm_receive_msg: Socket timed out on send/recv operation
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

It failed, but while it was running slurmctld slowed to a crawl, and slurmdbd was still running at 100% load after the purge had failed. I had to kill slurmdbd to get it back to normal. I've included the latest valgrind data.

I'm thinking that our DB may just be too big to purge. We may have to get more creative, as it is 36 million rows and 40 GB. Have you ever had anyone purge something this big?

-Paul Edmon-
Created attachment 1799 [details] debug patch 2

We're not aware of anyone purging records of that size. It's possible that it's been done, but they haven't told us.

Questions:
What purge failed? The one initiated with sacctmgr or the one periodically run by the slurmdbd? I'm assuming the sacctmgr one.
Do you see any errors in the slurmdbd log?
How long did you wait before killing the slurmdbd?
Did you kill the slurmdbd or valgrind? I ask because the valgrind output was the same size as the previous one.
Can you tell if the slurmdbd is doing its own archive -- without issuing the sacctmgr command -- (e.g. do you see archive files being created, or log statements)?

I've attached a patch that will print out logs at "info" when the rollup and archiving are happening, e.g.:

slurmdbd: running rollup at Thu Apr 02 14:51:21 2015
slurmdbd: archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: done archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: rollup finished at Thu Apr 02 14:51:21 2015

Let's try this:
1. Apply the debug patch.
2. Run without valgrind. It's probably slowing it down too much.
3. Increase innodb_buffer_pool_size to something large, like 80GB.
4. Set innodb_lock_wait_timeout to 60 minutes.
5. Restart mysql and verify that the settings are in place, e.g.:

MariaDB [(none)]> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 300   |
+--------------------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
1 row in set (0.00 sec)

6. Restart slurmdbd.

Then be patient and see if it can get through all of the records. Maybe you could do this during a system maintenance time so that users don't see the slowness. How does that sound?
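For steps 3-5, a minimal sketch of making the change persistent in my.cnf; the /etc/my.cnf path, the [mysqld] section placement, and the mysqld service name are assumptions about this particular host:

$ cat >> /etc/my.cnf <<'EOF'
[mysqld]
innodb_buffer_pool_size = 80G
innodb_lock_wait_timeout = 3600
EOF
$ service mysqld restart
$ mysql -e "show variables like 'innodb_buffer_pool_size'; show variables like 'innodb_lock_wait_timeout';"

The 3600-second value is the 60 minutes suggested in step 4; the buffer pool can be tuned back down later once the backlog is cleared.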
Actually, the idea of doing this during a system maintenance window is moot, because the dbd will start rolling up at startup. Also, if you can, try not to kill the slurmdbd. The rollup is done in a transaction, and killing it will just throw away everything that was done, which will then have to be redone.
Created attachment 1801 [details] slurmdbd-1.mc

Yeah, sorry about that; I was out on Friday. I've attached the latest valgrind results from running for the past 3 days. For the patch you just sent, should I remove the previous patch? I will get the changes you suggested put in later today.

As to your questions: the purge that failed was both the automatic purge that initiated this ticket and the forced purge I ran via sacctmgr. I waited about 15 minutes before killing the db, as ctld was inoperable and I had to get things back to a stable state. I did a kill -9 of the pid for slurmdbd. It does create the archives for events and suspend; however, it fails when it gets to jobs, I think due to the sheer size of our job table. Our maintenance window is over now, so anything will have to be done on the live system.

-Paul Edmon-
No problem. You can apply the patch on top of the other. I also committed a patch that purges 50,000 records at a time so that the locks can be periodically released. You can apply this as well:

https://github.com/SchedMD/slurm/commit/2f477b42d36b428f048952b053d59aa5f11c69da.patch

Then when you restart, look for the log statements, e.g.:

slurmdbd: running rollup at Thu Apr 02 14:51:21 2015
slurmdbd: archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: done archiving cluster 'compy' at Thu Apr 02 14:51:21 2015
slurmdbd: rollup finished at Thu Apr 02 14:51:21 2015
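A sketch of pulling that commit down and stacking it on the 14.11 tree; the working directory is an assumption, and the --dry-run pass is just a safety check before applying for real:

$ cd slurm-14.11.4
$ curl -LO https://github.com/SchedMD/slurm/commit/2f477b42d36b428f048952b053d59aa5f11c69da.patch
$ patch -p1 --dry-run < 2f477b42d36b428f048952b053d59aa5f11c69da.patch   # verify it applies cleanly
$ patch -p1 < 2f477b42d36b428f048952b053d59aa5f11c69da.patch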
Okay, I will do so. Currently slurmdbd is running stably with the patch. I'm not running it in interactive mode but as a daemon. What debug level do you want me to have it log at?

-Paul Edmon-
If it is running fine, just leave it and report when the rollup finishes. If for some reason it fails, then introduce the new patch.
So it did do the rollup, and it took about 50 minutes, but it did not archive. Will it automatically initiate that, or should I try to force it to archive? I haven't applied the patch you sent yet.

-Paul Edmon-
Will you send your slurmdbd.conf?
# Archive info
ArchiveJobs=yes
ArchiveDir=/slurm/archive
ArchiveSteps=yes
ArchiveResvs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
#ArchiveScript=
PurgeEventAfter=6month
PurgeJobAfter=6month
PurgeResvAfter=6month
PurgeStepAfter=6month
PurgeSuspendAfter=6month

# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2

# slurmDBD info
CommitDelay=1
DbdHost=localhost
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=info
#DefaultQOS=normal,standby
PidFile=/var/slurmd/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes

# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=[censored]
StorageLoc=slurm
More questions:

Do you see that archive files were created? How big are they?
Do you see that records were purged from the database? (e.g. is the db smaller; does sacct return jobs from 6 months ago -- sacct -E 2014-09-01)
Do you see any errors in the slurmdbd log?

I'm looking at some other possibilities that might be a problem with large databases -- particularly when archiving. I would just let it keep running as is in the meantime.
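One quick way to run that purge check from the command line; the -S bound, the -X flag (allocations only), and the format fields are just one reasonable way to eyeball it, not something prescribed here:

$ sacct -S 2014-01-01 -E 2014-09-01 -X --format=JobID,Start,End | head
# if the purge worked, jobs older than the purge window should no longer come back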
We've actually never had a successful purge. The errors at the start of this bug are what happened when the purge tried to run. Nothing has actually been removed from the database, as it never dumped any files but these:

[root@holy-slurm01 ~]# ls -ltr /slurm/archive
total 506224
-rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users    253639 Mar 10 09:17 odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users 176829079 Apr  1 13:01 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59.old
-rw------- 1 slurm slurm_users    298515 Apr  1 13:01 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59.old
-rw------- 1 slurm slurm_users 176829079 Apr  6 14:33 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    298515 Apr  6 14:33 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59

As I noted at the beginning, we've never purged this database, so it is gigantic. It should be manageable once we get it down to a 6-month size. Right now it still has everything in it.

-Paul Edmon-
Ok. Thanks for your patience and help. My hypothesis is that the slurmdbd is failing to archive the steps, which would then short-circuit the rest of the archiving and purging. I'm trying to confirm this.

Do you see any errors in the slurmdbd log file? Specifically,
"Can't save archive, create file"
or
"Error writing file"

Can you upload the slurmdbd log file for when it tried to archive?

Will you show the last run times from the database? e.g.:

$ mysql slurm
mysql> select * from <clustername>_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1428012000 |   1427958000 |     1427871600 |
+---------------+--------------+----------------+

If I can't see anything from the logs, I'll get you another patch with more logging.

Thanks,
Brian
So I tried to force a dump and got this error:

[root@holy-slurm01 ~]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

In the logs I got:

Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59825 > -65536)
Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59819 > -65536)
Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59824 > -65536)
Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59825 > -65536)
Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59816 > -65536)
Apr 8 10:41:38 holy-slurm01 slurmdbd[46432]: error: packmem: Buffer size limit exceeded (-59821 > -65536)

That's the only error or traffic. I was watching mysql and slurmdbd at the time. MySQL ran the query and then sent the result to slurmdbd. Slurmdbd was in the process of swallowing the data from that query, got to about 14 GB of memory, and then failed with this error. I had to stop it to get things back to normal. I will also note that slurmctld was inoperable while I was forcing the dump, so the entire system ground to a halt.

I think the query itself is just too big for slurmdbd to handle. Clearly it is suffering from a buffer overrun, probably because of an integer overflow; all the sizes in slurm should probably be 64 bit if they aren't. So not a memory leak, but rather a buffer overrun. Anyway, the patch that only spits out 50,000 records at a time may not be a bad idea, as it may allow it to output things in more manageable portions.

-Paul Edmon-
Ok. It looks like I need to rewrite the archiving code to write out chunks at a time, like I did for purging. For now, I recommend turning off purging and archiving until I get you the new patch.
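A minimal sketch of disabling those knobs until the new patch lands; the conf path and the slurmdbd init-script name are assumptions about this host:

$ sed -i 's/^\(Purge\|Archive\)/#&/' /etc/slurm/slurmdbd.conf   # comment out every Purge*/Archive* line
$ service slurmdbd restart

Commenting out ArchiveDir along with the rest is harmless while archiving is off; uncomment the block to re-enable later.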
Sure will do.

-Paul Edmon-
Created attachment 1819 [details] archive patch

I've made the changes in 15.08:
https://github.com/SchedMD/slurm/commit/343d55f85dfbf8eeb4a15d9f110b1ec70819c9a1
https://github.com/SchedMD/slurm/commit/73202646e66f597dab89eca42a499c03bd257919

I've attached a patch for 14.11 that consists of the two commits. What the patch does is, if there are records older than 60 days, it will catch up by archiving and purging a month at a time until it is caught up. So you'll see a lot of archive files being generated, which should be a good indication of its progress.

Instead of using "sacctmgr archive dump" to trigger the archive, just set the timestamp in <cluster>_last_ran_table.monthly_rollup in the db to 1 second in the past. Doing it this way will prevent the sacctmgr call and the slurmdbd's rollup from waiting on each other for locks on the tables. For example:

mysql> select monthly_rollup from knc_last_ran_table;
+----------------+
| monthly_rollup |
+----------------+
|     1427871600 |
+----------------+
1 row in set (0.00 sec)

mysql> update knc_last_ran_table set monthly_rollup=1427871599;
Query OK, 0 rows affected (0.05 sec)
Rows matched: 1  Changed: 0  Warnings: 0

mysql> select monthly_rollup from knc_last_ran_table;
+----------------+
| monthly_rollup |
+----------------+
|     1427871599 |
+----------------+
1 row in set (0.00 sec)

mysql> select FROM_UNIXTIME(monthly_rollup) from knc_last_ran_table;
+-------------------------------+
| FROM_UNIXTIME(monthly_rollup) |
+-------------------------------+
| 2015-03-31 23:59:59           |
+-------------------------------+
1 row in set (0.00 sec)

Let me know if you have any questions and how it goes.

Thanks,
Brian
Thanks for the patch. I will apply it and see what happens.

-Paul Edmon-
I'm getting the following error when I use rpmbuild:

+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd slurm-14.11.4
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (1576.patch):'
Patch #0 (1576.patch):
+ /bin/cat /root/rpmbuild/SOURCES/1576.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/slurmdbd/slurmdbd.c
+ echo 'Patch #1 (1576.patch2):'
Patch #1 (1576.patch2):
+ /bin/cat /root/rpmbuild/SOURCES/1576.patch2
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
patching file src/slurmdbd/slurmdbd.c
+ echo 'Patch #2 (archive_1411.patch):'
Patch #2 (archive_1411.patch):
+ /bin/cat /root/rpmbuild/SOURCES/archive_1411.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
Hunk #1 FAILED at 50.
Hunk #2 succeeded at 414 (offset -3 lines).
Hunk #3 succeeded at 1449 (offset -3 lines).
Hunk #4 succeeded at 1464 (offset -3 lines).
Hunk #5 succeeded at 1480 (offset -3 lines).
Hunk #6 succeeded at 1574 (offset -3 lines).
Hunk #7 succeeded at 1589 (offset -3 lines).
Hunk #8 succeeded at 1632 (offset -3 lines).
Hunk #9 succeeded at 1710 (offset -3 lines).
Hunk #10 succeeded at 1725 (offset -3 lines).
Hunk #11 succeeded at 1742 (offset -3 lines).
Hunk #12 succeeded at 1793 (offset -3 lines).
Hunk #13 succeeded at 1808 (offset -3 lines).
Hunk #14 succeeded at 1862 (offset -3 lines).
Hunk #15 succeeded at 1951 (offset -3 lines).
Hunk #16 succeeded at 2059 (offset -3 lines).
Hunk #17 succeeded at 2074 (offset -3 lines).
Hunk #18 FAILED at 2121.
2 out of 18 hunks FAILED -- saving rejects to file src/plugins/accounting_storage/mysql/as_mysql_archive.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.tkM0lZ (%prep)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.tkM0lZ (%prep)

-Paul Edmon-
1576.patch2 and the archive_1411.patch have different code paths now. Will you remove 1576.patch and 1576.patch2? You're ok running without them now.
Still choking:

+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd slurm-14.11.4
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (archive_1411.patch):'
Patch #0 (archive_1411.patch):
+ /bin/cat /root/rpmbuild/SOURCES/archive_1411.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file src/plugins/accounting_storage/mysql/as_mysql_archive.c
Hunk #1 FAILED at 50.
Hunk #2 succeeded at 414 (offset -3 lines).
Hunk #3 succeeded at 1449 (offset -3 lines).
Hunk #4 succeeded at 1464 (offset -3 lines).
Hunk #5 succeeded at 1480 (offset -3 lines).
Hunk #6 succeeded at 1574 (offset -3 lines).
Hunk #7 succeeded at 1589 (offset -3 lines).
Hunk #8 succeeded at 1632 (offset -3 lines).
Hunk #9 succeeded at 1710 (offset -3 lines).
Hunk #10 succeeded at 1725 (offset -3 lines).
Hunk #11 succeeded at 1742 (offset -3 lines).
Hunk #12 succeeded at 1793 (offset -3 lines).
Hunk #13 succeeded at 1808 (offset -3 lines).
Hunk #14 succeeded at 1862 (offset -3 lines).
Hunk #15 succeeded at 1951 (offset -3 lines).
Hunk #16 succeeded at 2059 (offset -3 lines).
Hunk #17 succeeded at 2074 (offset -3 lines).
Hunk #18 FAILED at 2121.
2 out of 18 hunks FAILED -- saving rejects to file src/plugins/accounting_storage/mysql/as_mysql_archive.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.GSgL1E (%prep)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.GSgL1E (%prep)

-Paul Edmon-
Created attachment 1833 [details] purge patch

Ok, figured it out. It's expecting the previous purging patch. Will you apply this one before the archive_1411.patch?
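To sanity-check the stacking order before rebuilding the RPM, a quick dry run; the source path matches the build log above, but the file name used here for attachment 1833 (purge.patch) is an assumption:

$ cd slurm-14.11.4
$ patch -p1 --dry-run < /root/rpmbuild/SOURCES/purge.patch          # attachment 1833 first
$ patch -p1 --dry-run < /root/rpmbuild/SOURCES/archive_1411.patch   # then the archive patch on top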
Ah, yeah that would make sense. I never applied that patch. Looks to be building fine now.

-Paul Edmon-
So it looks like it did start the archive:

[root@holy-slurm01 x86_64]# ls -latr /slurm/archive
total 364084
drwxr-xr-x 8 root  root             4096 Mar  5 10:32 ..
-rw------- 1 slurm slurm_users 176829079 Apr 20 15:13 odyssey_event_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    298515 Apr 20 15:13 odyssey_suspend_archive_2013-08-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users    126976 Apr 20 15:24 odyssey_event_archive_2013-08-01T00:00:00_2013-09-01T00:00:00
-rw------- 1 slurm slurm_users    635274 Apr 20 15:25 odyssey_event_archive_2013-09-01T00:00:00_2013-10-01T00:00:00
-rw------- 1 slurm slurm_users    645999 Apr 20 15:25 odyssey_event_archive_2013-10-01T00:00:00_2013-11-01T00:00:00
-rw------- 1 slurm slurm_users    487534 Apr 20 15:25 odyssey_event_archive_2013-11-01T00:00:00_2013-12-01T00:00:00
-rw------- 1 slurm slurm_users    977130 Apr 20 15:25 odyssey_event_archive_2013-12-01T00:00:00_2014-01-01T00:00:00
-rw------- 1 slurm slurm_users   1511138 Apr 20 15:25 odyssey_event_archive_2014-01-01T00:00:00_2014-02-01T00:00:00
-rw------- 1 slurm slurm_users   7746166 Apr 20 15:25 odyssey_event_archive_2014-02-01T00:00:00_2014-03-01T00:00:00
-rw------- 1 slurm slurm_users  41228052 Apr 20 15:26 odyssey_event_archive_2014-03-01T00:00:00_2014-04-01T00:00:00
-rw------- 1 slurm slurm_users  22925154 Apr 20 15:26 odyssey_event_archive_2014-04-01T00:00:00_2014-05-01T00:00:00
-rw------- 1 slurm slurm_users  46937755 Apr 20 15:26 odyssey_event_archive_2014-05-01T00:00:00_2014-06-01T00:00:00
-rw------- 1 slurm slurm_users  24678369 Apr 20 15:26 odyssey_event_archive_2014-06-01T00:00:00_2014-07-01T00:00:00
-rw------- 1 slurm slurm_users   5824873 Apr 20 15:26 odyssey_event_archive_2014-07-01T00:00:00_2014-08-01T00:00:00
-rw------- 1 slurm slurm_users  10123392 Apr 20 15:27 odyssey_event_archive_2014-08-01T00:00:00_2014-09-01T00:00:00
-rw------- 1 slurm slurm_users  12981631 Apr 20 15:27 odyssey_event_archive_2014-09-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users     68775 Apr 20 15:27 odyssey_suspend_archive_2013-08-01T00:00:00_2013-09-01T00:00:00
-rw------- 1 slurm slurm_users     22629 Apr 20 15:27 odyssey_suspend_archive_2013-09-01T00:00:00_2013-10-01T00:00:00
-rw------- 1 slurm slurm_users     67623 Apr 20 15:27 odyssey_suspend_archive_2013-10-01T00:00:00_2013-11-01T00:00:00
-rw------- 1 slurm slurm_users       895 Apr 20 15:27 odyssey_suspend_archive_2013-11-01T00:00:00_2013-12-01T00:00:00
-rw------- 1 slurm slurm_users     13186 Apr 20 15:27 odyssey_suspend_archive_2014-01-01T00:00:00_2014-02-01T00:00:00
-rw------- 1 slurm slurm_users     67615 Apr 20 15:27 odyssey_suspend_archive_2014-02-01T00:00:00_2014-03-01T00:00:00
-rw------- 1 slurm slurm_users      9028 Apr 20 15:27 odyssey_suspend_archive_2014-04-01T00:00:00_2014-05-01T00:00:00
-rw------- 1 slurm slurm_users       964 Apr 20 15:27 odyssey_suspend_archive_2014-06-01T00:00:00_2014-07-01T00:00:00
-rw------- 1 slurm slurm_users      1848 Apr 20 15:27 odyssey_suspend_archive_2014-07-01T00:00:00_2014-08-01T00:00:00
-rw------- 1 slurm slurm_users      1328 Apr 20 15:27 odyssey_suspend_archive_2014-08-01T00:00:00_2014-09-01T00:00:00
-rw------- 1 slurm slurm_users     44904 Apr 20 15:27 odyssey_suspend_archive_2014-09-01T00:00:00_2014-09-30T23:59:59
-rw------- 1 slurm slurm_users  18489762 Apr 20 15:28 odyssey_step_archive_1969-12-01T00:00:00_1970-01-01T00:00:00
drwxr-xr-x 2 slurm slurm_users      4096 Apr 20 15:28 .

However, the ctld locked up again while it was doing the archive. We can't have this happen, as we have a very active cluster.
I let it go for about 20 minutes hoping it would finish or return, but no go. I eventually had to kill slurmdbd and restart it with archiving turned off.

For an order of operations:
1. Patched slurm
2. Installed slurmdbd and slurm-sql
3. Restarted slurmdbd
4. Updated slurmdbd.conf to permit purging
5. Reconfigured to pull in the new conf
6. Updated the mysql rollup
7. Ran a sacctmgr reconfigure

The last command is what caused the lockup. Is there any way to do this without locking up the ctld? Even an active forced purge shouldn't do this unless it is looking at active data. Anyway, we can't go ahead with the purge if it does this, or at least we can't do it without an organized downtime.

Do you happen to know how long the purge would take, or at least a ballpark (i.e. this much time per million rows)? Is there a way to see where it is in the purge as it rolls? I have no idea where it is when it runs, and I can't give my bosses an accurate estimate of how long it will take without a progress bar or some diagnostic.

-Paul Edmon-
For me, archiving and purging 86520 records (1 month's worth of records for me) took -- in microseconds:

event:   ~900000   (~6M file)
suspend: ~2200000  (~4M file)
step:    ~7000000  (~26M file)
job:     ~19300000 (~20M file)
resv:    ~2400000  (~6M file)

The code archives and purges one month at a time and goes in the order: event, suspend, step, job, resv. So from looking at your archive files, it got through events and suspends and stopped at steps.

What was the last reconfigure for?

What does the following query give you?
select count(*) from odyssey_step_table where time_start <= UNIX_TIMESTAMP('2013-08-01 00:00:00');

I'm just curious, because the step archive started at 1969 and the others started at 2013-08-01.

I'm trying to reproduce the lock. I'll let you know what I find.
So 86,000 is roughly 2 days' worth for us. Given how many jobs we have, it would take roughly 2 hours to do the purge, as we would be purging roughly 35 million jobs. It never got to the steps before, so I think we are heading in the right direction, though it spitting out 1970 is confusing. I'm guessing the time was set to a unix time of 0, which means it was uninitialized? Or perhaps corrupt?

The last reconfigure was to force it to run the archive. When I made the change in mysql, it didn't actually do anything until after I ran the reconfigure. Was that incorrect? Do I just need to be more patient?

mysql> select count(*) from odyssey_step_table where time_start <=
    -> UNIX_TIMESTAMP('2013-08-01 00:00:00');
+----------+
| count(*) |
+----------+
|    49065 |
+----------+
1 row in set (42.41 sec)

-Paul Edmon-
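Paul's 2-hour figure is consistent with Brian's job-table rate; a rough back-of-envelope, assuming the ~19.3 s per 86,520-record batch holds at this scale:

$ echo "$(( 35000000 / 86520 )) batches"
404 batches
$ echo "$(( 404 * 193 / 10 / 60 )) minutes for the job table alone"
129 minutes for the job table alone

That covers the job table only; the step table would add its own time on top of that.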
Let's modify the game plan a little bit. Let's try "walking" the purges up until we get caught up. So instead of purging everything older than 6 months, let's purge everything older than 20 months. Then once that is done, we'll change it to 19 months, and so on. This will reduce the amount of time and work the dbd has to do in one sitting.

I would also recommend doing jobs, steps, and reservations separately. We saw that events and suspends went pretty quickly, so I think you could leave them at 6 months and do them in one fell swoop. I think that reservations will be quick too, but just to be safe I would do them separately.

So, for example, the first time running you would have:

PurgeEventAfter=6month
PurgeSuspendAfter=6month
#PurgeResvAfter=6month
#PurgeJobAfter=20month
#PurgeStepAfter=20month

Once that is finished, turn on reservations:

PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
#PurgeJobAfter=20month
#PurgeStepAfter=20month

Once reservations are finished, turn on jobs:

PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
PurgeJobAfter=20month
#PurgeStepAfter=20month

Then decrement the month count on jobs:

PurgeEventAfter=6month
PurgeSuspendAfter=6month
PurgeResvAfter=6month
PurgeJobAfter=19month
#PurgeStepAfter=20month

and so on...

The rollup will happen without running "reconfigure". The rollup is triggered every hour, or when you restart the slurmdbd. You could stop the dbd, make the config changes, set the month's last rollup time, then restart, and repeat for each config change. Let me know if you have any questions.
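One way the later "decrement jobs month by month" phase could be scripted, as a minimal sketch; the conf path, log path, and init-script name are assumptions, and the 'rollup finished' message comes from the debug patch earlier in this ticket:

$ for m in $(seq 20 -1 6); do
>   sed -i "s/^PurgeJobAfter=.*/PurgeJobAfter=${m}month/" /etc/slurm/slurmdbd.conf
>   service slurmdbd restart
>   # block until this pass completes before stepping the window down again
>   tail -n0 -F /var/log/slurmdbd.log | grep -m1 'rollup finished'
> done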
For the time_start = 0 steps, what do the following queries give you?

select count(*) from odyssey_step_table where time_start = 0;

select t1.time_submit from odyssey_job_table t1, odyssey_step_table t2
where t1.job_db_inx = t2.job_db_inx and t2.time_start = 0 LIMIT 1;
The most likely explanation for the time_start=0 scenario is that when the slurmdbd is down, the slurmctld queues up the records to send to the dbd when it comes back up. The slurmctld will eventually start throwing away step start records. So what could have happened is that a start record was thrown away, but the step end record was sent to the dbd when it came back up.
Ah, that would make sense. I like the plan of a walking purge; I had been considering that myself as a possible solution. I will work on that tomorrow and get you the information you requested.

I was not aware that the dumping behavior happened on the hour. It would be good to document this somewhere. It would also be good if the slurmdbd could be a bit more talkative when it is doing a purge, dump, or rollup. We see it doing things, but we aren't sure what it is up to, how long it will take, or where it is. Any info (even basic info) would be more comforting than none. Or perhaps we don't have the verbosity on the logging turned up far enough.

-Paul Edmon-
mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    49393 |
+----------+
1 row in set (20.68 sec)

mysql> select t1.time_submit from odyssey_job_table t1, odyssey_step_table t2 where
    -> t1.job_db_inx = t2.job_db_inx and t2.time_start = 0 LIMIT 1;
+-------------+
| time_submit |
+-------------+
|  1377014448 |
+-------------+
1 row in set (0.17 sec)

I'm going to try to do the 20 month purge now. If successful I will walk down month by month till I hit 6 months.

-Paul Edmon-
Yeah, no go. It locked up again, even when I let it just run at the hour mark, only purging things older than 20 months. The mysql process list at the time was:

mysql> show processlist;
| Id     | User        | Host                                 | db    | Command     | Time    | State | Info |
| 12     | replication | db-internal.rc.fas.harvard.edu:49224 | NULL  | Binlog Dump | 1203386 | Has sent all binlog to slave; waiting for binlog to be updated | NULL |
| 289759 | slurm       | localhost                            | slurm | Sleep       | 1077    |                | NULL |
| 289761 | slurm       | localhost                            | slurm | Query       | 586     | update         | insert into "odyssey_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nod |
| 289805 | slurm       | localhost                            | slurm | Query       | 14      | Sorting result | select job_db_inx, id_step, time_start, time_end, time_suspended, step_name, nodelist, node_inx, sta |
| 289861 | root        | localhost                            | NULL  | Query       | 0       | NULL           | show processlist |
5 rows in set (0.00 sec)

I've restarted the slurmdbd without the purge. It seems to get hung up on the job table.

-Paul Edmon-
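When it wedges like this, the InnoDB transaction list will usually name the statement that holds the lock everyone else is waiting on; a sketch using only the stock MySQL client, nothing here assumed beyond that:

$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/^TRANSACTIONS/,/^FILE I\/O/p'
# look for transactions in "LOCK WAIT" state and the query text attached to each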
Shoot. Sorry. I'll take another look.
Created attachment 1837 [details] deadlock patch

Ok. We actually found a lock yesterday while trying to reproduce the deadlock; I mistakenly thought it didn't apply to you. In testing today, when I don't have the patch in, I can get the controller to lock up while archiving/purging is happening.

Will you apply this patch (attached) and try again?

The commit:
https://github.com/SchedMD/slurm/commit/59b9c909b61e4543909255bf9191ac29a7e6af44
Sure will do tomorrow. Thanks.

-Paul Edmon-
Patch applied. Looks like it is running the archive without deadlocking now. I'll keep you updated.

-Paul Edmon-
Yeah, it still deadlocks, even after the patch. Watching mysql I see:

mysql> show processlist;
| Id     | User        | Host                                 | db    | Command     | Time    | State | Info |
| 12     | replication | db-internal.rc.fas.harvard.edu:49224 | NULL  | Binlog Dump | 1290759 | Has sent all binlog to slave; waiting for binlog to be updated | NULL |
| 300314 | slurm       | localhost                            | slurm | Sleep       | 624     |                | NULL |
| 300315 | slurm       | localhost                            | slurm | Query       | 22      | Sorting result | select job_db_inx, id_step, time_start, time_end, time_suspended, step_name, nodelist, node_inx, sta |
| 300321 | slurm       | localhost                            | slurm | Query       | 357     | update         | insert into "odyssey_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nod |
| 300397 | root        | localhost                            | NULL  | Query       | 0       | NULL           | show processlist |
5 rows in set (0.00 sec)

It actually chugs along fine without deadlocking for a bit, but once it hits around this query it stalls out. It seems to happen when a request to modify the DB comes in. I will also note that the select job_db query keeps looping: it runs for about 30 seconds and then starts again, though I can't tell if it's the same query happening over and over. Anyway, the deadlock in the ctld is still a problem even after the patch. Let me know if you want me to try anything on my end.

-Paul Edmon-
Ok. I'll keep looking. Thanks for your help.

How long would you say it took before deadlocking? How many archive files got created?
So it made the 1970 step file and that's it. It got up through creating the step files, and I saw it doing a delete on the step parts in mysql as well. It seems to be when it hits the job_db query that it deadlocks. So a few minutes or so.

-Paul Edmon-
Did you happen to restart the slurmctld before running the last test? The last patch was for the controller. Sorry I didn't point that out if you didn't.
I did, as I figured that was what the patch was for.

-Paul Edmon-
Ok. Will you run the test again, and when the controller locks up, will you get gdb backtraces from the controller and the dbd?

gdb attach <slurmctld_pid>
thread apply all bt
quit

gdb attach <slurmdbd_pid>
thread apply all bt
quit

Thanks,
Brian
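Equivalently, non-interactively, so the traces land in files you can attach to the ticket; the pidof process names are assumptions about how the daemons show up on this host:

$ gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' > slurmctld.bt 2>&1
$ gdb -p $(pidof slurmdbd) -batch -ex 'thread apply all bt' > slurmdbd.bt 2>&1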
Sure. I'm also going to change my compiler flags to remove optimization so that things don't get optimized out.

-Paul Edmon-
Good idea. Thanks.
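For an autotools build like slurm's, the usual way to drop optimization while keeping debug symbols is to pass the flags at configure time; a sketch, with the install prefix and any other options left as whatever the existing build uses:

$ ./configure CFLAGS='-g -O0'
$ make -j8 && make install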
This time slurmctld didn't crash. However, slurmdbd did:

Apr 23 14:01:27 holy-slurm01 slurmdbd[2511]: slurmdbd version 14.11.4 started
Apr 23 14:01:27 holy-slurm01 slurmctld[55664]: Registering slurmctld at port 6820 with slurmdbd.
Apr 23 14:03:06 holy-slurm01 slurmctld[55664]: slurmdbd: agent queue size 50
Apr 23 14:34:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1472
Apr 23 14:34:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
Apr 23 14:49:31 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1407
Apr 23 15:04:32 holy-slurm01 slurmdbd[2511]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
Apr 23 15:04:32 holy-slurm01 slurmdbd[2511]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: error: slurmdbd: Getting response to message type 1472
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
Apr 23 15:04:32 holy-slurm01 slurmctld[55664]: slurmdbd: reopening connection
Apr 23 15:04:37 holy-slurm01 slurmctld[55664]: error: slurmdbd: Sending message type 1407: 11: Connection refused
Apr 23 15:04:54 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Apr 23 15:05:00 holy-slurm01 slurmctld[55664]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

I'm going to turn off the purge again. Let me know if you want any other info.

-Paul Edmon-
How long did it go for?
Did you see new archive files being created?
What is your current innodb_lock_wait_timeout set to?

> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 3600  |
+--------------------------+-------+
1 row in set (0.00 sec)

Are you just purging jobs? Just confirming. I would also suggest doing steps first so that it doesn't leave dangling steps (changing what I said earlier).
Given the log, it went for about a half hour before throwing the first error (14:34), and then the fatal error came at 15:04, which is just over an hour in. The only file it produced was:

-rw------- 1 slurm slurm_users 18934293 Apr 23 14:02 odyssey_step_archive_1969-12-01T00:00:00_1970-01-01T00:00:00

The rest in the directory are from previous runs.

mysql> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 3600  |
+--------------------------+-------+
1 row in set (0.00 sec)

Right now I am running an archive and purge, namely:

ArchiveJobs=yes
ArchiveDir=/slurm/archive
ArchiveSteps=yes
ArchiveResvs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
#ArchiveScript=
PurgeEventAfter=20month
PurgeJobAfter=20month
PurgeResvAfter=20month
PurgeStepAfter=20month
PurgeSuspendAfter=20month

I can turn off purging for everything else if we just want to dump one at a time. That may be wise at this point.

-Paul Edmon-
Yeah, I think purging one at a time is a good idea. Events, Suspends, and Resvs will probably go quickly.
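[Editor's note: a staged slurmdbd.conf for that one-table-at-a-time approach might look like the following sketch, keeping the archive settings from the config above and enabling only one Purge*After option per pass; the commented-out lines are the later passes:

ArchiveDir=/slurm/archive
ArchiveEvents=yes
ArchiveResvs=yes
ArchiveSuspend=yes
ArchiveSteps=yes
ArchiveJobs=yes
PurgeEventAfter=20month
#PurgeResvAfter=20month
#PurgeSuspendAfter=20month
#PurgeStepAfter=20month
#PurgeJobAfter=20month

After each pass completes, the next Purge*After line gets uncommented and slurmdbd restarted.]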
Hold off on purging jobs and steps for a minute -- in case you were about to. I found an issue with the time_start = 0 jobs: the code is trying to walk up from 1969, doing selects for each month, which is taking a long time. I'll get back to you with a solution. brb.
OK. Will you run these commands in mysql? They set the start time of the steps with time_start=0 to the start time of the job. I did it in a transaction just to make sure that all of the time_start=0 steps are being updated. First see if there are any jobs that have a time_start of 0 as well; if there are, then update the job_table first. Let me know if you have any questions.

> start transaction;
Query OK, 0 rows affected (0.00 sec)

> select count(*) from compy_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.12 sec)

> UPDATE compy_job_table SET time_start = time_end WHERE time_start = 0;
Query OK, 30 rows affected (4.01 sec)
Rows matched: 30  Changed: 30  Warnings: 0

MariaDB [slurm_acct_test3]> select count(*) from compy_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.18 sec)

Then update the steps:

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|       31 |
+----------+
1 row in set (1.47 sec)

> UPDATE compy_step_table st, compy_job_table jt SET st.time_start = jt.time_start WHERE st.time_start = 0 AND st.job_db_inx = jt.job_db_inx;
Query OK, 31 rows affected (3.65 sec)
Rows matched: 31  Changed: 31  Warnings: 0

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (3.23 sec)

> commit;
Query OK, 0 rows affected (0.00 sec)
For logging and seeing progress, you can add the DB_USAGE debug flag and set the debug level to "debug", and you'll see messages like these:

[2015-04-23T15:04:38.671] debug: Purging job entries before 1422777600 for compy
[2015-04-23T15:04:58.486] debug: Purging job entries before 1425196799 for compy
[2015-04-23T15:05:17.311] debug: Purging resv entries before 1409554800 for compy
[2015-04-23T15:05:18.136] debug: Purging resv entries before 1412146800 for compy
[2015-04-23T15:05:19.472] debug: Purging resv entries before 1414825200 for compy

slurmdbd.conf:
DebugFlags=DB_USAGE
DebugLevel=debug
I had a hunch that it was walking month by month from 1970 to today; I was actually going to try to confirm that today until I saw your note. No wonder it was taking so long. Here are the results of the commands you gave me:

mysql> use slurm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> start transaction;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from odyssey_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|  4110229 |
+----------+
1 row in set (8.63 sec)

mysql> UPDATE odyssey_job_table SET time_start = time_end WHERE time_start = 0;
Query OK, 4097744 rows affected (4 min 9.55 sec)
Rows matched: 4110255  Changed: 4097744  Warnings: 0

mysql> select count(*) from odyssey_job_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    12485 |
+----------+
1 row in set (8.60 sec)

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|    50181 |
+----------+
1 row in set (18.75 sec)

mysql> UPDATE odyssey_step_table st, odyssey_job_table jt SET st.time_start = jt.time_start WHERE st.time_start = 0 AND st.job_db_inx = jt.job_db_inx;
Query OK, 50043 rows affected (6 min 20.14 sec)
Rows matched: 50181  Changed: 50043  Warnings: 0

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      138 |
+----------+
1 row in set (19.23 sec)

mysql> commit;
Query OK, 0 rows affected (0.20 sec)

mysql> quit
Bye

So it looks like there are still quite a few entries with 0 time; these commands didn't get them all in our environment. Entries with 0 time are clearly bogus, so is there a way to make the purge just ignore them when calculating where to start doing monthly steps? Namely, automatically take out all the ones with 0 time, then start the monthly purge from the next non-zero entry.

I suppose a way I could get around it for now would be to set the purge time to 544 months, have it purge all the 0 records, and then set it back to something more sane. Another option would be, for jobs where Slurm doesn't know the start date but does have an entry, to set the start date to the time the entry was made. Either that, or increase the buffer that holds transactions while the database is down, until it comes back.

I turned on the additional database debugging flags you suggested. I would rather have too much data than insufficient. Let me know how you want to proceed.
-Paul Edmon-

On 04/23/2015 07:05 PM, bugs@schedmd.com wrote (Comment #71 on bug 1576 from Brian Christiansen):
> For logging and seeing progress, you can add the DB_USAGE debug flag and set the debug level to "debug" [...]
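[Editor's note: one plausible explanation for the rows the UPDATE left behind is that their time_end is also 0, i.e. jobs that had neither started nor finished as far as the dbd knew when the update ran -- consistent with Brian's comment below that time_start=0 jobs simply haven't started. A quick check, sketched against the table names above; the grouping on state is illustrative, just to see which states those rows are in:

select state, count(*) from odyssey_job_table where time_start = 0 and time_end = 0 group by state;
select state, count(*) from odyssey_step_table where time_start = 0 and time_end = 0 group by state;]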
I'll get you a patch that skips the months in between that don't have any records. I'm testing it this morning.
Created attachment 1840 [details]
archive patch 2

I've attached the additional patch that skips the months without records. The commit:
https://github.com/SchedMD/slurm/commit/8c1a151277302f92e38a25264136c0170b3677bc

So jobs with a time_start of 0 are OK; it means the jobs haven't started yet. As for the steps, I'm guessing that they are dangling steps (which shouldn't happen), since they didn't get updated by the queries. Maybe they got added when accounting was first turned on?

You can verify whether they are dangling by running these queries:

> select count(*) from compy_step_table st left join compy_job_table jt on st.job_db_inx = jt.job_db_inx where jt.job_db_inx is null;
+----------+
| count(*) |
+----------+
|      549 |
+----------+
1 row in set (0.16 sec)

> select count(*) from compy_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      549 |
+----------+
1 row in set (0.05 sec)

If they are dangling, they can just be deleted, or they'll simply get purged out.

Let me know if you have any questions.

Thanks,
Brian
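[Editor's note: if the first query does turn up dangling steps, the deletion Brian mentions could be done with a multi-table DELETE mirroring that LEFT JOIN -- a sketch only, reusing the example compy table names, and worth wrapping in a transaction so the count can be re-checked before committing:

> start transaction;
> DELETE st FROM compy_step_table st LEFT JOIN compy_job_table jt ON st.job_db_inx = jt.job_db_inx WHERE jt.job_db_inx IS NULL;
> select count(*) from compy_step_table st left join compy_job_table jt on st.job_db_inx = jt.job_db_inx where jt.job_db_inx is null;
> commit;

The second select should return 0 before the commit.]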
Thanks for the patch. I will try it out. Here are the query results:

mysql> select count(*) from odyssey_step_table st left join odyssey_job_table jt on st.job_db_inx = jt.job_db_inx where jt.job_db_inx is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (1 min 2.04 sec)

mysql> select count(*) from odyssey_step_table where time_start = 0;
+----------+
| count(*) |
+----------+
|      174 |
+----------+
1 row in set (21.40 sec)

There have been periods in the past when we had the ctld running without the dbd, particularly when we have done major version updates. In those cases the ctld could be running for an hour or two without the dbd.

-Paul Edmon-

On 04/24/2015 02:16 PM, bugs@schedmd.com wrote (Comment #74 on bug 1576 from Brian Christiansen):
> Created attachment 1840 [details]
> archive patch 2 [...]
Interesting. Looks like there aren't any dangling steps. Will you run this query?

select jt.*, ":", st.* from compy_job_table jt, compy_step_table st where st.job_db_inx = jt.job_db_inx and st.time_start = 0 LIMIT 1 \G;
So I successfully ran a partial purge: resv, suspend, and event are purged down to 6 months, and step down to 19 months. I'm now trying 12 months of step. If that's successful I will keep walking it down to 6 months and then try to do the jobs. Here is the query:

mysql> select jt.*, ":", st.* from odyssey_job_table jt, odyssey_step_table st where st.job_db_inx = jt.job_db_inx and st.time_start = 0 LIMIT 1 \G;
*************************** 1. row ***************************
        job_db_inx: 42138008
          mod_time: 0
           deleted: 0
           account: davis_lab
    array_task_str: NULL
   array_max_tasks: 0
array_task_pending: 0
          cpus_req: 1
        cpus_alloc: 1
        derived_ec: 0
        derived_es: NULL
         exit_code: 0
          job_name: shift.csh
          id_assoc: 4146
      id_array_job: 0
     id_array_task: 4294967294
          id_block: NULL
            id_job: 38784352
            id_qos: 1
           id_resv: 0
          id_wckey: 0
           id_user: 59769
          id_group: 10209
       kill_requid: -1
           mem_req: 16000
          nodelist: holy2a11207
       nodes_alloc: 1
          node_inx: 318
         partition: serial_requeue
          priority: 15696570
             state: 1024
         timelimit: 1440
       time_submit: 1429881377
     time_eligible: 1429881387
        time_start: 1429883645
          time_end: 1429902954
    time_suspended: 0
          gres_req:
        gres_alloc:
         gres_used:
             wckey:
       track_steps: 0
                :: :
        job_db_inx: 42138008
           deleted: 0
        cpus_alloc: 1
         exit_code: 15
           id_step: -2
       kill_requid: -1
          nodelist: holy2a11207
       nodes_alloc: 1
          node_inx: 318
             state: 4
         step_name: batch
          task_cnt: 1
         task_dist: 0
        time_start: 0
          time_end: 1429903409
    time_suspended: 0
          user_sec: 8027
         user_usec: 5999
           sys_sec: 5834
          sys_usec: 2999
         max_pages: 31
    max_pages_task: 0
    max_pages_node: 0
         ave_pages: 31
           max_rss: 5646544
      max_rss_task: 0
      max_rss_node: 0
           ave_rss: 3235840
         max_vsize: 6639576
    max_vsize_task: 0
    max_vsize_node: 0
         ave_vsize: 3706388
           min_cpu: 13862
      min_cpu_task: 0
      min_cpu_node: 0
           ave_cpu: 13862
       act_cpufreq: 130330
   consumed_energy: 0
       req_cpufreq: 0
     max_disk_read: 324573.793987
max_disk_read_task: 0
max_disk_read_node: 0
     ave_disk_read: 324573.793987
    max_disk_write: 62103.008348
max_disk_write_task: 0
max_disk_write_node: 0
    ave_disk_write: 62103.008348
1 row in set (2 min 44.12 sec)

ERROR: No query specified

-Paul Edmon-

On 04/24/2015 03:18 PM, bugs@schedmd.com wrote (Comment #76 on bug 1576 from Brian Christiansen):
> Interesting. Looks like there aren't any dangling steps. Will you run this query? [...]
Hey Paul, how did the step archiving and purging go?
So I'm still working through the steps; I'm at 18 months. The slurmctld still stalls out occasionally. I think it happens when someone uses an scontrol command to modify a job while the dump is running; it seems to pause when the ctld logs the "updating accounting" statement. Given that it still stalls, I'm going to do some more tonight, off hours, so that it's less impactful to users. Hopefully I will be able to get the steps down to 6 months tonight; then I can start trying the jobs section and see what happens. Regardless, this is more progress than we've made so far. Hopefully I can finish whittling this down to what we wanted. Each month of steps seems to take about 5-10 minutes to archive and purge.

-Paul Edmon-

On 4/27/2015 12:14 PM, bugs@schedmd.com wrote (Comment #78 on bug 1576 from Brian Christiansen):
> Hey Paul, how did the step archiving and purging go?
Created attachment 1843 [details]
archive patch 3

Glad we're making progress. Your suspicions about the update causing the deadlock are probably correct. We made several other possible deadlock fixes in 15.08 last week, based on the scenario from the other bug, and this was one of them. It would happen if you submit a job and then quickly try to modify it while the db is busy archiving/purging. I've attached a patch that will fix this, if you want to put it in.

Thanks,
Brian
Yup, that's exactly the scenario. We have a probe job that we submit every 5 minutes and then modify to the top priority. Thanks for the patch. I will try it out.

-Paul Edmon-

On 4/27/2015 1:03 PM, bugs@schedmd.com wrote (Comment #80 on bug 1576 from Brian Christiansen):
> Created attachment 1843 [details]
> archive patch 3 [...]
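[Editor's note: the triggering sequence is roughly the following -- a hypothetical sketch, since the actual probe script and priority value aren't shown in the thread:

# Submit the probe job (probe.sh is a placeholder name).
sbatch probe.sh
# Immediately bump it to the top priority; if slurmdbd is mid-archive,
# this modification is what raced against the purge transaction.
scontrol update JobId=<jobid> Priority=100000000]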
The new patch is looking good. I've just about purged down to 6 months of step data. Assuming that completes okay, I will start on the job data. Hopefully we have this sorted at this point, but I won't declare victory until we are done, as the job data is the bulk of the data.

-Paul Edmon-

On 4/27/2015 1:03 PM, bugs@schedmd.com wrote (Comment #80 on bug 1576 from Brian Christiansen):
> Created attachment 1843 [details]
> archive patch 3 [...]
So, an update: the step purge was successful, and I did an initial purge down to 18 months for jobs. Interestingly, the job dump was different from the other purges. Instead of doing it per month, it did a purge of two months:

-rw------- 1 slurm slurm_users 621864632 Apr 28 09:49 odyssey_job_archive_2013-08-01T00:00:00_2013-09-30T23:59:59

That's fine, but it is different from the other archives, which are per month. I will keep purging down to 6 months, and hopefully that will be that.

-Paul Edmon-

On 04/27/2015 12:14 PM, bugs@schedmd.com wrote (Comment #78 on bug 1576 from Brian Christiansen):
> Hey Paul, how did the step archiving and purging go?
Never mind, it is now dumping per month. Interestingly, though, that first dump covered two months.

-Paul Edmon-

On 04/28/2015 10:16 AM, Paul Edmon wrote:
> So, an update: the step purge was successful, and I did an initial purge down to 18 months for jobs. [...]
That makes sense. The code only walks up by month if the oldest record is more than 60 days older than the purge boundary. So the oldest record probably wasn't 60 days older than 20 months -- or whatever number you had -- and it just did all it could in one pass.
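[Editor's note: to see where that month-by-month walk would now start, one can check the oldest surviving record directly -- a sketch using the example compy name and filtering out the time_start=0 rows discussed earlier; whether the code actually keys off time_start or another timestamp is an internal detail, so treat this as an approximation:

select from_unixtime(min(time_start)) from compy_job_table where time_start > 0;]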
Ah, that makes sense. I will let you know when the purges are done. I think at that point we can close this out.

-Paul Edmon-

On 04/28/2015 10:57 AM, bugs@schedmd.com wrote (Comment #85 on bug 1576 from Brian Christiansen):
> That makes sense. The code only walks up by month if the oldest record is more than 60 days older than the purge boundary. [...]
Ok. Let me know when you're done and then I'll close it out. Thanks for your help!
Okay, just finished. Got us down to 6 months. Thanks for all your help. This was a long one.

-Paul Edmon-

On 04/28/2015 11:07 AM, bugs@schedmd.com wrote (Comment #87 on bug 1576 from Brian Christiansen):
> Ok. Let me know when you're done and then I'll close it out. Thanks for your help!
Awesome! You can hold on to the patches until 15.08 or not worry about them since the original code will just be archiving one month at a time -- now that you are caught up. Thanks again for your help and patience.
I'll keep them anyway, if only out of paranoia. Thanks.

-Paul Edmon-

On 04/28/2015 03:36 PM, bugs@schedmd.com wrote:
> Brian Christiansen changed bug 1576:
>
> What           Removed       Added
> Status         IN_PROGRESS   RESOLVED
> Version Fixed                15.08.0pre4
> Resolution     ---           FIXED
>
> (Comment #89 on bug 1576 from Brian Christiansen)
> Awesome! You can hold on to the patches until 15.08 or not worry about them [...]