Ticket 1304 - slurmdbd crashed on mysql error ER_LOCK_WAIT_TIMEOUT
Summary: slurmdbd crashed on mysql error ER_LOCK_WAIT_TIMEOUT
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 14.03.9
Hardware: Linux
OS: Linux
Priority: ---
Severity: 2 - High Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-12-07 13:39 MST by Akmal Madzlan
Modified: 2014-12-12 08:55 MST

See Also:
Site: DownUnder GeoSolutions


Description Akmal Madzlan 2014-12-07 13:39:20 MST
Hi,
slurmdbd crashed with this error in the log. Is there any way to fix this?

[2014-12-08T00:45:04.530] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
update "london_last_ran_table" set hourly_rollup=1417786695, daily_rollup=1417786695, monthly_rollup=1417786695
[2014-12-08T00:45:04.530] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Comment 1 Danny Auble 2014-12-07 13:56:07 MST
As the message implies, the database wasn't responding for a long time, normally 15 minutes. When this happens, the only way to recover is to restart the calling program, which is the reason for the fatal. I don't know if there is much we can do here, since this is a mysql issue.

It is highly advised that you run the slurmdbd in front of the database instead of having the slurmctld access it directly. In addition to the numerous other advantages the slurmdbd offers, it would prevent the slurmctld from getting a fatal here.



Comment 2 Akmal Madzlan 2014-12-07 15:05:18 MST
Actually, we did start using slurmdbd recently. slurmctld didn't crash, but slurmdbd did. After restarting slurmdbd, it crashed again after a few minutes. Do you have any other suggestions?
Comment 3 Danny Auble 2014-12-07 15:34:16 MST
It is good that you have switched to the dbd. I am not sure what is happening to your database. Is it under heavy load? Something seems to be different now than before, since this is the first time you are seeing this. In my experience it is a very rare occurrence, usually caused by a query that takes too long to complete. Raising the debug level may give an idea of which query is causing the problem.

I know that 14.11 included work that sped up interactions with the database, which may help. You might consider upgrading at least the slurmdbd node to 14.11 to see if that helps. Have you tried restarting your database? Is the database used for anything else?
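
To see which query is failing, the slurmdbd log can be scanned for the error lines quoted in the description. The following is a minimal sketch, not part of Slurm: the function name `find_failed_queries` is hypothetical, and it assumes the SQL statement is logged on the line immediately after the `mysql_query failed` line, as in the excerpt above.

```python
import re

# Matches the slurmdbd error line format quoted in the ticket description.
ERROR_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: mysql_query failed: (?P<errno>\d+) (?P<msg>.*)$"
)


def find_failed_queries(log_lines):
    """Return (timestamp, errno, message, query) for each failed mysql_query.

    Assumption: the SQL statement follows on the next line and does not
    start with a '[' timestamp, as in the log excerpt in this ticket.
    """
    lines = [line.rstrip("\n") for line in log_lines]
    results = []
    for i, line in enumerate(lines):
        match = ERROR_RE.match(line)
        if not match:
            continue
        query = ""
        if i + 1 < len(lines) and not lines[i + 1].startswith("["):
            query = lines[i + 1].strip()
        results.append((match["ts"], int(match["errno"]), match["msg"], query))
    return results
```

Calling it with the lines of the slurmdbd log file (for example `find_failed_queries(open(path))`, where `path` is wherever your site writes that log) gives one tuple per failed query; error number 1205 corresponds to the lock wait timeout reported here.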

Comment 4 Akmal Madzlan 2014-12-07 15:44:08 MST
Thanks for that suggestion. I'll try that.
Comment 5 Danny Auble 2014-12-08 03:26:15 MST
Akmal, please check whether anything else was locking up the database. A common cause of this is a mysqldump running against it. Running mysqldump will sometimes cause all sorts of issues on a live database. If this is the situation, please make sure you use the following mysqldump options:

    --single-transaction
    --quick
    --lock-tables=false

Without them the database will lock up and you will get these kinds of issues. I am hoping this is the case, or that something similar was locking the tables and messing things up.
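
As an illustration of how a backup script might always pass these options, here is a minimal sketch. The helper name `build_mysqldump_command` is hypothetical, and the database name `slurm_acct_db` is an assumption; substitute whatever your slurmdbd is configured to use.

```python
import shlex


def build_mysqldump_command(database, extra_args=()):
    """Build a mysqldump argument list that avoids locking the tables."""
    return [
        "mysqldump",
        "--single-transaction",  # dump inside one transaction
        "--quick",               # stream rows instead of buffering whole tables
        "--lock-tables=false",   # do not take table locks for the dump
        *extra_args,
        database,
    ]


if __name__ == "__main__":
    # "slurm_acct_db" is an assumed database name, used only for illustration.
    print(shlex.join(build_mysqldump_command("slurm_acct_db")))
```

The list form can be handed directly to `subprocess.run`, which avoids shell quoting problems if connection options are added through `extra_args`.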
Comment 6 Danny Auble 2014-12-10 09:27:57 MST
Akmal, any more on this?
Comment 7 Akmal Madzlan 2014-12-10 13:21:01 MST
I've managed to get slurmdbd running without crashes.
Actually, this is the first time slurmdbd has been used on that particular cluster, so on initial start it tries to do a lot of rollups, which might cause mysql to become very busy.

So I let slurmdbd run first and let all the rollups finish before restarting slurmctld. It works fine now.
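
To get a feel for how far the rollups have progressed, the values in the failing `update` statement from the description can be decoded. This is a rough sketch that assumes the `*_rollup` values are Unix epoch seconds; the function name `rollup_times` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Picks out e.g. "hourly_rollup=1417786695" from the logged update statement.
ROLLUP_RE = re.compile(r"(hourly|daily|monthly)_rollup=(\d+)")


def rollup_times(update_stmt):
    """Map each rollup period to the UTC time found in the statement.

    Assumption: the stored values are Unix epoch seconds.
    """
    return {
        period: datetime.fromtimestamp(int(epoch), tz=timezone.utc)
        for period, epoch in ROLLUP_RE.findall(update_stmt)
    }
```

Under that assumption, the value 1417786695 from the log decodes to 2014-12-05 13:38:15 UTC, which can be compared against the log timestamps to see how much rollup work was still pending.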
Comment 8 David Bigagli 2014-12-12 08:55:25 MST
Thanks for the update. Please reopen if necessary.

David