Summary: | slurmdbd down on the 1st of each month | ||
---|---|---|---|
Product: | Slurm | Reporter: | Véronique Legrand <vlegrand> |
Component: | slurmdbd | Assignee: | Felip Moll <felip.moll> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | felip.moll, sophie.creno |
Version: | 16.05.9 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Institut Pasteur | Alineos Sites: | --- |
Version Fixed: | 16.05.9 | Target Release: | --- |
Description
Véronique Legrand
2017-10-24 02:41:35 MDT
Hi Véronique,

> ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the
> calling program

This is indeed a MySQL timeout, raised when the database does not respond within 50 seconds (the default). The only way to recover is to restart the calling program, which is why slurmdbd treats it as fatal.

One thing you can do if you have big queries is to avoid ER_LOCK_WAIT_TIMEOUT by increasing innodb_lock_wait_timeout in your my.cnf. You can find the details of this parameter here:

https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout

Could you try increasing it to 900 or greater, restart mysqld and the dbd, and check whether that solves the issue?

> So, I suppose that what happens, as jobs are running even at midnight, is:
> - slurmdbd tries to insert a record in the job_step_table while the
> database is locked for the purge.
> - As the purge takes a long time, the insert request times out.

Yes, your assumption is right. It is a bit more complex than that, but this is the idea.

> We didn't have that problem before, but this was maybe due to the fact that
> we had fewer jobs (usage of our cluster is always increasing), so the purge
> took less time...

This makes perfect sense. There are limits in the code, such as MAX_PURGE_LIMIT, that cap the number of records purged at a time so that locks can be released periodically. With an increasing number of jobs in your system, however, these limits can become insufficient, so the purge frequency or the timeouts should be tuned.

> If I understood well what I read in the documentation: at the beginning of
> the month and according to my configuration, Slurm purges all jobs and
> events that are older than 1 year and all job steps that are older than 1 month.
> Can you confirm that I understood well?
>
> If this is so, is there a way to have a shorter "purge interval"? I would
> like to see if the problem happens if we purged every week.

Yes, you understood correctly.
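As a concrete illustration of the advice above, the timeout can be raised in my.cnf (a sketch; 900 seconds is the value suggested in this ticket, tune it to your workload):

```
# my.cnf — [mysqld] section (sketch; 900 is the value suggested above)
[mysqld]
# Seconds an InnoDB transaction waits for a row lock before failing
# with ER_LOCK_WAIT_TIMEOUT. The default is 50.
innodb_lock_wait_timeout = 900
```

The variable is also dynamic, so it can be changed at runtime with `SET GLOBAL innodb_lock_wait_timeout = 900;` (this affects new sessions only; a restart of mysqld is still the cleanest way to make sure slurmdbd's connections pick it up).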
If you specify "months" as the time unit, the purge function is called once each month, so with "Purgexxxx=12months" everything older than 12 months is purged. You can also use "hours" or "days" as the time unit; for example, with PurgeStepAfter=7days, job step information is purged daily at 00:00:00, removing everything older than 7 days. This is detailed in the slurmdbd.conf documentation:

    PurgeStepAfter
        ... The time is a numeric value and is a number of months. If you want
        to purge more often you can include "hours", or "days" behind the
        numeric value to get those more frequent purges (i.e. a value of
        "12hours" would purge everything older than 12 hours). The purge takes
        place at the start of each purge interval. ...

Hope it helps, and please let me know if you have solved the problem.

Best regards,
Felip M

Hello Felip,

Thank you for your answer. We have changed the value of innodb_lock_wait_timeout to 900 as you advised, but I will not know whether it fixes the problem until November 1st...

As for the slurmdbd parameters, if I specify:

    PurgeStepAfter=31days

will all job steps older than 31 days be purged every day?

Regards,

Véronique

(In reply to Véronique Legrand from comment #3)
> Hello Felip,
>
> Thank you for your answer. We have changed the value of
> innodb_lock_wait_timeout to 900 as you advised. But I will not know if it
> fixes the problem until November 1st...
>
> As for the slurmdbd parameters,
> if I specify:
>
> PurgeStepAfter=31days
>
> will all job steps older than 31 days be purged every day?
>
> Regards,
>
> Véronique

There is an internal process that runs every hour. Within it, the archiving/purge functions are called depending on the time units you configured:
- If the units are months, the action runs on the 1st of each month at 00:00:00.
- If the units are days, the action runs every day at 00:00.
- If the units are hours, the action runs every hour at minute 00.
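Putting the purge settings discussed above together, a slurmdbd.conf fragment could look like the following (a sketch; the retention values are examples, adjust them to your site):

```
# slurmdbd.conf — purge settings (sketch; values are examples)
# The time unit controls both retention and when the purge fires:
# "months" -> 1st of each month at 00:00:00, "days" -> daily at 00:00,
# "hours" -> every hour at minute 00.
PurgeEventAfter=12months   # events older than 12 months, purged monthly
PurgeJobAfter=12months     # job records older than 12 months, purged monthly
PurgeStepAfter=7days       # step records older than 7 days, purged daily
```

Switching PurgeStepAfter from a "months" unit to a "days" unit is what spreads the heavy monthly purge into small daily ones, which is the mitigation explored in this ticket.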
Answering your question:

> will all job steps older than 31 days be purged every day?

Yes. Every day at 00:00, job steps older than now minus 31 days will be purged. Tell me if you make that change and how it goes.

Hello, today is 1st Nov. Have you observed any improvements? May we close the ticket?

Thanks,
Felip M

Hello Felip,

November 1st is a bank holiday here. I am happy to tell you that slurmdbd didn't crash yesterday.

Regards,
Véronique

(In reply to Véronique Legrand from comment #6)
> Hello Felip,
>
> November 1st is a bank holiday here. I am happy to tell you that slurmdbd
> didn't crash yesterday.
>
> Regards,
>
> Véronique

Thanks, Véronique. I will close the bug then, but keep an eye on it and just reopen it if the same problem arises again.

Best regards,
Felip M
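The trigger schedule described in this thread can be sketched in Python. This is an illustrative helper, not Slurm source code; `next_purge_run` is a hypothetical name that just encodes when the hourly internal process fires the purge for each time unit:

```python
from datetime import datetime, timedelta

def next_purge_run(now: datetime, unit: str) -> datetime:
    """Next time the purge fires, per the schedule described in this ticket.

    Illustrative sketch only: "hours" fires at minute 00, "days" at 00:00,
    "months" on the 1st of the month at 00:00:00.
    """
    if unit == "hours":
        # every hour at minute 00
        return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    if unit == "days":
        # every day at 00:00
        return now.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    if unit == "months":
        # the 1st of the next month at 00:00:00
        first = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        return (first + timedelta(days=32)).replace(day=1)
    raise ValueError(f"unknown unit: {unit!r}")

# With PurgeStepAfter=31days, the cutoff at each daily run is now - 31 days:
cutoff = datetime(2017, 11, 1) - timedelta(days=31)  # 2017-10-01 00:00
```

For the reporter's configuration this makes the difference concrete: with a "months" unit the whole month's backlog is purged in one pass at midnight on the 1st, while a "days" unit spreads the same work over thirty-one much smaller daily passes.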