Bug 4295

Summary: slurmdbd down on the 1st of each month
Product: Slurm Reporter: Véronique Legrand <vlegrand>
Component: slurmdbd Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: felip.moll, sophie.creno
Version: 16.05.9   
Hardware: Linux   
OS: Linux   
Site: Institut Pasteur
Version Fixed: 16.05.9

Description Véronique Legrand 2017-10-24 02:41:35 MDT
For two months now we have been finding the slurmdbd daemon down on every 1st of the month.
Error messages in the logs appear shortly after midnight.
They say:
2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_step_table" (job_db_inx, id_step, time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values (48088499, -2, 1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584', '252', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis
2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
 
In the slurmdbd.conf we have:
# CONTROLLER
AuthType=auth/munge
DbdHost=tars-acct
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
 
# DEBUG
DebugLevel=info
#DebugLevel=debug2
#DebugFlags=DB_ARCHIVE
 
# DATABASE
StorageType=accounting_storage/mysql
StoragePass=slurmdbd
StorageUser=slurmdbd
 
# ARCHIVES
ArchiveDir=/path/to/archives
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months
 
I have read in the slurmdbd.conf documentation: "The purge takes place at the start of each purge interval. For example, if the purge time is 2 months, the purge would happen at the beginning of each month."
So I suppose that what happens, since jobs are still running at midnight, is:
- slurmdbd tries to insert a record into the job_step_table while the database is locked for the purge.
- As the purge takes a long time, the insert request times out.
 
We didn't have this problem before, but that may be because we had fewer jobs (usage of our cluster is always increasing), so the purge took less time...
 
If I understood the documentation correctly: at the beginning of each month, and according to my configuration, Slurm purges all jobs and events that are older than 1 year
and all job steps that are older than 1 month.
Can you confirm that I understood this correctly?
 
If so, is there a way to have a shorter "purge interval"? I would like to see whether the problem still happens if we purge every week.
 
 
Regards,
 
 
Véronique
Comment 1 Felip Moll 2017-10-24 05:25:07 MDT
Hi Véronique,

> ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the
> calling program

This is indeed a MySQL timeout, raised when the database does not respond for 50 seconds (the default). The only way to recover from it is to restart the calling program, which is why slurmdbd exits with a fatal error.

One thing you can do if you have big queries is to avoid ER_LOCK_WAIT_TIMEOUT by increasing innodb_lock_wait_timeout in your my.cnf.

Here you can find the details of this parameter:

https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout

Could you try increasing this to 900 or greater, restarting mysqld and the dbd, and checking whether it solves the issue?
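As a rough sketch (the my.cnf location and existing contents vary by distribution, so treat this as illustrative rather than a drop-in file), the change goes in the [mysqld] section:

# /etc/my.cnf (location may differ on your system)
[mysqld]
innodb_lock_wait_timeout = 900

After restarting mysqld, you can confirm that the new value is in effect with:

mysql> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';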

> So, I suppose that what happens, as jobs are running even at midnight is:
> - slurmdbd tries to insert a record in the job_step_table whereas the
> database is locked for the purge.
> - As the purge takes a long time, the insert request times out.
>  

Yes, your assumption is right; it is a bit more complex than that, but this is the idea.


> We didn't have that problem before but this was maybe due to the fact that
> we had less jobs (usage of our cluster is always increasing) so, the purge
> took less time...
>  

This makes perfect sense. In fact, there are limits in the code, such as MAX_PURGE_LIMIT, that cap the number of records purged at a time so that locks can be released periodically.
However, with an increasing number of jobs on your system these limits may no longer be enough, so the purge frequency or the timeouts need to be tuned.
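If you want a feel for how much work the monthly step purge has to do, a rough check against the database would be something like the query below. This is only a sketch: it assumes your cluster is named "tars" (as the table name in your log suggests), and the exact column slurmdbd uses when purging may differ, so take it as an order-of-magnitude estimate.

mysql> SELECT COUNT(*) FROM tars_step_table WHERE time_start < UNIX_TIMESTAMP(NOW() - INTERVAL 1 MONTH);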


> If I understood well what I read from the documentation: at the beginning of
> the month and according to my configuration, slurm purges all jobs and
> events that are older than 1 year and all job steps that are older than 1 month.
> Can you confirm that I understood well?
>  
> If this is so, is there a way to have a shorter "purge interval"? I would
> like to see if the problem happens if we purged every week.
>  

Yes, you understood correctly. If you specify "months" as the time unit, then the purging function will be called each month.
And if you specify "Purgexxxx=12months", everything older than 12 months will be purged.

You can use "hours", or "days" as time unit, so for example:

With PurgeStepAfter=7days, job step info will be purged daily at 00:00:00, removing information older than 7 days.


This is detailed in the slurmdbd.conf documentation:

PurgeStepAfter
.......The time is a numeric value and is a number of months. If you want to purge more often you can include "hours" or "days" behind the numeric value to get those more frequent purges (i.e. a value of "12hours" would purge everything older than 12 hours). The purge takes place at the start of each purge interval........
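For instance, keeping your current retention but purging steps more often than monthly could look like this in the purge section of slurmdbd.conf (the values are only illustrative; keep whatever retention suits your site):

# With day units, the purge runs daily at 00:00 and removes steps older than 31 days
PurgeStepAfter=31days
# Month units are unchanged: this purge still runs on the 1st of each month
PurgeJobAfter=12months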


Hope it helps and please let me know if you have solved the problem.

Best regards,
Felip M
Comment 3 Véronique Legrand 2017-10-25 02:50:16 MDT
Hello Felip,

Thank you for your answer. We have changed the value of innodb_lock_wait_timeout to 900 as you advised, but I will not know whether it fixes the problem until November 1st...

As for the slurmdbd parameters,
if I specify :

PurgeStepAfter =31 days

will all job steps older than 31 days be purged every day?

Regards,

Véronique
Comment 4 Felip Moll 2017-10-25 11:14:30 MDT
(In reply to Véronique Legrand from comment #3)
> Hello Felip,
> 
> Thank you for your answer. We have changed the value of
> innodb_lock_wait_timeout to 900 as you advised. But I will not know if it
> fixes the problem until November 1st...
> 
> As for the slurmdbd parameters,
> if I specify :
> 
> PurgeStepAfter =31 days
> 
> will all job steps older than 31 days be purged every day?
> 
> Regards,
> 
> Véronique

There is an internal process that runs every hour. Inside this process the archiving/purge functions are called, depending on the time units you have configured.

If the units are months, the action is performed on the 1st of each month at 00:00:00.
If the units are days, the action is performed every day at 00:00.
If the units are hours, the action is performed every hour at minute 00.


Answering your question:

> will all job steps older than 31 day be purged every day?

Yes, every day at 00:00 job steps older than now minus 31 days will be purged.
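For reference, the examples in the man page write the option without spaces, so the line in slurmdbd.conf would look like:

PurgeStepAfter=31days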


Let me know if you make that change and how it goes.
Comment 5 Felip Moll 2017-11-01 10:16:24 MDT
Hello, today is 1st Nov. Have you observed any improvements?

May we close the ticket?

Thanks,
Felip M
Comment 6 Véronique Legrand 2017-11-02 02:33:02 MDT
Hello Felip,

November first is a bank holiday here. I am happy to tell you that slurmdbd didn't crash yesterday.

Regards,

Véronique
Comment 7 Felip Moll 2017-11-02 07:12:35 MDT
(In reply to Véronique Legrand from comment #6)
> Hello Felip,
> 
> November first is a bank holiday here. I am happy to tell you that slurmdbd
> didn't crash yesterday.
> 
> Regards,
> 
> Véronique

Thanks Véronique,

I will close the bug then, but please keep an eye on it and reopen it if the same problem arises again.

Best regards,
Felip M