Ticket 2344 - slurmdbd memory use and log size
Summary: slurmdbd memory use and log size
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 14.11.10
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
 
Reported: 2016-01-14 00:58 MST by Susan Chacko
Modified: 2016-01-14 05:51 MST

Site: NIH


Description Susan Chacko 2016-01-14 00:58:03 MST
We recently found that some sacct commands cause the slurmdbd memory usage to balloon up to 50 GB or more. Additionally, the slurmdbd.log filled with errors like:
[2016-01-09T03:44:15.787] error: packdouble: Buffer size limit exceeded (4294917478 > 4294901760)
[2016-01-09T03:44:15.787] error: pack32: Buffer size limit exceeded (4294917478 > 4294901760)

and filled the disk. Either the memory or the disk usage causes the slurmdbd process to die, and it then needs to be restarted.

We first noticed this problem a week after updating to 14.11.10, but we can't be sure the upgrade and the problem are actually related.

After some investigation, we've found that a command like

sacct -a -npo 'user,account,jobid,state' --starttime=2015-01-13T09:00:00

can 
- hang and not produce any output

- alternatively, produce an error like 
sacct: error: slurmdbd: Getting response to message type 1444
sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: No error

- cause the slurmdbd process to steadily start using more memory

- slurmdbd memory usage still increases after Ctrl-C'ing out of the sacct process

- the slurmdbd log fills with errors like the above and grows by 20-30 GB in a few minutes.

For the slurm user:
ulimit -a
core file size          (blocks, -c) 1
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257437
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 257437
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

There haven't been any changes to slurmdbd.conf since May 2015 (except for increasing the debug level just now to try to debug this problem). 

The slurmacct database has about 11M jobs + steps. 

Happy to provide any additional info you need.
Comment 1 Tim Wickberg 2016-01-14 03:09:00 MST
(In reply to Susan Chacko from comment #0)
> We recently found that some sacct commands cause the slurmdbd memory usage
> to balloon up to 50 GB or more. Additionally, the slurmdbd.log filled with
> errors like:
> [2016-01-09T03:44:15.787] error: packdouble: Buffer size limit exceeded
> (4294917478 > 4294901760)
> [2016-01-09T03:44:15.787] error: pack32: Buffer size limit exceeded
> (4294917478 > 4294901760)
> 
> and filled the disk. Either the memory or the disk usage causes the slurmdbd
> process to die, and it then needs to be restarted.
> 
> We first noticed this problem a week after updating to 14.11.10, but we
> can't be sure the upgrade and the problem are actually related.

How many jobs / job steps are you running per day?

It looks like the response message is trying to grow to over 4GB. Whether or not that's appropriate for your query is hard to say, although that certainly does sound like a bug. How large is the raw MySQL database at this point?

> After some investigation, we've found that a command like
> 
> sacct -a -npo 'user,account,jobid,state' --starttime=2015-01-13T09:00:00
> 
> can 
> - hang and not produce any output
> 
> - alternatively, produce an error like 
> sacct: error: slurmdbd: Getting response to message type 1444
> sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: No error
> 
> - cause the slurmdbd process to steadily start using more memory
> 
> - slurmdbd memory usage still increases after Ctrl-C'ing out of the sacct
> process

It's asynchronous - slurmdbd will still complete the query even if sacct disconnects. There's no mechanism to cancel the query at present, although that does sound like a good improvement to be made.

Can you run a command directly on the database like:

select count(*) from cluster_job_table where time_start > unix_timestamp('2016-01-13');
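
If it's easier to run from the shell, something like the following should work (sketch only - the slurm_acct_db database name and the cluster_ table prefix are assumptions; substitute your StorageLoc value and <ClusterName>_job_table):

# Database name and table prefix below are assumptions; adjust to your setup.
mysql -u slurm -p -e \
  "select count(*) from cluster_job_table where time_start > unix_timestamp('2016-01-13');" \
  slurm_acct_db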

Actually - I see the problem here - you're asking for over a year of records in one shot (2015-01-13T09:00:00). I don't know if that was intentional or not - it's 2016 now. I could easily see a year+ of data from your system producing a result that overflows our response buffer. You'd also probably rather not have 4GB of text dumped into your terminal window.

If you're intentionally trying to grab a year of results, you'll need to split it up into multiple query ranges to avoid issues.
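
For example, a rough sketch of walking through 2015 one month at a time (assumes bash and GNU date; the field list, dates, and output file are only illustrative, and jobs that span a month boundary can show up in more than one chunk):

# Sketch: one month per query instead of a full year in one shot.
for start in 2015-{01..12}-01 2016-01-01; do
    end=$(date -d "$start + 1 month" +%F)   # first day of the following month
    sacct -a -npo 'user,account,jobid,state' \
          --starttime="$start" --endtime="$end" >> jobs_since_2015.txt
done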

Admittedly, slurm should handle exceedingly large queries like this better - setting some sort of upper bound on per-query resource use to prevent users from DOS'ing slurmdbd, and checking whether our calls to pack() are succeeding or not (rather than continuing to print junk to the log and filling the filesystem), and stopping the query sooner. I can file an enhancement request for 16.05 to explore adding such checks if you'd like.

> - the slurmdbd log fills with errors like the above and grows by 20-30 GB in
> a few minutes.
> 
> For the slurm user:
> ulimit -a
> core file size          (blocks, -c) 1
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 257437
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 257437
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> There haven't been any changes to slurmdbd.conf since May 2015 (except for
> increasing the debug level just now to try to debug this problem). 
> 
> The slurmacct database has about 11M jobs + steps. 
> 
> Happy to provide any additional info you need.
Comment 2 Susan Chacko 2016-01-14 03:44:49 MST
(In reply to Tim Wickberg from comment #1)

> 
> select count(*) from cluster_job_table where time_start >
> unix_timestamp('2016-01-13');

Looks like about 110K jobs/steps 

> Actually - I see the problem here - you're asking for over a year of records
> in one shot (2015-01-13T09:00:00). I don't know if that was intentional or
> not - it's 2016 now. I could easily see a year+ of data from your system
> producing a result that overflows our response buffer. You'd also probably
> rather not have 4GB of text dumped into your terminal window.

Shoot, you're right. At least some of our test queries had the wrong year. It's possible one of our users was also running queries with the wrong timestamp. 

This brings up another couple of questions:  At this point our db contains all jobs (~ 11 M jobs/steps) from about a year ago when we first started up Slurm. Have we reached the point where we should be purging old completed jobs from the db? And is there a recommended method of archiving old job data from the database? We've thought of stopping slurmdbd, then copying the db file (or doing a mysql dump), deleting old completed jobs, and then restarting slurmdbd. Is that the best method? 

> Admittedly, slurm should handle exceedingly large queries like this better -
> setting some sort of upper bound on per-query resource use to prevent users
> from DOS'ing slurmdbd, and checking whether our calls to pack() are
> succeeding or not (rather than continuing to print junk to the log and
> filling the filesystem), and stopping the query sooner. I can file an
> enhancement request for 16.05 to explore adding such checks if you'd like.

That would be very useful. We can be careful going forward, but it's always possible that a user will inadvertently run a large query and cause this problem.

Thanks!
Susan
Comment 3 Tim Wickberg 2016-01-14 04:12:04 MST
> Shoot, you're right. At least some of our test queries had the wrong year. It's
> possible one of our users was also running queries with the wrong timestamp.
>
> This brings up another couple of questions:  At this point our db contains all
> jobs (~ 11 M jobs/steps) from about a year ago when we first started up Slurm.
> Have we reached the point where we should be purging old completed jobs from
> the db? And is there a recommended method of archiving old job data from the
> database? We've thought of stopping slurmdbd, then copying the db file (or
> doing a mysql dump), deleting old completed jobs, and then restarting slurmdbd.
> Is that the best method?

You've probably passed the point where the historical records are still
useful, but that does vary with each customer's demands.

slurmdbd can do all that for you... we usually try to avoid having 
people interact directly with MySQL except for debugging purposes - we 
do change some of the table structures and schema each release, and our 
supported API is through sacct / sacctmgr, not direct to MySQL.

You can load these archive files back in at any time with sacctmgr.

Please see http://slurm.schedmd.com/slurmdbd.conf.html for more details.

ArchiveDir=/path/to/archives/
ArchiveJobs=yes
# these aren't as important, you may want to leave them off:
ArchiveEvents=yes
ArchiveResvs=yes
ArchiveSteps=yes
# figure out which records to get rid of, and at what frequency
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
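
For reference, pulling data back (or forcing an archive pass by hand) is also done through sacctmgr; this is just a sketch, and the archive file name below is a placeholder for whatever slurmdbd actually writes into ArchiveDir:

# Force an archive/purge pass now rather than waiting for the schedule.
sacctmgr archive dump
# Load a previously written archive file back into the database.
sacctmgr archive load file=/path/to/archives/cluster_job_archive_file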

>> Admittedly, slurm should handle exceedingly large queries like this better -
>> setting some sort of upper bound on per-query resource use to prevent users
>> from DOS'ing slurmdbd, and checking whether our calls to pack() are
>> succeeding or not (rather than continuing to print junk to the log and
>> filling the filesystem), and stopping the query sooner. I can file an
>> enhancement request for 16.05 to explore adding such checks if you'd like.
>
> That would be very useful. We can be careful going forward, but it's always
> possible that a user will inadvertently run a large query and cause this
> problem.

Purging enough records that a query couldn't possibly request that much
data is the easiest fix for now. I'm guessing that if you keep ~2-3
months of data you'd be safe.

I've opened two enhancement requests related to this bug now, and we'll 
consider them as part of our normal release engineering process. 
(They're marked as internal, I can add you to them if you're interested.)

- Tim
Comment 4 Susan Chacko 2016-01-14 04:43:24 MST
(In reply to Tim Wickberg from comment #3)

Thanks a lot, Tim. We know what we need to do now, so you can close this. 

> I've opened two enhancement requests related to this bug now, and we'll 
> consider them as part of our normal release engineering process. 
> (They're marked as internal, I can add you to them if you're interested.)

Yes, please. 

Susan.
Comment 5 Tim Wickberg 2016-01-14 05:51:20 MST
Closing.

I've added you both to the enhancement requests now, as you've probably seen. No promises on when we get to those, though - 16.05 is coming up soon and we've got a fairly full plate for that release.

- Tim