Ticket 6712 - slurm_pack_list: size limit exceeded
Summary: slurm_pack_list: size limit exceeded
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 18.08.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-03-18 10:13 MDT by surendra
Modified: 2019-03-25 09:36 MDT

See Also:
Site: NREL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description surendra 2019-03-18 10:13:48 MDT
We are running into "sacct: error: slurmdbd: Query result exceeds size limit" when trying to query accounting data.

sacct -b -a --parsable2 --starttime=2019-02-21T12:00:00 --endtime=2019-02-21T13:00:00

Here is the log message from the slurmdbd during that timeframe:

==> /var/log/slurmdbd.log <==
[2019-03-18T09:54:01.513] error: slurm_pack_list: size limit exceeded

Here are the MySQL settings that we have for innodb_buffer_pool:

innodb_buffer_pool_size=4096M
innodb_buffer_pool_instances=8

innodb_lock_wait_timeout=100


Can you advise us whether this is due to innodb_buffer_pool_size, and recommend appropriate settings?
Comment 1 Marshall Garey 2019-03-18 14:26:38 MDT
This size limit is a built-in limit of 3 GB of data. If a query result exceeds 3 GB, slurmdbd returns the error ESLURM_RESULT_TOO_LARGE, which produces the error message you see. This is mentioned in the slurmdbd.conf man page - see MaxQueryTimeRange:

https://slurm.schedmd.com/slurmdbd.conf.html#OPT_MaxQueryTimeRange
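
For reference, a minimal slurmdbd.conf sketch using that option. The 30-day value is purely illustrative (the default is no limit), and the exact time formats accepted are documented in the man page linked above:

# slurmdbd.conf (excerpt, illustrative)
# Reject queries spanning more than 30 days outright, rather than letting
# them build a result set that runs into the 3 GB pack limit.
MaxQueryTimeRange=30-0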

Even though you're using -b, that flag (and all other sacct formatting flags) doesn't reduce the amount of data sent from slurmdbd to sacct. The implementation actually fetches the entire job record, and sacct then prints only the fields it cares about. We're aware of this limitation.

Normally I'd recommend that you reduce the time period of your query, but I notice that it's just a single hour of data. I'm curious how you have >3 GB worth of job and step records in just a single hour. Each job record is usually between 500 bytes and 1 kB (though it can vary, especially with large comment strings), but step records are a lot heavier than job records, and sacct returns step records as well. Maybe there are a ton of steps. Can you try running that same query, but with the -X flag as well (which will exclude steps from the query)?

sacct -X -b -a --parsable2 --starttime=2019-02-21T12:00:00 --endtime=2019-02-21T13:00:00
Comment 2 surendra 2019-03-18 15:23:05 MDT
I narrowed this down to a time window where sacct hangs.

# sacct  -X -b -a --parsable2 --starttime=2019-02-21T12:23:45 --endtime=2019-02-21T12:23:46 --format=jobid,elapsed,ncpus,ntasks,state | wc -l
397

[root@emgmt1 ~]# sacct  -b -a --parsable2 --starttime=2019-02-21T12:23:45 --endtime=2019-02-21T12:23:46 --format=jobid,elapsed,ncpus,ntasks,state | wc -l
1019653

There seems to be a ton of job steps within that time frame. Is it usual to record that many job steps in one second?

588497.39993|00:00:03|8|8|COMPLETED|588497.39993|COMPLETED|0:0
588497.39994|00:00:02|8|8|COMPLETED|588497.39994|COMPLETED|0:0
588497.39995|00:00:03|8|8|COMPLETED|588497.39995|COMPLETED|0:0
588497.39996|00:00:03|8|8|COMPLETED|588497.39996|COMPLETED|0:0
588497.39997|00:00:02|8|8|COMPLETED|588497.39997|COMPLETED|0:0
588497.39998|00:00:03|8|8|COMPLETED|588497.39998|COMPLETED|0:0
588497.39999|00:00:03|8|8|COMPLETED|588497.39999|COMPLETED|0:0
Comment 3 Marshall Garey 2019-03-18 15:42:56 MDT
Nice find. I don't often see how many job steps people are actually running, but if all those steps finished at the same time, then it's entirely possible for 40k steps to share the same end time (and that apparently happened). The steps record their own end time, which is eventually passed to the database - so the end time is not when the step records land in the database, but when the steps report "I'm done."
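
As a rough back-of-envelope check (the per-record size is an illustrative assumption, not a measured value): if a packed step record averages a few kB - say 3 kB, since steps are heavier than the 0.5-1 kB job records mentioned above - then the roughly one million records returned for that single second already amount to about 3 kB x 1,000,000 ≈ 3 GB, so the original one-hour query easily exceeds the built-in limit.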

When you see the "query result exceeds size limit" error in the future, I recommend following the procedure you just did - reduce the time span of the query and/or use the -X flag to eliminate steps. Is this a sufficient workaround?
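
As a rough illustration of that workaround (not something from this ticket): a small shell loop that walks a day hour by hour, keeping each query window short and excluding step records with -X. It assumes GNU date; adjust the date arithmetic and window size for your site.

for h in $(seq 0 23); do
    s=$(date -d "2019-02-21 + ${h} hours" +%Y-%m-%dT%H:%M:%S)
    e=$(date -d "2019-02-21 + $((h + 1)) hours" +%Y-%m-%dT%H:%M:%S)
    # -X drops step records; shrink the window further if a chunk still hits the limit
    sacct -X -b -a --parsable2 --starttime="$s" --endtime="$e"
done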


It's unfortunately not a very elegant solution. We are aware that we're sending back the entire job and step records, instead of just what was requested. However, changing this is not trivial and would likely require sponsorship.
Comment 4 surendra 2019-03-18 16:11:28 MDT
I will check with the team and get back to you on whether this works as a workaround. Also, what is the difference between the batch and extern steps?
Comment 5 Marshall Garey 2019-03-20 09:41:56 MDT
The batch step runs the actual batch script that the user submitted, and each srun launches another step. The extern step is created when PrologFlags=contain is set in slurm.conf and exists for pam_slurm_adopt: it creates an additional cgroup so that a user ssh'ing to the node can be adopted into the extern step's cgroup.
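
For context, the relevant configuration is a single line; this is an illustrative slurm.conf excerpt, not the site's actual file:

# slurm.conf (excerpt, illustrative)
# Creates the extern step (and its cgroup) at job start so pam_slurm_adopt
# can adopt incoming ssh sessions into it.
PrologFlags=contain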
Comment 6 Marshall Garey 2019-03-25 09:36:33 MDT
Closing as infogiven. Please re-open if you have further issues.