Created attachment 2627 [details]
slurm.conf file

Dear SLURM support,

this afternoon we started to see the below messages in the slurmctld.log file:

[2016-01-21T15:59:34.528] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T15:59:37.278] error: _create_job_record: MaxJobCount reached (10000)
[2016-01-21T16:00:09.964] error: _create_job_record: MaxJobCount reached (10000)
[2016-01-21T16:00:09.982] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:09.997] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:10.984] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:11.012] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:13.059] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:13.407] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:16.062] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:16.409] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:20.064] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:20.412] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:25.067] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:25.414] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:31.069] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:31.416] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:38.071] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:38.419] job_allocate: MaxJobCount limit reached (10000)
[2016-01-21T16:00:43.883] job_allocate: MaxJobCount limit reached (9814 + 1000 >= 10000)
[2016-01-21T16:00:44.886] job_allocate: MaxJobCount limit reached (9814 + 1000 >= 10000)
[2016-01-21T16:00:47.245] job_allocate: MaxJobCount limit reached (9840 + 1000 >= 10000)
[2016-01-21T16:00:50.503] job_allocate: MaxJobCount limit reached (9874 + 1000 >= 10000)

Users are complaining that they can't submit jobs to the cluster because they get this error:

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.

At the same time the following messages started to show up in the slurmdbd.log file:

[2016-01-21T15:00:05.407] Warning: Note very large processing time from hourly_rollup for sango: usec=5084951 began=15:00:00.322
[2016-01-21T16:00:05.596] Warning: Note very large processing time from hourly_rollup for sango: usec=5165133 began=16:00:00.431

If we issue 'squeue | wc -l' the total number is 875.

Attached you can find our slurm.conf file; our slurmdbd.conf is below.

Please let us know. Thank you for your support.

/etc/slurm/slurmdbd.conf

# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
DbdPort=7031
SlurmUser=root
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=sango-sched1
#StoragePort=1234
#StoragePass=password123
StorageUser=root
StorageLoc=slurm_acct_db
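For reference, one way to see how many job records are actually held in slurmctld's active database, as opposed to the 875 pending/running jobs that squeue shows by default, is to query all job states, including completed jobs not yet purged. A minimal sketch using standard squeue options (both counts include squeue's one-line header):

# Pending/running jobs only (default view)
squeue | wc -l

# All job records still in slurmctld memory, including
# completed jobs awaiting purge (governed by MinJobAge)
squeue -t all | wc -l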
Francesca,

in your slurm.conf I see:

#MaxJobCount

As MaxJobCount is commented out in your configuration, the maximum number of jobs Slurm can have in its active database at one time defaults to 10000.

On the other hand, you have:

MinJobAge=3000

This is the minimum age of a completed job before its record is purged from Slurm's active database. 3000 is quite a high value, so apart from the 875 jobs shown by squeue, the past (finished) jobs are also kept in the active database, and counted against MaxJobCount, for 3000 seconds.

I would suggest reducing MinJobAge to the default, 300 seconds.
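For reference, a minimal sketch of the suggested slurm.conf change (the value below is simply the documented default, not a site-specific recommendation):

# Purge completed job records after 300 seconds (the default),
# so they stop counting against MaxJobCount sooner.
MinJobAge=300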
(In reply to Alejandro Sanchez from comment #1)
> As MaxJobCount is commented out in your configuration, the maximum number
> of jobs Slurm can have in its active database at one time defaults to
> 10000.
>
> On the other hand, you have:
>
> MinJobAge=3000
>
> This is the minimum age of a completed job before its record is purged
> from Slurm's active database. 3000 is quite a high value, so apart from
> the 875 jobs shown by squeue, the past (finished) jobs are also kept in
> the active database, and counted against MaxJobCount, for 3000 seconds.
>
> I would suggest reducing MinJobAge to the default, 300 seconds.

Dear Alejandro,

thank you so much for your reply. If we set MinJobAge as you suggested and we modify MaxJobCount, do we have to restart the controller? If so, will the running jobs be preserved?

Thanks again
You can execute 'scontrol reconfigure'. Running jobs should be preserved.
(In reply to Alejandro Sanchez from comment #3)
> You can execute 'scontrol reconfigure'. Running jobs should be preserved.

Sorry, you need to restart the slurmctld daemon so that hash tables can be rebuilt.
From the slurm.conf man page:

MaxJobCount
    The maximum number of jobs Slurm can have in its active database ...
    This value may not be reset via "scontrol reconfig". It only takes
    effect upon restart of the slurmctld daemon.
(In reply to Moe Jette from comment #4)
> (In reply to Alejandro Sanchez from comment #3)
> > You can execute 'scontrol reconfigure'. Running jobs should be preserved.
>
> Sorry, you need to restart the slurmctld daemon so that hash tables can be
> rebuilt.

Alex has helped clarify this for me. You need to restart slurmctld to change MaxJobCount, but MinJobAge can be changed with "scontrol reconfig". You may want to change both.

I am also changing the error message to be a bit more clear:

job_allocate: MaxJobCount limit from slurm.conf reached (10000)
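In practice the two changes are applied differently; a sketch follows (the restart line assumes a systemd-managed slurmctld, so adjust to your init system). Running jobs survive the restart because slurmctld recovers job state from the StateSaveLocation:

# MinJobAge: takes effect with a reconfigure, no restart needed
scontrol reconfigure

# MaxJobCount: requires a full restart of the controller daemon
systemctl restart slurmctld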
(In reply to Moe Jette from comment #7)
> I am also changing the error message to be a bit more clear:
>
> job_allocate: MaxJobCount limit from slurm.conf reached (10000)

Message change in this commit:
https://github.com/SchedMD/slurm/commit/f11ec171cd9f5276f2dbc1ea5a4bfdae012b68e8
Is your system running fine now?
(In reply to Moe Jette from comment #9)
> Is your system running fine now?

Hi Moe, Alejandro,

thank you so much for your replies. For the moment we changed

MinJobAge=300

and everything seems ok now. Today we will keep an eye on the scheduler. Do you have any best practice about setting MaxJobCount to a proper number?

Thanks a lot!
(In reply to Francesca Tartaglione from comment #10)
> thank you so much for your replies. For the moment we changed
>
> MinJobAge=300
>
> and everything seems ok now.

Very good.

> Today we will keep an eye on the scheduler. Do you have any best practice
> about setting MaxJobCount to a proper number?

That depends upon your workload. You could probably set MaxJobCount to at least 50000 with most systems (assuming you have at least a few gigabytes of memory). Some sites run with a value of 1000000 or more.
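As a concrete illustration, the resulting slurm.conf lines could look like the sketch below. These values are examples consistent with the guidance above, not universal recommendations, and remember that MaxJobCount only takes effect after a slurmctld restart:

MaxJobCount=50000   # example: ample headroom for most systems with a few GB of RAM
MinJobAge=300       # default: purge completed job records after 300 seconds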
(In reply to Moe Jette from comment #11)
> > Today we will keep an eye on the scheduler. Do you have any best practice
> > about setting MaxJobCount to a proper number?
>
> That depends upon your workload. You could probably set MaxJobCount to at
> least 50000 with most systems (assuming you have at least a few gigabytes
> of memory). Some sites run with a value of 1000000 or more.

Hi Moe,

thanks a lot. We are actually planning to modify that value as well.

Best regards