Yesterday we successfully upgraded Slurm from 17.02.9 to 17.11.5. However, some 100+ jobs in the queue entered a strange state (JobState=PENDING Reason=BadConstraints) after the upgrade. Most of these incorrect BadConstraints states were fixed by holding and then releasing the jobs. However, we have a few jobs that are not fixed by hold+release. One example is:

JobId=460138 JobName=TEST-BaBr2-CdI2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=-132215 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Thu 07:52:24 EligibleTime=Thu 07:52:24
   StartTime=Fri 09:50:32 EndTime=Sun 10:50:32 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Fri 09:50:19
   Partition=xeon24 AllocNode:Sid=sylg:34837
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm
   StdErr=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm/slurm-460138.out
   StdIn=/dev/null
   StdOut=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm/slurm-460138.out
   Power=

The user has cancelled and resubmitted these jobs, but it would be good to eliminate this bogus BadConstraints state if possible.

Thanks, Ole
Hi. In order to analyze this issue we would need:

- slurm.conf and potentially gres.conf/cgroup.conf as well.
- Example of job submission request (command line and script if batch job).
- Logs from slurmctld.log related to the job example.

Thanks.
I'm out of the office until April 3.

Best regards / Venlig hilsen,
Ole Holm Nielsen
Created attachment 6522 [details] slurm.conf
Created attachment 6523 [details] gres.conf
Created attachment 6524 [details] cgroup.conf
(In reply to Alejandro Sanchez from comment #1)
> - slurm.conf and potentially gres.conf/cgroup.conf as well.

Files attached.

> - Example of job submission request (command line and script if batch job).

This user is still on Easter holiday; I'll have to get his example later.

> - Logs from slurmctld.log related to the job example.

# zcat slurmctld.log-20180325.gz | grep 460138
[2018-03-22T07:52:24.892] _slurm_rpc_submit_batch_job JobId=460138 usec=1544
[2018-03-22T07:53:09.467] _slurm_rpc_top_job for 460138 usec=651
[2018-03-22T07:53:53.789] _slurm_rpc_top_job for 460138 usec=648
[2018-03-22T13:18:43.175] Recovered JobID=460138 State=0x0 NodeCnt=0 Assoc=216
[2018-03-22T14:28:48.241] Recovered JobID=460138 State=0x0 NodeCnt=0 Assoc=216
[2018-03-22T16:17:48.278] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-22T16:17:48.278] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-22T16:32:05.630] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=304
[2018-03-22T16:32:10.478] error: sched: Attempt to modify priority for job 460138
[2018-03-22T16:32:10.478] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-22T16:32:48.102] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=358
[2018-03-22T16:33:04.475] error: sched: Attempt to modify priority for job 460138
[2018-03-22T16:33:04.475] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-23T09:44:59.620] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:44:59.620] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=387
[2018-03-23T09:45:00.192] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:45:00.192] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:49:55.511] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=265
[2018-03-23T09:49:55.523] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:49:55.523] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=333
[2018-03-23T09:49:56.918] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:49:56.918] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:50:19.227] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=336
[2018-03-23T09:50:19.242] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:50:19.242] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=2334
[2018-03-23T09:50:19.956] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:50:19.956] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:50:57.644] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=318
[2018-03-23T09:50:58.659] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:50:58.659] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=354
[2018-03-23T09:51:01.166] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:51:01.166] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:53:04.617] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=349
[2018-03-23T09:53:33.684] error: sched: Attempt to modify priority for job 460138
[2018-03-23T09:53:33.685] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-23T09:58:38.867] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:58:38.868] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=401
[2018-03-23T09:58:39.359] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:58:39.359] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T10:02:15.811] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 460138 uid 221341
[2018-03-23T10:03:06.041] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 460138 uid 221341
We now have a number of jobs caught in the Reason=BadConstraints state, both jobs submitted under 17.02.9 and jobs submitted under 17.11.5 (upgraded on March 22). An example of this is:

# scontrol show job 470506
JobId=470506 JobName=c2dm.pdos
   UserId=mogje(22231) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:1
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2018-03-26T09:58:06 EligibleTime=2018-03-26T09:58:06
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-01T03:52:34
   Partition=xeon24 AllocNode:Sid=sylg:45931
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x178
   BatchHost=x178
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=500G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm
   StdErr=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.err
   StdIn=/dev/null
   StdOut=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.out
   Power=

Please note that this job has apparently been assigned to NodeList=x178, which is currently running another job.
The slurmctld log files have these lines on job 470506:

[2018-03-26T09:58:06.307] _slurm_rpc_submit_batch_job: JobId=470506 InitPrio=9156 usec=1273
[2018-03-28T08:34:31.093] backfill: Started JobID=470506 in xeon24 on x178
[2018-03-28T09:12:28.225] Batch JobId=470506 missing from node 0 (not found BatchStartTime after startup), Requeuing job
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1 WTERMSIG 126
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1 cancelled by node failure
[2018-03-28T09:12:28.225] _job_complete: requeue JobID=470506 State=0x8000 NodeCnt=1 due to node failure
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x8000 NodeCnt=1 done
[2018-04-01T03:52:34.747] _build_node_list: No nodes satisfy job 470506 requirements in partition xeon24
[2018-04-01T03:52:34.747] sched: schedule: JobID=470506 State=0x0 NodeCnt=1 non-runnable: Requested node configuration is not available

There was a power hiccup on 2018-03-28T09:12 causing some nodes to go offline (reason unknown; it may be that the Omni-Path fabric was unavailable for some seconds). However, the nodes were returned to the running state shortly afterwards.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> We now have a number of jobs caught up in the Reason=BadConstraints mode,
> both jobs submitted under 17.02.9 as well as 17.11.5 (upgraded on March 22).
>
> An example of this is:
>
> # scontrol show job 470506
> JobId=470506 JobName=c2dm.pdos
> UserId=mogje(22231) GroupId=camdvip(1250) MCS_label=N/A
> Priority=0 Nice=0 Account=camdvip QOS=normal
> JobState=PENDING Reason=BadConstraints Dependency=(null)
> Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:1
> RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
> SubmitTime=2018-03-26T09:58:06 EligibleTime=2018-03-26T09:58:06
> StartTime=Unknown EndTime=Unknown Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2018-04-01T03:52:34
> Partition=xeon24 AllocNode:Sid=sylg:45931
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=x178
> BatchHost=x178
> NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=24,mem=500G,node=1,billing=24
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=(null) Reservation=(null)
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=(null)
> WorkDir=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm
> StdErr=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.err
> StdIn=/dev/null
> StdOut=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.out
> Power=
>
> Please note that this job has apparently been assigned to NodeList=x178,
> which is currently running another job.
> The slurmctld log files have these lines on job 470506:
>
> [2018-03-26T09:58:06.307] _slurm_rpc_submit_batch_job: JobId=470506
> InitPrio=9156 usec=1273
> [2018-03-28T08:34:31.093] backfill: Started JobID=470506 in xeon24 on x178
> [2018-03-28T09:12:28.225] Batch JobId=470506 missing from node 0 (not found
> BatchStartTime after startup), Requeuing job
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1
> WTERMSIG 126
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1
> cancelled by node failure
> [2018-03-28T09:12:28.225] _job_complete: requeue JobID=470506 State=0x8000
> NodeCnt=1 due to node failure
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x8000 NodeCnt=1
> done
> [2018-04-01T03:52:34.747] _build_node_list: No nodes satisfy job 470506
> requirements in partition xeon24
> [2018-04-01T03:52:34.747] sched: schedule: JobID=470506 State=0x0 NodeCnt=1
> non-runnable: Requested node configuration is not
> available
>
> There was a power hiccup on 2018-03-28T09:12 causing some nodes to go
> offline (reason unknown; it may be that the Omni-Path fabric was unavailable
> for some seconds). However, the nodes were returned to the running state
> shortly afterwards.

I did a hold-then-release of JobID=470506, but the job is still in the BadConstraints state. The slurmctld log says:

[2018-04-03T14:51:43.258] _slurm_rpc_update_job: complete JobId=470506 uid=0 usec=320
[2018-04-03T14:51:44.274] sched: _release_job_rec: release hold on job_id 470506 by uid 0
[2018-04-03T14:51:44.274] _slurm_rpc_update_job: complete JobId=470506 uid=0 usec=378
[2018-04-03T14:51:46.067] _build_node_list: No nodes satisfy job 470506 requirements in partition xeon24
[2018-04-03T14:51:46.067] sched: schedule: JobID=470506 State=0x0 NodeCnt=1 non-runnable: Requested node configuration is not available

The message "Requested node configuration is not available" appears to be completely wrong.
PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO

NodeName=x[001-168],x[181-192] Weight=10424 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752 Feature=xeon2650v4,opa,xeon24

Isn't DefMemPerCPU=8000 too high?
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #6)
> (In reply to Alejandro Sanchez from comment #1)
> > - Example of job submission request (command line and script if batch job).

The submit command is reported by the user as:

sbatch -J TEST --mail-type=FAIL --partition=xeon24 -N 21 -n 504 --time=48:00:00 --mem=0 run.py
(In reply to Alejandro Sanchez from comment #9)
> PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00
> DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO
> NodeName=x[001-168],x[181-192] Weight=10424 Boards=1 SocketsPerBoard=2
> CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752
> Feature=xeon2650v4,opa,xeon24
>
> Isn't DefMemPerCPU=8000 too high?

The nodes in partition xeon24 have 256 GB of RAM and 24 CPU cores. DefMemPerCPU=8000 would add up to only 192 GB, right?
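A quick sanity check of that arithmetic (a throwaway sketch of my own; the values are taken from the slurm.conf excerpt quoted above):

```python
# Values from the quoted slurm.conf: DefMemPerCPU=8000 (MiB),
# 2 sockets x 12 cores = 24 CPUs, RealMemory=256000 (MiB).
def_mem_per_cpu = 8000
cpus_per_node = 24
real_memory = 256000

default_total = def_mem_per_cpu * cpus_per_node
print(default_total)                  # 192000 MiB, well below RealMemory
print(default_total <= real_memory)   # True: the default allocation fits the node
```

So a fully packed node using only the default per-CPU memory stays comfortably under RealMemory.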
Yeah sorry, the DefMemPerCPU isn't incorrect, my fault. I find this strange though:

JobId=460138
...
   Partition=xeon24 AllocNode:Sid=sylg:34837
...
   NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   ^^

> # scontrol show job 470506
> JobId=470506 JobName=c2dm.pdos
...
> Partition=xeon24 AllocNode:Sid=sylg:45931
...
> NodeList=x178
> BatchHost=x178
> NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=24,mem=500G,node=1,billing=24
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
  ^^           ^^

Is --mincpus requested/updated at some point with a value > 24?
(In reply to Alejandro Sanchez from comment #12)

Thanks for identifying the strange parameters:

> I find this strange though:
>
> JobId=460138
> ...
> Partition=xeon24 AllocNode:Sid=sylg:34837
> ...
> NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=960,node=40
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
> ^^

Strange indeed! The node has 256 GB RAM, configured in slurm.conf with only MaxMemPerCPU=10500. The user requests 250 GB, equivalent to 10417 MB/cpu, which is less than 10500. Could it be that the scheduler miscalculates 250 GB as exceeding MaxMemPerCPU=10500 and therefore requiring 25 CPUs (out of the 24 available)? This might be a genuine bug!

> > # scontrol show job 470506
> > JobId=470506 JobName=c2dm.pdos
> ...
> > Partition=xeon24 AllocNode:Sid=sylg:45931
> ...
> > NodeList=x178
> > BatchHost=x178
> > NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> > TRES=cpu=24,mem=500G,node=1,billing=24
> > Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> > MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
> ^^ ^^
>
> Is --mincpus requested/updated at some point with a value > 24?

No updating has been done. But I now see that the user has submitted to the wrong partition: xeon24 (256 GB nodes), while asking for 500 GB, which requires 48 CPUs (possibly miscalculated as 49 CPUs, as above) in this partition. I will inform the user of his error.
We have a lot of jobs by another user apparently displaying the same bug in the scheduler's calculation of memory requirements:

# scontrol show job 447332
JobId=447332 JobName=c2dm.strains
   UserId=mikst(28594) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2018-03-12T09:02:34 EligibleTime=2018-03-12T09:02:34
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-03T14:51:00
   Partition=xeon24 AllocNode:Sid=thul:4469
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm
   StdErr=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm/slurm-c2dm.strains-447332.err
   StdIn=/dev/null
   StdOut=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm/slurm-c2dm.strains-447332.out
   Power=

The user asks for 250 GB per node, which is just below the total of 252 GB implied by the configured MaxMemPerCPU=10500. Or may there be some confusion about specifying memory size in GiB versus GB units?
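The GiB/GB question can be checked with a few lines. This sketch (my own arithmetic, not Slurm code) compares the two readings of --mem=250G against MaxMemPerCPU=10500 on a 24-core node; note that Slurm's "G" suffix means multiples of 1024 MiB:

```python
max_mem_per_cpu = 10500   # MiB, from the partition definition
cpus = 24

# Slurm parses the "G" suffix as GiB, i.e. multiples of 1024 MiB.
binary_per_cpu = 250 * 1024 / cpus    # ~10666.7 MiB/CPU
# Reading "250G" as decimal GB suggests a smaller per-CPU share.
decimal_per_cpu = 250 * 1000 / cpus   # ~10416.7 MB/CPU

print(binary_per_cpu > max_mem_per_cpu)    # True: over the limit as Slurm sees it
print(decimal_per_cpu < max_mem_per_cpu)   # True: looks fine in decimal units
```

So a request that looks marginally under the limit in decimal GB is marginally over it in Slurm's binary units, which would be consistent with the MinCPUsNode=25 adjustment seen on these jobs.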
A follow-up question: Our partition xeon24 has the default parameter MaxCPUsPerNode=UNLIMITED:

# scontrol show partition xeon24
PartitionName=xeon24
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=x[001-192]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4608 TotalNodes=192 SelectTypeParameters=NONE
   DefMemPerCPU=10000 MaxMemPerCPU=10750

(Note that I've increased the DefMemPerCPU/MaxMemPerCPU values slightly today.)

The slurm.conf man page only gives a GPU-related usage example of MaxCPUsPerNode. I'm thinking that it's a good idea to set MaxCPUsPerNode=24 for 24-core nodes (and similarly for other core counts) so that no job will ever request >24 cores due to a bad memory request.

Question: Can you confirm that it's advisable, or even best practice, to set MaxCPUsPerNode equal to the number of CPU cores in the node? What would happen to a job submission where the requested cores per node and/or memory size exceeds the limits? Hopefully the job would be rejected at submission, or fail shortly thereafter? If we want to reserve some cores for GPUs or other tasks, MaxCPUsPerNode could of course be less than the number of physical cores.
Do you have another example job request with the command line options for submission, with just 2 nodes and with the partition config as described in your last comment? I'm still unable to reproduce the MinCPUsNode=25 situation.

Are all the nodes in that partition homogeneous in terms of sockets/cores and memory?

Let's see if, with your job submission reproducer, I can reproduce locally too and narrow the problem down.
(In reply to Alejandro Sanchez from comment #16)
> Do you have another example job request with the command line options for
> submission, with just 2 nodes and with the partition config as described in
> your last comment? I'm still unable to reproduce the MinCPUsNode=25
> situation.
>
> Are all the nodes in that partition homogeneous in terms of sockets/cores
> and memory, right?
>
> Let's see if with your job submission reproducer I can locally reproduce too
> and narrow the problem down.

I'm setting up tests to make this reproducible, and I have further interesting observations.

Firstly, I reserved node x001 to allow only user ohni (myself):

scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 nodes=x001 user=ohni

Secondly, I defined a new xeon24_test partition containing only node x001, with ridiculously low DefMemPerCPU=6000 MaxMemPerCPU=7000 values for testing purposes only. Our 24-core nodes now span these 3 partitions:

PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=10000 MaxMemPerCPU=10750 State=UP OverSubscribe=NO
# Nodes x169-x180 have big memory (512G):
PartitionName=xeon24_512 Nodes=x[169-180] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=20000 MaxMemPerCPU=21500 State=UP OverSubscribe=NO PriorityJobFactor=5000
# Test partition
PartitionName=xeon24_test Nodes=x001 DefaultTime=1:00:00 MaxTime=2:00:00 DefMemPerCPU=6000 MaxMemPerCPU=7000 State=UP OverSubscribe=NO PriorityJobFactor=5000

As soon as I did "scontrol reconfigure", some jobs got into a bad state as reported by squeue:

438338 xeon24 sbatch tdeilm PD 0:00 20 (ReqNodeNotAvail, UnavailableNodes:b072,d067,i042)

The UnavailableNodes belong to different partitions (xeon8 and xeon16), and they are down or draining due to hardware errors. Jobs in the xeon24 partition would *never* be schedulable on nodes in the xeon8 or xeon16 partitions!
Thirdly, as a quick workaround I removed node x001 from the standard xeon24 partition (and did "scontrol reconfigure"). The ReqNodeNotAvail problem disappeared once the normal xeon24 partition no longer contained node x001:

PartitionName=xeon24 Nodes=x[002-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=10000 MaxMemPerCPU=10750 State=UP OverSubscribe=NO

Summary: I created a reservation of node x001 exclusively for myself. Then I created a new partition, purposely with too small a value for MaxMemPerCPU. The scheduler apparently considers job 438338 (of user tdeilm) for running on the reserved node x001, decides that this node's memory is too small, and hence blocks the job with a ReqNodeNotAvail status. The scheduler also points to totally irrelevant UnavailableNodes:b072,d067,i042.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #17)

With the partition xeon24_test defined in comment #17 I've submitted this test script:

#!/bin/bash -x
#SBATCH --reservation=Test1
#SBATCH --nodes=1
#SBATCH --mem=250G
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:00:20
#SBATCH --partition=xeon24_test
#SBATCH --mail-type ALL
ulimit -a

Due to the ridiculously low DefMemPerCPU=6000 MaxMemPerCPU=7000 set for testing purposes, the job gets into a PENDING state:

JobId=488578 JobName=limit_test.sh
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=127369 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-04-13T11:22:36 EligibleTime=2018-04-13T11:22:36
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-13T11:25:52
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=250G,node=1
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=37 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/ohni/limit_test.sh
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-488578.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-488578.out
   Power=

Please note the MinCPUsNode=37! Asking for 250 GB of memory when only 24*7000 MiB ~= 164 GB is available, the scheduler apparently concludes that the number of CPU cores required is 250*1024/7000 = 36.57 ~= 37 cores.

Question: Is this the expected behavior of the Slurm scheduler?
If so, perhaps we really need to redefine the partition's default value MaxCPUsPerNode=UNLIMITED to the number of physical cores in the node, as I suggested in comment #15.

If I replace the memory requirement in the script by zero:

#SBATCH --mem=0

then such a job runs just fine with MinCPUsNode=24:

JobId=488670 JobName=limit_test.sh
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=127364 Nice=0 Account=camdvip QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-04-13T11:31:31 EligibleTime=2018-04-13T11:31:31
   StartTime=2018-04-13T11:32:08 EndTime=2018-04-13T11:32:10 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-13T11:32:08
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=250G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/ohni/limit_test.sh
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-488670.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-488670.out
   Power=

Please note the TRES=cpu=24,mem=250G. How could mem=250G ever be calculated by the scheduler? The MaxMemPerCPU=7000 corresponds to 164.0 GB of node memory.
I can finally reproduce this now. There's definitely something wrong in the increase of CPUs due to memory. Thanks for the info.

Regarding comment #17: changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons. All slurmd daemons must know each node in the system to forward messages in support of hierarchical communications.
(In reply to Alejandro Sanchez from comment #19)
> I can finally reproduce now. There's definitely something wrong increasing
> the cpus due to memory. Thanks for the info.

I'm delighted that you can reproduce this! Looking forward to a patch.

> Regarding comment 17, Changes in node configuration (e.g. adding nodes,
> changing their processor count, etc.) require restarting both the
> slurmctld daemon and the slurmd daemons. All slurmd daemons must know each
> node in the system to forward messages in support of hierarchical
> communications.

I know this, but I didn't add any nodes or change processor counts, etc. Which of the changes I made do you think requires restarting slurmctld and possibly slurmd?

In comment #18 a strange mem=250G for the node appears without any obvious cause. How could this memory size ever come about when I've configured the partition for 164 GB?
After doing some more tests, there are a few things we can discuss. I'm using your original configuration:

NodeName=x[001-002] NodeHostname=ibiza Weight=10424 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752 Feature=xeon2650v4,opa,xeon24 Port=61201-61202
PartitionName=xeon24 Nodes=x[001-002] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO

No reservations. So the nodes have the relevant info:

RealMemory=256000 CPUTot=24

And the partition:

DefMemPerCPU=8000 MaxMemPerCPU=10500

So with these resources and limits, a job can theoretically request at most:

MaxMemPerCPU=10500 * CPUTot=24 = 252000 MiB / 1024 = 246.09375 GiB

alex@ibiza:~/t$ sbatch -p xeon24 --mem=246G --wrap="true"
Submitted batch job 20002
alex@ibiza:~/t$ scontrol show job 20002
JobId=20002 JobName=wrap
   UserId=alex(1000) GroupId=alex(1000) MCS_label=N/A
   Priority=110260 Nice=0 Account=acct1 QOS=normal WCKey=*
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2018-04-16T16:40:08 EligibleTime=2018-04-16T16:40:08
   StartTime=2018-04-16T16:40:08 EndTime=2018-04-16T16:40:09 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-16T16:40:08
   Partition=xeon24 AllocNode:Sid=ibiza:30893
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=24 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=246G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=246G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/alex/t
   StdErr=/home/alex/t/slurm-20002.out
   StdIn=/dev/null
   StdOut=/home/alex/t/slurm-20002.out
   Power=
alex@ibiza:~/t$

OK, as expected the job completes.
Now, if we request more --mem than the maximum in the partition, for example --mem=250G (that's 250*1024 = 256000 MiB, which is higher than the limit of 246.09375 GiB or 252000 MiB):

alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
alex@ibiza:~/t$

and these messages are logged to ctld.log:

slurmctld: debug: Setting job's pn_min_cpus to 25 due to memory limit
slurmctld: _build_node_list: No nodes satisfy job 20003 requirements in partition xeon24
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available

Note this one:

slurmctld: debug: Setting job's pn_min_cpus to 25 due to memory limit

This automatic adjustment can be found in the logic of the function _valid_pn_min_mem() in src/slurmctld/job_mgr.c. The reasoning behind it comes from these old commits:

https://github.com/SchedMD/slurm/commit/b6ce37103b6573
https://github.com/SchedMD/slurm/commit/0fbdb9e2bf5faf

What the adjustment tries to do is: "OK, the memory you request is higher than MaxMemPerCPU, so I'm going to modify the request so that more CPUs are requested; since the memory is then divided by more CPUs, perhaps the limit can now be satisfied."

It's arguable whether these automatic adjustments should happen underneath to try to make the job request fit the limits. We're actually considering removing them in 18.08 (since that would be a change in functionality). But for now, Slurm works as expected for the shown requests.
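As a simplified model of the adjustment just described (my own sketch, not the actual _valid_pn_min_mem() code), it amounts to a ceiling division of the per-node memory request by MaxMemPerCPU, and it reproduces the MinCPUsNode values seen earlier in this ticket:

```python
import math

def adjusted_pn_min_cpus(mem_per_node_mib, max_mem_per_cpu_mib):
    # Bump the per-node CPU count until mem/cpus <= MaxMemPerCPU.
    return math.ceil(mem_per_node_mib / max_mem_per_cpu_mib)

# Job 460138: --mem=250G against MaxMemPerCPU=10500 on 24-core nodes
print(adjusted_pn_min_cpus(250 * 1024, 10500))  # 25 -> MinCPUsNode=25

# Test job 488578: --mem=250G against the test partition's MaxMemPerCPU=7000
print(adjusted_pn_min_cpus(250 * 1024, 7000))   # 37 -> MinCPUsNode=37
```

Both results exceed the 24 physical cores, which would explain why these jobs end up non-runnable with Reason=BadConstraints.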
Where I think there's a bug is when you request --mem=0:

alex@ibiza:~/t$ sbatch -N1 -p xeon24 --mem=0G --wrap="true"
Submitted batch job 20006
alex@ibiza:~/t$ scontrol show job 20006
JobId=20006 JobName=wrap
   UserId=alex(1000) GroupId=alex(1000) MCS_label=N/A
   Priority=87362 Nice=0 Account=acct1 QOS=normal WCKey=*
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2018-04-16T16:58:07 EligibleTime=2018-04-16T16:58:07
   StartTime=2018-04-16T16:58:07 EndTime=2018-04-16T16:58:07 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-16T16:58:07
   Partition=xeon24 AllocNode:Sid=ibiza:30893
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=250G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/alex/t
   StdErr=/home/alex/t/slurm-20006.out
   StdIn=/dev/null
   StdOut=/home/alex/t/slurm-20006.out
   Power=
alex@ibiza:~/t$

From the --mem sbatch man page:

NOTE: A memory size specification of zero is treated as a special case and grants the job access to all of the memory on each node.

So --mem=0 should be treated as if the job requested all memory on the node. Since RealMemory=256000, --mem=0 should be equivalent to --mem=250G (256000/1024). But in practice --mem=0 is accepted and incorrectly bypasses the MaxMemPerCPU limit (this is a bug), while --mem=250G is correctly rejected. So I'm going to investigate the --mem=0 case, but I hope the rest makes sense to you. Otherwise please ask further. Thanks.
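A hypothetical sketch of the missing check (illustrative only; the real validation lives in slurmctld and none of these function names are Slurm's): resolving --mem=0 to RealMemory before applying the MaxMemPerCPU comparison would make the two requests behave consistently:

```python
def effective_mem_mib(requested_mib, real_memory_mib):
    # Per the sbatch man page, --mem=0 means "all memory on each node".
    return real_memory_mib if requested_mib == 0 else requested_mib

def exceeds_limit(requested_mib, real_memory_mib, cpus, max_mem_per_cpu_mib):
    # Compare the effective per-node request against cpus * MaxMemPerCPU.
    return effective_mem_mib(requested_mib, real_memory_mib) > cpus * max_mem_per_cpu_mib

# On a RealMemory=256000 node with 24 CPUs and MaxMemPerCPU=10500:
print(exceeds_limit(0, 256000, 24, 10500))           # True: --mem=0 should also trip the limit
print(exceeds_limit(250 * 1024, 256000, 24, 10500))  # True: --mem=250G is already rejected today
```

Under this model, --mem=0 and --mem=250G resolve to the same 256000 MiB request and should receive the same treatment.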
(In reply to Alejandro Sanchez from comment #26)
...
> Now, if we request more --mem than the maximum in the partition, for example
> --mem=250G (that's 250*1024=256000 MiB, which is higher than the limit of
> 246.09375 GiB or 252000 MiB):
>
> alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is
> not available

This is NOT what I see with my xeon24_test partition: the job is actually accepted into the queue:

$ sbatch -p xeon24_test --reservation=Test1 --mem=250G --wrap="true"
Submitted batch job 495096

> and these messages are logged to ctld.log:
>
> slurmctld: debug: Setting job's pn_min_cpus to 25 due to memory limit
> slurmctld: _build_node_list: No nodes satisfy job 20003 requirements in
> partition xeon24
> slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not
> available
>
> Note this one:
>
> slurmctld: debug: Setting job's pn_min_cpus to 25 due to memory limit

Instead I get a different message in slurmctld.log:

# grep 495096 /var/log/slurm/slurmctld.log
[2018-04-17T11:24:48.151] _build_node_list: No nodes satisfy job 495096 requirements in partition xeon24_test
[2018-04-17T11:24:48.151] _slurm_rpc_submit_batch_job: JobId=495096 InitPrio=128373 usec=822

Please note that we have production partitions xeon24 (256 GB RAM) and xeon24_512 (512 GB RAM) which could accommodate the job, so perhaps that's the reason for the difference in behavior? However, my job should certainly not be considered for those partitions by the scheduler!
The job has now been assigned MinCPUsNode=37 (no node with 37 CPUs exists anywhere in our cluster):

$ scontrol show job 495096
JobId=495096 JobName=wrap
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=128390 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2018-04-17T11:24:48 EligibleTime=2018-04-17T11:24:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-17T11:31:13
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=250G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=37 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-495096.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-495096.out
   Power=

> This automatic adjustment of Slurm can be found in the logic in the function:
> src/slurmctld/job_mgr.c _valid_pn_min_mem(). The reasoning behind this is
> due to these old commits:
>
> https://github.com/SchedMD/slurm/commit/b6ce37103b6573
> https://github.com/SchedMD/slurm/commit/0fbdb9e2bf5faf
>
> And what the adjustments try to do is "ok the memory you request is higher
> than the maxmempercpu, I'm gonna try to modify the request so that more cpus
> are requested and then since you divide by more cpus perhaps the limit can
> now be satisfied".
>
> It's arguable whether these automatic adjustments should happen underneath to
> try to make the job request fit the limits. Actually we're considering
> removing them in 18.08 (since it'll be a change in functionality).
This behavior is definitely confusing in our examples! IMHO, it's better to get a clear job rejection instead of automatic adjustments which apparently are not checked against the actually available partitions. This leads to confusingly blocked jobs which require expert knowledge to sort out.

> But, for now Slurm works as expected for the shown requests.

The error message to the user, "sbatch: error: CPU count per node can not be satisfied", which you see would be great, but unfortunately this is not what we observe :-(
I think there are 3 confirmed issues derived from this bug that we need to address:

1. Job rejection behavior differs depending on whether the job requests a reservation or not. If it requests a reservation, the job is accepted at submission time and left PD (Reservation). If the job doesn't request a reservation, it is rejected at submission time with these user messages:

alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

In both cases (reservation or not), EnforcePartLimits=YES and the job memory is higher than the partition limit, so in both cases the job should be rejected at submission time. Can you confirm you get the user errors if the node isn't in a reservation and the job doesn't request the reservation, but the memory is higher than the limit?

2. The automatic adjustments Slurm does, increasing pn_min_cpus when the job memory is higher than the limit, are arguably removable. Especially since the CPU increase goes not only above MaxCPUsPerNode but above the number of CPUs available on any of the job partitions' nodes.

3. --mem[-per-cpu]=0 is a special case setting the memory to the smallest memory size of any of the job's allocated nodes. The problem is that this happens after memory limit validation, thus bypassing the limits set in the cluster/partition.

So there's a couple of things to be fixed here.
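Issue 1 can be summarized as a small decision sketch (illustrative only, not Slurm's code): with EnforcePartLimits=YES, whether the job targets a reservation should make no difference to the submission-time check, yet the observed behavior is that the reservation path skips the rejection.

```python
def should_reject_at_submit(enforce_part_limits, mem_mib,
                            part_max_mem_mib, uses_reservation):
    """Intended behavior: with EnforcePartLimits=YES, a job whose memory
    request exceeds the partition limit is rejected at submission time.
    Note that uses_reservation is deliberately ignored: the reservation
    should not matter, which is exactly what the observed bug violates."""
    if enforce_part_limits and mem_mib > part_max_mem_mib:
        return True
    return False

# --mem=250G (256000 MiB) vs. the 252000 MiB partition limit:
print(should_reject_at_submit(True, 256000, 252000, uses_reservation=False))
# -> True (and Slurm does reject here)
print(should_reject_at_submit(True, 256000, 252000, uses_reservation=True))
# -> True (but Slurm accepted the job and left it PD)
```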
(In reply to Alejandro Sanchez from comment #31)
> I think there are 3 confirmed issues derived from this bug we need to
> address:
>
> 1. Job rejection behavior differs depending on job requesting reservation or
> not. If it requests a reservation, job is accepted at submission time and
> left PD (Reservation). If job doesn't request reservation then it is
> rejected at submission time with the user messages:
>
> alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is
> not available
>
> Both cases (reservation or not), EnforcePartLimits=YES and job mem higher
> than partition limit. So in both cases job should be rejected at submission
> time.
>
> Can you confirm you get the user errors if the node isn't in a reservation
> and the job doesn't request the reservation but the memory is higher than
> the limit?

Confirmed. I deleted the ReservationName=Test1 and submitted a job:

$ sbatch -p xeon24_test --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

> 2. The automatic adjustments Slurm does increasing pn_min_cpus when the job
> mem is higher than the limit are arguably removable. Especially since the
> CPU increase goes not only above MaxCPUsPerNode but above the number of CPUs
> available on any of the job partitions' nodes.
>
> 3. --mem[-per-cpu]=0 is a special case setting the memory to the smallest
> memory size of any of the job allocated nodes. The problem is that this
> happens after memory limit validation thus bypassing the limits set in the
> cluster/partition.
>
> So there's a couple of things to be fixed here.

That's a fine analysis! I hope the bug fixes can make it into 17.11.6 (or later). Are there any further tests I need to do?
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #32)
> That's a fine analysis! I hope the bug fixes can make it into 17.11.6 (or
> later).

I think 1) and 3) can most probably be fixed for 17.11. Not sure if .6 or .7, since we planned on releasing .6 this week. 2) will most probably land in 18.08, since it's a change in functionality. I don't want to promise anything though, since I first want to have the fixes ready and discuss internally in which versions we land them.

> Are there any further tests which I need to do?

Not for now, thanks for your follow-up comments and feedback; they helped me identify the issues.
Ole, just to update you: there's a draft patch we're discussing internally that seems to alleviate the issues here. We still need to refine it, though, and discuss where to land it, or whether we should split it into different parts that could land in different versions.
Hi Ole, a patch has finally been committed:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

and is available starting from 17.11.7. You can apply it at your earliest convenience by appending ".patch" to the github URL. This fix seems to address the three issues we identified here, plus another one involving the same code paths in bug 4895. We're gonna go ahead and mark this as resolved/fixed. Please reopen if there's any new issue you find after applying it. Thanks.