Ticket 4976 - After upgrade to 17.11 some jobs get a JobState=PENDING Reason=BadConstraints
Summary: After upgrade to 17.11 some jobs get a JobState=PENDING Reason=BadConstraints
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.5
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Alejandro Sanchez
 
Reported: 2018-03-23 02:57 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2024-05-01 06:46 MDT
CC List: 2 users

Site: DTU Physics
Version Fixed: 17.11.7 18.08.0-pre2


Attachments
slurm.conf (4.90 KB, text/plain) - 2018-04-03 06:24 MDT, Ole.H.Nielsen@fysik.dtu.dk
gres.conf (62 bytes, text/plain) - 2018-04-03 06:24 MDT, Ole.H.Nielsen@fysik.dtu.dk
cgroup.conf (106 bytes, text/plain) - 2018-04-03 06:24 MDT, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2018-03-23 02:57:35 MDT
Yesterday we successfully upgraded Slurm from 17.02.9 to 17.11.5.  However, some 100+ jobs in the queue ended up in a strange state, JobState=PENDING Reason=BadConstraints, after the upgrade.  Most of these incorrect BadConstraints states were cleared by holding and then releasing the jobs.  However, we have a few jobs that are not fixed by hold+release.  One example is:

JobId=460138 JobName=TEST-BaBr2-CdI2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=-132215 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Thu 07:52:24 EligibleTime=Thu 07:52:24
   StartTime=Fri 09:50:32 EndTime=Sun 10:50:32 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Fri 09:50:19
   Partition=xeon24 AllocNode:Sid=sylg:34837
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm
   StdErr=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm/slurm-460138.out
   StdIn=/dev/null
   StdOut=/home/niflheim2/tdeilm/2D_ht/C2DM-2019/BaBr2-CdI2/nm/slurm-460138.out
   Power=

The user has cancelled and resubmitted these jobs, but it would be good to eliminate this bogus BadConstraints state if possible.
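For reference, the hold-and-release workaround that cleared most of the other affected jobs amounts to the following (job ID is a placeholder):

scontrol hold <jobid>
scontrol release <jobid>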

Thanks,
Ole
Comment 1 Alejandro Sanchez 2018-03-26 03:55:44 MDT
Hi. In order to analyze this issue we would need:

- slurm.conf and potentially gres.conf/cgroup.conf as well.
- Example of job submission request (command line and script if batch job).
- Logs from slurmctld.log related to the job example.

Thanks.
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2018-03-26 03:57:34 MDT
I'm out of the office until April 3.
I am not in the office, back again on 3 April.

Best regards / Venlig hilsen,
Ole Holm Nielsen
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:24:04 MDT
Created attachment 6522 [details]
slurm.conf
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:24:27 MDT
Created attachment 6523 [details]
gres.conf
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:24:49 MDT
Created attachment 6524 [details]
cgroup.conf
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:37:23 MDT
(In reply to Alejandro Sanchez from comment #1)
> - slurm.conf and potentially gres.conf/cgroup.conf as well.

Files attached.

> - Example of job submission request (command line and script if batch job).

This user is still on Easter holiday, I'll have to get his example later.

> - Logs from slurmctld.log related to the job example.

# zcat slurmctld.log-20180325.gz | grep 460138 
[2018-03-22T07:52:24.892] _slurm_rpc_submit_batch_job JobId=460138 usec=1544
[2018-03-22T07:53:09.467] _slurm_rpc_top_job for 460138 usec=651
[2018-03-22T07:53:53.789] _slurm_rpc_top_job for 460138 usec=648
[2018-03-22T13:18:43.175] Recovered JobID=460138 State=0x0 NodeCnt=0 Assoc=216
[2018-03-22T14:28:48.241] Recovered JobID=460138 State=0x0 NodeCnt=0 Assoc=216
[2018-03-22T16:17:48.278] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-22T16:17:48.278] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-22T16:32:05.630] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=304
[2018-03-22T16:32:10.478] error: sched: Attempt to modify priority for job 460138
[2018-03-22T16:32:10.478] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-22T16:32:48.102] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=358
[2018-03-22T16:33:04.475] error: sched: Attempt to modify priority for job 460138
[2018-03-22T16:33:04.475] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-23T09:44:59.620] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:44:59.620] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=387
[2018-03-23T09:45:00.192] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:45:00.192] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:49:55.511] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=265
[2018-03-23T09:49:55.523] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:49:55.523] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=333
[2018-03-23T09:49:56.918] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:49:56.918] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:50:19.227] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=336
[2018-03-23T09:50:19.242] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:50:19.242] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=2334
[2018-03-23T09:50:19.956] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:50:19.956] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:50:57.644] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=318
[2018-03-23T09:50:58.659] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:50:58.659] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=354
[2018-03-23T09:51:01.166] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:51:01.166] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T09:53:04.617] _slurm_rpc_update_job: complete JobId=460138 uid=221341 usec=349
[2018-03-23T09:53:33.684] error: sched: Attempt to modify priority for job 460138
[2018-03-23T09:53:33.685] _slurm_rpc_update_job: JobId=460138 uid=221341: Access/permission denied
[2018-03-23T09:58:38.867] sched: _release_job_rec: release hold on job_id 460138 by uid 0
[2018-03-23T09:58:38.868] _slurm_rpc_update_job: complete JobId=460138 uid=0 usec=401
[2018-03-23T09:58:39.359] _build_node_list: No nodes satisfy job 460138 requirements in partition xeon24
[2018-03-23T09:58:39.359] sched: schedule: JobID=460138 State=0x0 NodeCnt=0 non-runnable: Requested node configuration is not available
[2018-03-23T10:02:15.811] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 460138 uid 221341
[2018-03-23T10:03:06.041] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 460138 uid 221341
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:48:11 MDT
We now have a number of jobs stuck in the Reason=BadConstraints state, both jobs submitted under 17.02.9 and jobs submitted under 17.11.5 (upgraded on March 22).

An example of this is:

# scontrol show job 470506
JobId=470506 JobName=c2dm.pdos
   UserId=mogje(22231) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:1
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2018-03-26T09:58:06 EligibleTime=2018-03-26T09:58:06
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-01T03:52:34
   Partition=xeon24 AllocNode:Sid=sylg:45931
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x178
   BatchHost=x178
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=500G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm
   StdErr=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.err
   StdIn=/dev/null
   StdOut=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-c2dm.pdos-470506.out
   Power=

Please note that this job has apparently been assigned to NodeList=x178, which is currently running another job.  The slurmctld log files contain these lines for job 470506:

[2018-03-26T09:58:06.307] _slurm_rpc_allsubmit_batch_job: JobId=470506 InitPrio=9156 usec=1273
[2018-03-28T08:34:31.093] backfill: Started JobID=470506 in xeon24 on x178
[2018-03-28T09:12:28.225] Batch JobId=470506 missing from node 0 (not found BatchStartTime after startup), Requeuing job
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1 WTERMSIG 126
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1 cancelled by node failure
[2018-03-28T09:12:28.225] _job_complete: requeue JobID=470506 State=0x8000 NodeCnt=1 due to node failure
[2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x8000 NodeCnt=1 done
[2018-04-01T03:52:34.747] _build_node_list: No nodes satisfy job 470506 requirements in partition xeon24
[2018-04-01T03:52:34.747] sched: schedule: JobID=470506 State=0x0 NodeCnt=1 non-runnable: Requested node configuration is not available

There was a power hiccup on 2018-03-28T09:12 causing some nodes to go offline (reason unknown; it may have been the Omni-Path fabric being unavailable for some seconds).  However, the nodes were returned to the running state shortly afterwards.
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 06:57:51 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> We now have a number of jobs caught up in the Reason=BadConstraints mode,
> both jobs submitted under 17.02.9 as well as 17.11.5 (upgraded on March 22).
> 
> An example of this is:
> 
> # scontrol show job 470506
> JobId=470506 JobName=c2dm.pdos
>    UserId=mogje(22231) GroupId=camdvip(1250) MCS_label=N/A
>    Priority=0 Nice=0 Account=camdvip QOS=normal
>    JobState=PENDING Reason=BadConstraints Dependency=(null)
>    Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:1
>    RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
>    SubmitTime=2018-03-26T09:58:06 EligibleTime=2018-03-26T09:58:06
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2018-04-01T03:52:34
>    Partition=xeon24 AllocNode:Sid=sylg:45931
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=x178
>    BatchHost=x178
>    NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=24,mem=500G,node=1,billing=24
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=(null)
>    WorkDir=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm
>   
> StdErr=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-
> c2dm.pdos-470506.err
>    StdIn=/dev/null
>   
> StdOut=/home/niflheim/mogje/chalcohalides_workflow/Ti4Cl4O4-HfSCl/nm/slurm-
> c2dm.pdos-470506.out
>    Power=
> 
> Please note that this job has apparently been assigned to NodeList=x178,
> which is currently running another job.  The slurmctld log files have these
> lines on job 470506:
> 
> [2018-03-26T09:58:06.307] _slurm_rpc_allsubmit_batch_job: JobId=470506
> InitPrio=9156 usec=1273
> [2018-03-28T08:34:31.093] backfill: Started JobID=470506 in xeon24 on x178
> [2018-03-28T09:12:28.225] Batch JobId=470506 missing from node 0 (not found
> BatchStartTime after startup), Requeuing job
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1
> WTERMSIG 126
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x1 NodeCnt=1
> cancelled by node failure
> [2018-03-28T09:12:28.225] _job_complete: requeue JobID=470506 State=0x8000
> NodeCnt=1 due to node failure
> [2018-03-28T09:12:28.225] _job_complete: JobID=470506 State=0x8000 NodeCnt=1
> done
> [2018-04-01T03:52:34.747] _build_node_list: No nodes satisfy job 470506
> requirements in partition xeon24
> [2018-04-01T03:52:34.747] sched: schedule: JobID=470506 State=0x0 NodeCnt=1
> non-runnable: Requested node configuration is not available
> 
> There was a power hiccup on 2018-03-28T09:12 causing some nodes to go
> offline (reason unknown; it may have been the Omni-Path fabric being unavailable
> for some seconds).  However, the nodes were returned to the running state
> shortly afterwards.

I did a hold-then-release of JobID=470506, but the job is still in the BadConstraints state.  The slurmctld log says:

[2018-04-03T14:51:43.258] _slurm_rpc_update_job: complete JobId=470506 uid=0 usec=320
[2018-04-03T14:51:44.274] sched: _release_job_rec: release hold on job_id 470506 by uid 0
[2018-04-03T14:51:44.274] _slurm_rpc_update_job: complete JobId=470506 uid=0 usec=378
[2018-04-03T14:51:46.067] _build_node_list: No nodes satisfy job 470506 requirements in partition xeon24
[2018-04-03T14:51:46.067] sched: schedule: JobID=470506 State=0x0 NodeCnt=1 non-runnable: Requested node configuration is not available

The message "Requested node configuration is not available" appears to be completely wrong.
Comment 9 Alejandro Sanchez 2018-04-03 07:34:27 MDT
PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO
NodeName=x[001-168],x[181-192] Weight=10424 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752 Feature=xeon2650v4,opa,xeon24

Isn't DefMemPerCPU=8000 too high?
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 08:31:16 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #6)
> (In reply to Alejandro Sanchez from comment #1)
> > - Example of job submission request (command line and script if batch job).

The submit command is reported by the user as:

sbatch -J TEST --mail-type=FAIL --partition=xeon24 -N 21 -n 504 --time=48:00:00 --mem=0 run.py
Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 08:32:46 MDT
(In reply to Alejandro Sanchez from comment #9)
> PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00
> DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO
> NodeName=x[001-168],x[181-192] Weight=10424 Boards=1 SocketsPerBoard=2
> CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752
> Feature=xeon2650v4,opa,xeon24
> 
> Isn't DefMemPerCPU=8000 too high?

The nodes in partition xeon24 have 256 GB of RAM and 24 CPU cores.  DefMemPerCPU=8000 would add up to only 192 GB, right?
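For reference, 24 CPUs x 8000 = 192000 MB per node, or roughly 187.5 GiB if the values are counted in binary units as elsewhere in this ticket; either way it is well below the nodes' RealMemory=256000.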
Comment 12 Alejandro Sanchez 2018-04-03 09:15:50 MDT
Yeah sorry, the DefMemPerCPU isn't incorrect, my fault.

I find this strange though:

JobId=460138
...
Partition=xeon24 AllocNode:Sid=sylg:34837
...
NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   ^^

> # scontrol show job 470506
> JobId=470506 JobName=c2dm.pdos
...
>    Partition=xeon24 AllocNode:Sid=sylg:45931
...
>    NodeList=x178
>    BatchHost=x178
>    NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=24,mem=500G,node=1,billing=24
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
     ^^             ^^

Is --mincpus requested/updated at some point with a value > 24?
Comment 13 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 09:30:33 MDT
(In reply to Alejandro Sanchez from comment #12)

Thanks for identifying strange parameters:

> I find this strange though:
> 
> JobId=460138
> ...
> Partition=xeon24 AllocNode:Sid=sylg:34837
> ...
> NumNodes=40-40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=960,node=40
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
>    ^^

Strange indeed!  The node has 256 GB of RAM, and slurm.conf configures only MaxMemPerCPU=10500.  The user requests 250 GB, equivalent to about 10417 MB per CPU across 24 CPUs, which is less than 10500.

Could it be that the scheduler miscalculates 250 GB as exceeding MaxMemPerCPU=10500 and therefore requiring 25 cpus (out of 24 available cpus)? 

This might be a genuine bug!
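For what it's worth, the numbers fit if the request is counted in binary units (an assumption, since MB vs. MiB is exactly the ambiguity here): 250G read as 250 x 1024 = 256000 MiB gives 256000 / 10500 = 24.4, which rounds up to 25 CPUs and matches the MinCPUsNode=25 above.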

> > # scontrol show job 470506
> > JobId=470506 JobName=c2dm.pdos
> ...
> >    Partition=xeon24 AllocNode:Sid=sylg:45931
> ...
> >    NodeList=x178
> >    BatchHost=x178
> >    NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> >    TRES=cpu=24,mem=500G,node=1,billing=24
> >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >    MinCPUsNode=49 MinMemoryNode=500G MinTmpDiskNode=0
>      ^^             ^^
> 
> Is --mincpus requested/updated at some point with a value > 24?

No updating has been done.  But I now see that the user submitted to the wrong partition: xeon24 has 256 GB nodes, yet the job asks for 500 GB, which requires 48 CPUs (and may be miscalculated, as above, as 49 CPUs) in this partition.  I will inform the user of the error.
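The same binary-unit reading would also explain this job: 500G read as 500 x 1024 = 512000 MiB gives 512000 / 10500 = 48.8, which rounds up to the MinCPUsNode=49 shown for job 470506.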
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2018-04-03 09:35:43 MDT
We have many jobs from another user that apparently show the same bug in the scheduler's calculation of memory requirements:

# scontrol show job 447332
JobId=447332 JobName=c2dm.strains
   UserId=mikst(28594) GroupId=camdvip(1250) MCS_label=N/A
   Priority=0 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2018-03-12T09:02:34 EligibleTime=2018-03-12T09:02:34
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-03T14:51:00
   Partition=xeon24 AllocNode:Sid=thul:4469
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=25 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm
   StdErr=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm/slurm-c2dm.strains-447332.err
   StdIn=/dev/null
   StdOut=/home/niflheim/mikst/C2DM-2018/MX3-1T/Zr2I6-1T/fm/slurm-c2dm.strains-447332.out
   Power=

The user asks for 250 GB per node, which is just below the 252 GB total implied by the configured MaxMemPerCPU=10500 (24 x 10500 MB).

Or might there be some confusion between specifying memory sizes in GiB versus GB units?
Comment 15 Ole.H.Nielsen@fysik.dtu.dk 2018-04-04 02:10:16 MDT
A follow-up question: Our partition xeon24 has the default parameter MaxCPUsPerNode=UNLIMITED:

# scontrol show partition xeon24
PartitionName=xeon24
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=x[001-192]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4608 TotalNodes=192 SelectTypeParameters=NONE
   DefMemPerCPU=10000 MaxMemPerCPU=10750

(Note that I increased the DefMemPerCPU and MaxMemPerCPU values slightly today.)

The slurm.conf man page only gives a GPU-related usage example for MaxCPUsPerNode.  I'm thinking it would be a good idea to set MaxCPUsPerNode=24 for 24-core nodes (and similarly for other core counts) so that no job can ever end up requesting more than 24 cores due to a bad memory request.

Question: Can you confirm that it's advisable, or even best practice, to set MaxCPUsPerNode equal to the number of CPU cores in the node?

What would happen to a job submission where the requested cores per node and/or memory size exceeds the limits?  Hopefully the job would be rejected at submission, or fail shortly thereafter?

If we want to reserve some cores for GPUs or other tasks, MaxCPUsPerNode could of course be less than the number of physical cores.
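A minimal sketch of what such a partition definition could look like, reusing the current xeon24 values (untested; MaxCPUsPerNode=24 is the only addition):

PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=10000 MaxMemPerCPU=10750 MaxCPUsPerNode=24 State=UP OverSubscribe=NO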
Comment 16 Alejandro Sanchez 2018-04-09 09:47:15 MDT
Do you have another example job request with the command line options for submission, with just 2 nodes and with the partition config as described in your last comment? I'm still unable to reproduce the MinCPUsNode=25 situation.

All the nodes in that partition are homogeneous in terms of sockets/cores and memory, right?

Let's see if your job submission reproducer lets me reproduce this locally too and narrow the problem down.
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2018-04-13 03:01:38 MDT
(In reply to Alejandro Sanchez from comment #16)
> Do you have another example job request with the command line options for
> submission, with just 2 nodes and with the partition config as described in
> your last comment? I'm still unable to reproduce the MinCPUsNode=25
> situation.
> 
> Are all the nodes in that partition homogeneous in terms of sockets/cores
> and memory, right?
> 
> Let's see if with your job submission reproducer I can locally reproduce too
> and narrow the problem down.

I'm setting up tests to make this reproducible, and I have further interesting observations.

Firstly, I reserved node x001 to allow only user ohni (myself):

scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 nodes=x001 user=ohni

Secondly, I defined a new xeon24_test partition containing only node x001, with a ridiculously low DefMemPerCPU=6000 MaxMemPerCPU=7000 for testing purposes only.  Our 24-core nodes are now spread over these 3 partitions:

PartitionName=xeon24 Nodes=x[001-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=10000 MaxMemPerCPU=10750 State=UP OverSubscribe=NO
# Nodes x169-x180 have big memory (512G):
PartitionName=xeon24_512 Nodes=x[169-180] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=20000 MaxMemPerCPU=21500 State=UP OverSubscribe=NO PriorityJobFactor=5000
# Test partition
PartitionName=xeon24_test Nodes=x001 DefaultTime=1:00:00 MaxTime=2:00:00 DefMemPerCPU=6000 MaxMemPerCPU=7000 State=UP OverSubscribe=NO PriorityJobFactor=5000

As soon as I did "scontrol reconfigure", some jobs got into a bad state as reported by squeue:

    438338    xeon24   sbatch   tdeilm PD       0:00     20 (ReqNodeNotAvail, UnavailableNodes:b072,d067,i042)

The UnavailableNodes belong to different partitions (xeon8 and xeon16), and they are down or draining due to hardware errors.  Jobs in the xeon24 partition would *never* be schedulable on nodes in the xeon8 or xeon16 partitions!

Thirdly, as a quick workaround I removed node x001 from the standard xeon24 partition (and ran "scontrol reconfigure").  The ReqNodeNotAvail problem disappeared once the normal xeon24 partition no longer contained node x001:

PartitionName=xeon24 Nodes=x[002-192] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=10000 MaxMemPerCPU=10750 State=UP OverSubscribe=NO

Summary:

I created a reservation of node x001 exclusively for myself.  Then I created a new partition (xeon24_test), purposely with too small a value for MaxMemPerCPU.

The scheduler apparently considers running job 438338 (user tdeilm) on the reserved node x001, decides that this node's memory is too small, and hence blocks the job with a ReqNodeNotAvail status.  The scheduler also points to the totally irrelevant UnavailableNodes:b072,d067,i042.
Comment 18 Ole.H.Nielsen@fysik.dtu.dk 2018-04-13 03:44:19 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #17)
With the partition xeon24_test defined in comment #17 I've submitted this test script:

#!/bin/bash -x
#SBATCH --reservation=Test1
#SBATCH --nodes=1
#SBATCH --mem=250G
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:00:20
#SBATCH --partition=xeon24_test
#SBATCH --mail-type ALL
ulimit -a

Due to the ridiculously low DefMemPerCPU=6000 MaxMemPerCPU=7000 for testing purposes, the job gets into a PENDING state:

JobId=488578 JobName=limit_test.sh
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=127369 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-04-13T11:22:36 EligibleTime=2018-04-13T11:22:36
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-13T11:25:52
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=250G,node=1
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=37 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/ohni/limit_test.sh
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-488578.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-488578.out
   Power=

Please note the MinCPUsNode=37!  When the job asks for 250 GB of memory and only 24*7000 ~= 164 GB is possible, the scheduler apparently concludes that the number of CPU cores required is 250*1024/7000 = 36.57, rounded up to 37 cores.

Question: Is this the expected behavior of the Slurm scheduler?  If so, perhaps we really need to change the partition's default MaxCPUsPerNode=UNLIMITED to the number of physical cores in the node, as I suggested in comment #15.


If I replace the memory requirement in the script by a zero:

#SBATCH --mem=0

then such a job runs just fine with MinCPUsNode=24:

JobId=488670 JobName=limit_test.sh
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=127364 Nice=0 Account=camdvip QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-04-13T11:31:31 EligibleTime=2018-04-13T11:31:31
   StartTime=2018-04-13T11:32:08 EndTime=2018-04-13T11:32:10 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-13T11:32:08
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=250G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/ohni/limit_test.sh
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-488670.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-488670.out
   Power=

Please note the TRES=cpu=24,mem=250G.  How could the scheduler ever arrive at mem=250G?  MaxMemPerCPU=7000 corresponds to only about 164 GB of node memory.
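A possible explanation (assuming --mem=0 is only expanded to the node's full memory after the per-CPU limit check): RealMemory=256000 MiB / 1024 = 250 GiB, which is exactly the mem=250G shown in TRES, so the figure appears to come from the node's RealMemory rather than from MaxMemPerCPU.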
Comment 19 Alejandro Sanchez 2018-04-13 10:52:50 MDT
I can finally reproduce this now. There's definitely something wrong with increasing the CPU count due to memory. Thanks for the info.

Regarding comment 17: changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons. All slurmd daemons must know each node in the system to forward messages in support of hierarchical communications.
Comment 22 Ole.H.Nielsen@fysik.dtu.dk 2018-04-13 13:02:33 MDT
(In reply to Alejandro Sanchez from comment #19)
> I can finally reproduce now. There's definitely something wrong increasing
> the cpus due to memory. Thanks for the info.

I'm delighted that you can reproduce this!  Looking forward to a patch.

> Regarding comment 17, Changes in node configuration (e.g.  adding  nodes, 
> changing their  processor  count, etc.) require restarting both the
> slurmctld daemon and the slurmd daemons.  All slurmd daemons must know each
> node in the system to forward messages in support of hierarchical
> communications.

I know this, but I didn't add any nodes or change processor counts, etc.  Which of the changes I made do you think requires restarting slurmctld and possibly slurmd?

In comment #18 a strange mem=250G for the node appears without any obvious cause.  How could this memory size ever come about when I've configured the partition for 164 GB?
Comment 26 Alejandro Sanchez 2018-04-16 09:06:58 MDT
After doing some more tests there are a few things we can discuss. I'm using your original configuration:

NodeName=x[001-002] NodeHostname=ibiza Weight=10424 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=256000 TmpDisk=32752 Feature=xeon2650v4,opa,xeon24 Port=61201-61202
PartitionName=xeon24 Nodes=x[001-002] DefaultTime=12:00:00 MaxTime=50:00:00 DefMemPerCPU=8000 MaxMemPerCPU=10500 State=UP OverSubscribe=NO

No reservations.

So the nodes have relevant info:
RealMemory=256000
CPUTot=24

And the partition:
DefMemPerCPU=8000 MaxMemPerCPU=10500

So with these resources and limits, a job can request theoretically:
MaxMemPerCPU=10500 * CPUTot=24 = 252000MiB / 1024 = 246.09375GiB.

alex@ibiza:~/t$ sbatch -p xeon24 --mem=246G --wrap="true"
Submitted batch job 20002
alex@ibiza:~/t$ scontrol show job 20002
JobId=20002 JobName=wrap
   UserId=alex(1000) GroupId=alex(1000) MCS_label=N/A
   Priority=110260 Nice=0 Account=acct1 QOS=normal WCKey=*
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2018-04-16T16:40:08 EligibleTime=2018-04-16T16:40:08
   StartTime=2018-04-16T16:40:08 EndTime=2018-04-16T16:40:09 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-16T16:40:08
   Partition=xeon24 AllocNode:Sid=ibiza:30893
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=24 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=246G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=246G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/alex/t
   StdErr=/home/alex/t/slurm-20002.out
   StdIn=/dev/null
   StdOut=/home/alex/t/slurm-20002.out
   Power=
   

alex@ibiza:~/t$

OK, as expected job completes.

Now, if we request more --mem than the maximum in the partition, for example --mem=250G (that's 250*1024=256000MiB, which is higher than the limit of 246.09375GiB or 252000MiB):

alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
alex@ibiza:~/t$

and these messages are logged to ctld.log:

slurmctld: debug:  Setting job's pn_min_cpus to 25 due to memory limit
slurmctld: _build_node_list: No nodes satisfy job 20003 requirements in partition xeon24
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available

Note this one:

slurmctld: debug:  Setting job's pn_min_cpus to 25 due to memory limit

This automatic adjustment can be found in the logic of the function _valid_pn_min_mem() in src/slurmctld/job_mgr.c. The reasoning behind it comes from these old commits:

https://github.com/SchedMD/slurm/commit/b6ce37103b6573
https://github.com/SchedMD/slurm/commit/0fbdb9e2bf5faf

And what the adjustments try to do is "ok the memory you request is higher than the maxmempercpu, I'm gonna try to modify the request so that more cpus are requested and then since you divide by more cpus perhaps the limit can now be satisfied".
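In effect (a paraphrase of the logic, not the exact code), the adjustment raises the per-node CPU requirement to roughly:

pn_min_cpus = ceil(requested_mem_per_node_in_MiB / MaxMemPerCPU)

so a --mem=250G request against MaxMemPerCPU=10500 becomes ceil(256000 / 10500) = 25, the value logged above.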

It's arguable whether these automatic adjustments should happen under the hood to try to make the job request fit the limits. We're actually considering removing them in 18.08 (since it would be a change in functionality).

But for now, Slurm works as expected for the requests shown above.

Where I think there's a bug is when you request --mem=0:

alex@ibiza:~/t$ sbatch -N1 -p xeon24 --mem=0G --wrap="true"
Submitted batch job 20006
alex@ibiza:~/t$ scontrol show job 20006
JobId=20006 JobName=wrap
   UserId=alex(1000) GroupId=alex(1000) MCS_label=N/A
   Priority=87362 Nice=0 Account=acct1 QOS=normal WCKey=*
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2018-04-16T16:58:07 EligibleTime=2018-04-16T16:58:07
   StartTime=2018-04-16T16:58:07 EndTime=2018-04-16T16:58:07 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-16T16:58:07
   Partition=xeon24 AllocNode:Sid=ibiza:30893
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=250G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/alex/t
   StdErr=/home/alex/t/slurm-20006.out
   StdIn=/dev/null
   StdOut=/home/alex/t/slurm-20006.out
   Power=
   

alex@ibiza:~/t$

From the --mem sbatch man page:

NOTE:  A  memory  size specification of zero is treated as a special case and grants the job access to all of the memory on each node.

So --mem=0 should be treated as if the job requested all memory on the node. Since RealMemory=256000, --mem=0 should be equivalent to --mem=250G (256000/1024). But in practice --mem=0 is accepted and incorrectly bypasses the MaxMemPerCPU limit (this is a bug), while --mem=250G is correctly rejected.

So I'm going to investigate the --mem=0 case, but I hope the rest makes sense to you. Otherwise please ask further. Thanks.
Comment 28 Ole.H.Nielsen@fysik.dtu.dk 2018-04-17 03:41:00 MDT
(In reply to Alejandro Sanchez from comment #26)
...
> Now, if we request more --mem than the maximum in the partition, for example
> --mem=250G (that's 250*1024=25600MiB, and is higher than the limit of
> 246.09375GiB or 252000MiB):
> 
> alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is
> not available

This is NOT what I see with my xeon24_test partition:  The job is actually accepted into the queue:

$ sbatch -p xeon24_test --reservation=Test1 --mem=250G --wrap="true"
Submitted batch job 495096

> and these messages are logged to ctld.log:
> 
> slurmctld: debug:  Setting job's pn_min_cpus to 25 due to memory limit
> slurmctld: _build_node_list: No nodes satisfy job 20003 requirements in
> partition xeon24
> slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not
> available
> 
> Note this one:
> 
> slurmctld: debug:  Setting job's pn_min_cpus to 25 due to memory limit

Instead I get a different message in slurmctld.log:

# grep 495096 /var/log/slurm/slurmctld.log
[2018-04-17T11:24:48.151] _build_node_list: No nodes satisfy job 495096 requirements in partition xeon24_test
[2018-04-17T11:24:48.151] _slurm_rpc_submit_batch_job: JobId=495096 InitPrio=128373 usec=822

Please note that we have production partitions xeon24 (256 GB RAM) and xeon24_512 (512 GB RAM) which could accommodate the job, so perhaps that's the reason for the difference in behavior?  However, my job should certainly not be considered for those partitions by the scheduler!

The job has now been assigned MinCPUsNode=37 (a CPU count that doesn't exist anywhere in our cluster):

$ scontrol show job 495096
JobId=495096 JobName=wrap
   UserId=ohni(1775) GroupId=camdvip(1250) MCS_label=N/A
   Priority=128390 Nice=0 Account=camdvip QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2018-04-17T11:24:48 EligibleTime=2018-04-17T11:24:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-17T11:31:13
   Partition=xeon24_test AllocNode:Sid=sylg:19174
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=250G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=37 MinMemoryNode=250G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=Test1
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/ohni
   StdErr=/home/niflheim/ohni/slurm-495096.out
   StdIn=/dev/null
   StdOut=/home/niflheim/ohni/slurm-495096.out
   Power=

> This automatic adjustment of Slurm can be found in the logic in the function:
> src/slurmctld/job_mgr.c _valid_pn_min_mem(). The reasoning behind this is
> due to these old commits:
> 
> https://github.com/SchedMD/slurm/commit/b6ce37103b6573
> https://github.com/SchedMD/slurm/commit/0fbdb9e2bf5faf
> 
> And what the adjustments try to do is "ok the memory you request is higher
> than the maxmempercpu, I'm gonna try to modify the request so that more cpus
> are requested and then since you divide by more cpus perhaps the limit can
> now be satisfied".
> 
> It's arguable wether these automatic adjustments should happen underneath to
> try to make the job request fit the limits. Actually we're considering
> removing them in 18.08 (since it'll be a change in functionality).

This behavior is definitely confusing in our examples!  IMHO, it's better to get a clear job rejection instead of automatic adjustments which apparently are not checked against the actually available partitions.  This leads to confusingly blocked jobs which require expert knowledge to sort out.

> But, for now Slurm works as expected for the shown requests.

The error message you see, "sbatch: error: CPU count per node can not be satisfied", would be great for the user, but unfortunately this is not what we observe :-(
Comment 31 Alejandro Sanchez 2018-04-17 04:38:20 MDT
I think there are 3 confirmed issues derived from this bug we need to address:

1. Job rejection behavior differs depending on whether the job requests a reservation or not. If it requests a reservation, the job is accepted at submission time and left PD (Reservation). If the job doesn't request a reservation, it is rejected at submission time with these user messages:

alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is
not available

In both cases (reservation or not), EnforcePartLimits=YES is set and the job memory is higher than the partition limit, so in both cases the job should be rejected at submission time.

Can you confirm that you get the user errors if the node isn't in a reservation and the job doesn't request the reservation, but the memory is higher than the limit?

2. The automatic adjustments Slurm makes, increasing pn_min_cpus when the job memory is higher than the limit, are arguably removable. Especially since the CPU increase goes not only above MaxCPUsPerNode but also above the number of CPUs available on any of the job partitions' nodes.

3. --mem[-per-cpu]=0 is a special case that sets the memory to the smallest memory size of any of the job's allocated nodes. The problem is that this happens after memory-limit validation, thus bypassing the limits set for the cluster/partition.

So there are a couple of things to be fixed here.
Comment 32 Ole.H.Nielsen@fysik.dtu.dk 2018-04-17 04:59:02 MDT
(In reply to Alejandro Sanchez from comment #31)
> I think there are 3 confirmed issues derived from this bug we need to
> address:
> 
> 1. Job rejection behavior differs depending on job requesting reservation or
> not. If it requests a reservation, job is accepted at submission time and
> left PD (Reservation). If job doesn't request reservation then it is
> rejected at submission time with the user messages:
> 
> alex@ibiza:~/t$ sbatch -p xeon24 --mem=250G --wrap="true"
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is
> not available
> 
> Both cases (reservation or not), EnforcePartLimits=YES and job mem higher
> than partition limit. So in both cases job should me rejected at submission
> time.
> 
> Can you confirm you get the user errors if the node isn't in a reservation
> and the job doesn't request the reservation bu the memory is higher than the
> limit?

Confirmed.  I deleted the ReservationName=Test1 and submitted a job:

$ sbatch -p xeon24_test --mem=250G --wrap="true"
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available


> 2. The automatic adjustments Slurm does increasing pn_min_cpus when the job
> mem is higher than the limit are arguably removable. Specially since the CPU
> increase goes not only above MaxCPUSPerNode but above the number of CPUs
> available on any of the job partitions' nodes.
> 
> 3. --mem[-per-cpu]=0 is a special case setting the memory to the smallest
> memory size of any of the job allocated nodes. The problem is that this
> happens after memory limit validation thus bypassing the limits set in the
> cluster/partition.
> 
> So there's a couple of things to be fixed here.

That's a fine analysis!  I hope bug fixes can make it into 17.11.6 (or later).

Are there any further tests which I need to do?
Comment 33 Alejandro Sanchez 2018-04-17 05:04:54 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #32)
> That's a fine analysis!  I hope bug fixes can make it into 17.11.6 (or
> later).

I think 1) and 3) can most probably be fixed for 17.11. Not sure if .6 or .7, since we planned on releasing .6 this week. 2) will most probably land in 18.08 since it's a change in functionality. I don't want to promise anything though, since I first want to have the fixes ready and discuss internally which versions they land in.

> Are there any further tests which I need to do?

Not for now. Thanks for your follow-up comments and feedback; they helped me identify the issues.
Comment 35 Alejandro Sanchez 2018-05-03 09:50:05 MDT
Ole, just to update you: there's a draft patch we're discussing internally that seems to alleviate the issues here. We still need to refine it and discuss where to land it, or whether we should split it into parts that land in different versions.
Comment 36 Alejandro Sanchez 2018-05-10 02:44:03 MDT
Hi Ole, a patch has finally been committed to:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

It is available starting from 17.11.7. You can apply it at your earliest convenience by appending ".patch" to the GitHub URL. This fix seems to address the three issues we identified here, plus another one involving the same code paths in bug 4895.
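For example, one way to fetch and apply it against a 17.11 source tree (the tree path is illustrative):

$ curl -LO https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e.patch
$ cd slurm-17.11.5
$ patch -p1 < ../bf4cb0b1b01f3e.patch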

We're going to go ahead and mark this as resolved/fixed. Please reopen if you find any new issue after applying it. Thanks.