Ticket 5382 - Requested partition configuration not available now when using '--mem-per-cpu'
Summary: Requested partition configuration not available now when using '--mem-per-cpu'
Status: RESOLVED DUPLICATE of ticket 5240
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.7
Hardware: Linux Linux
Importance: --- 2 - High Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-07-04 04:08 MDT by Sebastien Varrette
Modified: 2018-07-04 08:58 MDT
Site: University of Luxembourg


Attachments
Main slurm.conf (7.79 KB, text/plain)
2018-07-04 04:08 MDT, Sebastien Varrette

Description Sebastien Varrette 2018-07-04 04:08:56 MDT
Created attachment 7228
Main slurm.conf

We recently upgraded Slurm to version 17.11.7 to address the security issues.
However, since this upgrade, any attempt to allocate more memory per CPU than the default raises an error:

$> srun -p interactive -N 1 --mem-per-cpu=8G --pty bash
srun: error: Unable to allocate resources: Requested partition configuration not available now

The same error also appears in the logs of the slurmctld daemon:

[2018-07-04T12:03:43.539] _slurm_rpc_allocate_resources: Requested partition configuration not available now

Note that using '--mem' seems to work. 

I attach the main configuration files. 

It's probably linked to the fact that this new release seems to enforce the maximum amount of memory per cpu: 

$> scontrol show config | grep -i mempercpu
DefMemPerCPU            = 4096
MaxMemPerCPU            = 4196
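
To make the comparison concrete, here is a minimal sketch that extracts MaxMemPerCPU from `scontrol show config` style output and checks a per-CPU request against it. The helper names `max_mem_per_cpu` and `fits_limit` are hypothetical, the sample text is the output quoted above, and the sketch assumes the limit is a plain number of megabytes (so 8G is taken as 8192 MB).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "scontrol show config" output on stdin and
# print the MaxMemPerCPU value (assumed to be a number of MB).
max_mem_per_cpu() {
    awk -F'=' '/^MaxMemPerCPU[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Hypothetical helper: succeed if a per-CPU request (MB) is within the limit (MB).
fits_limit() {
    local request_mb=$1 limit_mb=$2
    [ "$request_mb" -le "$limit_mb" ]
}

# Sample copied from the output quoted in this ticket.
sample='DefMemPerCPU            = 4096
MaxMemPerCPU            = 4196'

limit=$(printf '%s\n' "$sample" | max_mem_per_cpu)
if fits_limit 8192 "$limit"; then
    echo "8192 MB per CPU is within MaxMemPerCPU=$limit"
else
    echo "8192 MB per CPU exceeds MaxMemPerCPU=$limit"
fi
```

On a live cluster the sample could presumably be replaced by piping `scontrol show config` into `max_mem_per_cpu`; a non-numeric limit is not handled here.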

Any advice on how to correct this problem?
Comment 1 Alejandro Sanchez 2018-07-04 04:53:31 MDT
Hi Sebastien,

This is definitely a duplicate of bug 5240. Historically, when a job requested more memory than the configured MaxMemPer* limit, Slurm made automatic adjustments to try to make the job request fit the limits, including

"increasing cpus_per_task and decreasing mem_per_cpu by factor of X based upon mem_per_cpu limits"

or

"Setting job's pn_min_cpus to Y due to memory limit"
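
As a rough illustration of the first adjustment, the sketch below works through the arithmetic for the request in this ticket. It is one plausible reading of the quoted log message, not the actual Slurm source: the function name `adjust_request` is hypothetical, and the use of ceiling division for both the factor and the reduced per-CPU memory is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the historical adjustment (assumed arithmetic).
# Usage: adjust_request REQUEST_MB LIMIT_MB CPUS_PER_TASK
# Prints: FACTOR NEW_CPUS_PER_TASK NEW_MEM_PER_CPU_MB
adjust_request() {
    local request_mb=$1 limit_mb=$2 cpus=$3
    # Smallest integer factor that brings the per-CPU request under the limit.
    local factor=$(( (request_mb + limit_mb - 1) / limit_mb ))
    # Spread the same total memory over factor times as many CPUs.
    local new_mem=$(( (request_mb + factor - 1) / factor ))
    echo "$factor $(( cpus * factor )) $new_mem"
}

# --mem-per-cpu=8G (8192 MB) against MaxMemPerCPU=4196 with one CPU per task.
adjust_request 8192 4196 1
```

Under these assumptions the 8G request becomes two CPUs at 4096 MB each, so the total memory is unchanged while each CPU stays below the 4196 MB limit. A request already within the limit yields a factor of 1 and is left as is.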

I (and some other people) personally don't like modifying what the user requested; if the requested memory exceeded the limit, I preferred to have the job rejected (based on the EnforcePartLimits value at submit time) or left pending with reason MaxMemPerLimit. The problem is that this change in behavior should have been made in the master branch and documented, rather than checked into 17.11.7, where I unfortunately and incorrectly decided to land commit bf4cb0b1b01f3e165bf.

In bug 5240 comment 24 we decided to revert that change in commit

d52d8f4f0ce1a5b86bb0691630da0dc3dace1683

and we added this commit on top of the revert:

f07f53fc138b22485e7c26903968fa470cc9d98f

to fix a problem with multi-partition requests. Both commits will be in 17.11.8 and onwards, but they can be applied at your earliest convenience. Appending ".patch" to the GitHub commit URL generates a patch-formatted document that can be applied if needed.
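
For illustration, the ".patch" URLs for the two commits can be built as in the sketch below. It assumes the commits live in the SchedMD/slurm repository on GitHub; the helper name `patch_url` is hypothetical. The loop only prints the download and apply commands instead of running them, so they can be reviewed first and then run from the top of a 17.11.7 source tree.

```shell
#!/usr/bin/env bash
# Assumption: the commits are in the SchedMD/slurm repository on GitHub.
repo_url="https://github.com/SchedMD/slurm"

# Hypothetical helper: print the ".patch" URL for a commit hash.
patch_url() {
    echo "${repo_url}/commit/$1.patch"
}

# The revert first, then the multi-partition fix, in the order given above.
for commit in d52d8f4f0ce1a5b86bb0691630da0dc3dace1683 \
              f07f53fc138b22485e7c26903968fa470cc9d98f; do
    # Dry run: show the commands rather than executing them.
    echo "curl -fsSL -o ${commit}.patch $(patch_url "$commit")"
    echo "patch -p1 < ${commit}.patch"
done
```

Printing rather than executing keeps the sketch free of network access and side effects; whether the patches apply cleanly depends on the exact source tree in use.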

Please, let me know if you have further questions. Thanks.

*** This ticket has been marked as a duplicate of ticket 5240 ***
Comment 2 Sebastien Varrette 2018-07-04 07:53:02 MDT
Dear Alejandro, 

Many thanks for the explanation. 

May I still suggest adapting the error message in this context, as 'Requested partition configuration not available now' does not seem fully appropriate in this case?
Comment 3 Alejandro Sanchez 2018-07-04 08:58:40 MDT
(In reply to Sebastien Varrette from comment #2)
> Dear Alejandro, 
> 
> Many thanks for the explanation. 
> 
> May I still suggest adapting the error message in this context, as 'Requested
> partition configuration not available now' does not seem fully appropriate
> in this case?

With the two commits I suggested earlier, which are included from 17.11.8 onwards, the error message should be more precise:

alex@ibiza:~/t$ sbatch --mem-per-cpu=860 --wrap "sleep 9999"
sbatch: error: Batch job submission failed: Memory required by task is not available
alex@ibiza:~/t$