Ticket 9250 - --mem-per-cpu not working from the #SBATCH header, but working only from the srun option when executing steps in the background with --exclusive and &
Summary: --mem-per-cpu not working from the #SBATCH header, but working only from the srun option when executing steps in the background with --exclusive and &
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.5
Hardware: Cray XC Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-17 18:24 MDT by Alexis
Modified: 2020-09-02 13:27 MDT

See Also:
Site: Pawsey
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: Other
Machine Name:
CLE Version: CLE6 UP07
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Seven examples are given; descriptions of the examples are in my initial comment (5.66 KB, application/x-tar)
2020-06-17 18:24 MDT, Alexis
Slurm configuration as requested (2.67 KB, application/x-bzip)
2020-06-18 19:03 MDT, Kevin Buckley

Description Alexis 2020-06-17 18:24:57 MDT
Created attachment 14708 [details]
Seven examples are given; descriptions of the examples are in my initial comment

I need to execute 10 job steps, each needing 2.5 GB of memory.

If I do:
#SBATCH --ntasks=10
#SBATCH --mem-per-cpu=2500
The scheduler complains that the memory available is exceeded, even though 25000 < 64000 MB available on the node.
The scheduler is calculating a maximum of 1250 MB per CPU (as if all 48 CPUs of the node were going to be used), but we are explicitly saying: 10 tasks.
(example 101_ in the tar file)
(The tar file is attached. The scripts will not run on your side due to missing tools, but the tar file contains the Slurm output files and the error messages printed to screen.)
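For concreteness, the 101_ case boils down to something like this (paraphrased; the real scripts are in the tar file, and the loop over .darshan files here just stands in for the actual inputs):

    #!/bin/bash -l
    #SBATCH --ntasks=10
    #SBATCH --mem-per-cpu=2500

    # launch the 10 steps in the background, one task each, then wait for all of them
    for archivo in *.darshan; do
        srun -n 1 --export=all --mem-per-cpu=$SLURM_MEM_PER_CPU --exclusive darshan-job-summary.pl "$archivo" &
    done
    wait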

Using --mem-per-cpu=1250 also fails, as the steps need more memory than that (example 102_ in the tar file).

So, the only way to make this work is to avoid setting --mem-per-cpu in the header and to define it in the srun command instead (example 106_ in the tar file).

Some other options that look like "tricks" rather than formal settings are (using the same numbering as in the tar file; the srun lines for these variants are sketched after these lists):
    103_ : use a null value for --mem-per-cpu on the srun line
    105_ : use --mem-per-cpu=0 on the srun line

Some other failed options are:
    104_ : do not use a --mem-per-cpu setting at all, but this won't allow the job steps to run at the same time
    107_ : try to convince the scheduler again from the header, with --ntasks-per-node=10, but this fails the same way as 101_
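
For reference (again paraphrased, the full scripts are in the tar file), the srun lines in the working variants differ only in the memory option:

    # 105_ : zero value given on the srun line
    srun -n 1 --export=all --mem-per-cpu=0 --exclusive darshan-job-summary.pl "$archivo" &

    # 106_ : the real per-step requirement given on the srun line only (no --mem-per-cpu in the header)
    srun -n 1 --export=all --mem-per-cpu=2500 --exclusive darshan-job-summary.pl "$archivo" &

(103_ is the same idea as 105_, but with a null value instead of 0.)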

So, my point here is: I think these settings in the header:
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=2500
should work and allow the script to run (together with the use of srun --mem-per-cpu=$SLURM_MEM_PER_CPU --exclusive ... &)

Also, can you please comment on whether you consider examples 103_ and 105_ to be correct settings (or more like tricks)?

Thanks
Comment 1 Jason Booth 2020-06-18 15:45:36 MDT
Would you please attach your slurm.conf, and any include files containing your node configuration?
Comment 2 Kevin Buckley 2020-06-18 19:03:20 MDT
Created attachment 14733 [details]
Slurm configuration as requested

This configuration, modulo Slurm version changes, is unlikely
to be very different from previously supplied files, or from
the one TimW gave a once-over and thumbs-up to when last on
site here.

The Include files, relating to the control IP address, are
provided for completeness but contain no scheduling or
resource configuration info.
Comment 4 Michael Hinton 2020-08-05 17:01:54 MDT
Hello, sorry for the delay.

I think the problem is that the partitions (including debugq) have OverSubscribe=EXCLUSIVE set (via the DEFAULT partition settings). That means that whole nodes are being allocated to jobs. Since whole nodes are being allocated each time, I believe all CPUs are implicitly allocated to the job, even if n=10. This probably throws off the mem-per-cpu calculations.

At any rate, OverSubscribe=EXCLUSIVE is definitely not what you want if you need to share the nodes, which seems to be the assumption.
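
You can check what the partitions actually have set on the live system with something like:

    # the line containing OverSubscribe= shows how the partition hands out nodes
    scontrol show partition debugq | grep -i oversubscribe

(debugq is just one example; the DEFAULT settings apply to the other partitions as well.)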

Thanks,
-Michael
Comment 5 Kevin Buckley 2020-08-05 23:13:06 MDT
> At any rate, OverSubscribe=EXCLUSIVE is definitely not
> what you want if you need to share the nodes, which
> seems to be the assumption.

We do not share nodes on the production Crays here.

A job, or part of a job, where a job spans more than
one node, gets the whole node.
Comment 6 Michael Hinton 2020-08-06 09:07:08 MDT
(In reply to Alexis from comment #0)
> If I do:
> #SBATCH --ntasks=10
> #SBATCH --mem-per-cpu=2500
> The scheduler complains to exceed memory availability even if 25000 < 64000
> Gb available in the node.
> The scheduler is calculating a max of 1250Gb per node (as if 48 tasks were
> going to use the node) but we are explicitly saying: 10
This is because --mem-per-cpu is per *allocated* CPU, not per task. Since you are allocating whole nodes in this cluster, that's 48 CPUs implicitly allocated per node.

From https://slurm.schedmd.com/sbatch.html#OPT_mem-per-cpu:

--mem-per-cpu=<size[units]>

Minimum memory required per allocated CPU... If resources are allocated by core, socket, or whole nodes, then the number of CPUs allocated to a job may be higher than the task count and the value of --mem-per-cpu should be adjusted accordingly.
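
To put numbers on it for your node (taking the configured node memory as roughly 60000 MB):

    48 allocated CPUs x 2500 MB per CPU = 120000 MB requested, far more than the node has
    60000 MB / 48 CPUs = 1250 MB per CPU, which is the ceiling the scheduler reported to you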
Comment 7 Michael Hinton 2020-08-06 11:22:37 MDT
I'm not quite sure what exactly you are trying to accomplish, since you are allocating whole nodes. But I would do this: Don't specify "#SBATCH --mem-per-cpu" at all, so your sbatch job itself has no memory limits. Then, for each srun, just do --mem-per-cpu=2500, and in my testing it should limit each task to 2.5 GB of memory. If the step tries to go over that limit, it will fail and show OUT_OF_MEMORY as the state in sacct.
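
A sketch of what I mean (the loop and the .darshan glob are placeholders for however you generate your file list):

    #!/bin/bash -l
    #SBATCH --ntasks=10
    # deliberately no --mem-per-cpu here, so the job itself carries no memory limit

    for archivo in *.darshan; do
        # each step is limited to 2500 MB per CPU; step-level --exclusive keeps steps from sharing CPUs
        srun -n 1 --export=all --mem-per-cpu=2500 --exclusive darshan-job-summary.pl "$archivo" &
    done
    wait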
Comment 8 Michael Hinton 2020-08-06 11:32:55 MDT
(In reply to Alexis from comment #0)
> Some other failed options are:
>     104_ : do not use --mem-per-cpu setting at all, but this wont allow the
> job steps to run at the same time
What I'm suggesting is close to example 104. I'm not sure why job steps were not running in parallel for you - they are for me in my testing.
Comment 9 Alexis 2020-08-07 17:21:36 MDT
(In reply to Michael Hinton from comment #7)
> I'm not quite sure what exactly you are trying to accomplish, since you are
> allocating whole nodes. 

What I'm trying to accomplish is to run 10 tasks on the node (I do not need more than those 10), but each of the tasks requires at least 2500 MB of memory.
Comment 10 Alexis 2020-08-07 17:25:52 MDT
(In reply to Michael Hinton from comment #8)
> (In reply to Alexis from comment #0)
> > Some other failed options are:
> >     104_ : do not use --mem-per-cpu setting at all, but this wont allow the
> > job steps to run at the same time
> What I'm suggesting is close to example 104. I'm not sure why job steps were
> not running in parallel for you - they are for me in my testing.

Are you sure your test is running all the tasks in parallel? (You probably need to test with long tasks that take a few minutes; then you will clearly be able to see whether they run in parallel or one by one.)

Here they are not running in parallel. The steps are not failing, but they are run one by one, as if each of them required the whole memory of the node. To me this makes sense, since there is no indication of the memory needed: by default each step is then assigned all of the memory, so each step has to wait for the previous one to finish before starting.
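
One way to see this afterwards is to compare the start and end times of the steps, e.g.:

    sacct -j <jobid> --format=JobID,JobName,Start,End,Elapsed,State

(where <jobid> is the job in question).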
Comment 11 Alexis 2020-08-07 17:28:28 MDT
(In reply to Michael Hinton from comment #6)
> (In reply to Alexis from comment #0)
> > If I do:
> > #SBATCH --ntasks=10
> > #SBATCH --mem-per-cpu=2500
> > The scheduler complains to exceed memory availability even if 25000 < 64000
> > Gb available in the node.
> > The scheduler is calculating a max of 1250Gb per node (as if 48 tasks were
> > going to use the node) but we are explicitly saying: 10
> This is because --mem-per-cpu is per *allocated* CPU, not per task. Since
> you are allocating whole nodes in this cluster, that's 48 CPUs implicitly
> allocated per node.
> 
> From https://slurm.schedmd.com/sbatch.html#OPT_mem-per-cpu:
> 
> --mem-per-cpu=<size[units]>
> 
> Minimum memory required per allocated CPU... If resources are allocated by
> core, socket, or whole nodes, then the number of CPUs allocated to a job may
> be higher than the task count and the value of --mem-per-cpu should be
> adjusted accordingly.

So, for the configuration we are using, there is no way to tell the scheduler (from the #SBATCH directives) to assign more memory to each task? OK. I do not like it, but I accept the way things are.
Comment 12 Alexis 2020-08-07 17:39:19 MDT
(In reply to Alexis from comment #0)

> So, the only way to make this work is to avoid setting of (--mem-per-cpu) in
> the header, and define it in the srun command (example 106_ in the tar file)

> Some other options that look like "tricks" rather than formal settings are
> (using numbers as in the examples in the tar file):
>     103_ : to use a null value for --mem-per-cpu in the srun line
>     105_ : to use --mem-per-cpu=0 in the srun line

Then options 105_ and 106_ are the right ways to go (105_ being really what you wanted to achieve with your suggestion, which is similar to 104_). Am I right?

Many thanks.
Alexis
Comment 13 Michael Hinton 2020-08-17 17:42:36 MDT
(In reply to Alexis from comment #12)
> Then options #105 and #106 are the right ways to go (being the 105 really
> what you wanted to achieve with your suggestion similar to 104). Am I right?
Sorry, I didn’t look closely. I would expect #105 and #106 to work as well. #106 is basically what I did in my testing:

    srun -n 1 --export=all --mem-per-cpu=2500 --exclusive darshan-job-summary.pl "$archivo" &

(In reply to Alexis from comment #9)
> What I'm trying to accomplish is to run 10 tasks in the node. (I do not need
> more than those 10). But each of the tasks requires at least 2500Mb of
> memory.
Since you are allocating the whole node, the entire memory of that node is automatically available to your job. And it’s exclusive. So for 10 tasks at 2500 MB/task, there is no need to specify/limit memory for the tasks. 10*2500 = 25000 MB, which is still < 60000 MB. So they should all be able to run in parallel without hitting the node’s memory limit.

If you wanted to run e.g. 100 tasks, that’s when you would need to set --mem-per-cpu like in #106 so Slurm can force the tasks to take turns using the limited memory.
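
The arithmetic behind that:

    10 steps  x 2500 MB = 25000 MB,  which fits in the node's ~60000 MB, so all 10 can run at once
    100 steps x 2500 MB = 250000 MB, which does not fit, so per-step --mem-per-cpu would let Slurm make the extra steps wait for memory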

I’m not sure yet why your tasks are running serially. They are running in parallel for me, and I’ve verified that. Are there any other resources besides memory that might be causing the tasks to be blocked? Is the program itself blocked on some shared resource? Are there any environment variables messing things up? Add -v to srun to help debug the memory options being used.

I would run `scontrol show steps` to help you debug as well. It will show you if Slurm thinks steps are pending or running, as well as show the memory TRES for each step.
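
For example, while the job is running:

    # list the job's steps; each step record includes its state and its TRES (including mem=...)
    scontrol show steps | grep -E 'StepId|State|TRES'

(The exact field layout varies a little between Slurm versions, but the step state and its memory TRES are what to look for.)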

-Michael
Comment 14 Michael Hinton 2020-08-24 13:33:13 MDT
Hi Alexis, does that clear things up, or did you still have questions?
Comment 15 Michael Hinton 2020-09-02 13:27:01 MDT
Hopefully that helps. Feel free to reopen if you have any more questions.

Thanks,
-Michael