Ticket 1032 - Configuration guidance - condo model
Summary: Configuration guidance - condo model
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 14.03.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
Reported: 2014-08-11 07:16 MDT by Kilian Cavalotti
Modified: 2022-08-16 21:52 MDT

Site: Stanford
Version Fixed: 14.03.7


Attachments
start job in highest priority partition by preemption rather than using lower prio partition (3.25 KB, patch)
2014-08-18 06:07 MDT, Moe Jette

Description Kilian Cavalotti 2014-08-11 07:16:54 MDT
Hi,

I'm looking for some guidance to implement the following setup for our HPC cluster.

We're trying to implement a condo model, where users can purchase their own compute nodes, and be guaranteed to have priority access to those, while still being able to run on the generally available nodes.

We currently have a pool of shared compute nodes (the "normal" partition) that anybody can run on.
In addition to that, we're deploying new groups of nodes that individual PIs purchased. We created a partition for each of those groups of nodes. Each PI has a Slurm "account" and several users are associated with those accounts.

So we have the following partitions:
- normal
- part_PI_A
- part_PI_B

And the following users/accounts:
- user_A1 and user_A2 under account acct_PI_A
- user_B1 and user_B2 under account acct_PI_B


So we're looking for:

1. a way to let user_A1 and user_A2 run mainly in partition part_PI_A, but also in the normal partition if part_PI_A is full. Same thing for PI B. So I guess we're looking for a per-user default partition setting. Does that exist?

2. a way to let users in acct_PI_A run on part_PI_B if those nodes are idle, and to preempt their jobs when users in acct_PI_B submit jobs to run on their own partition. Is there a way to achieve this?


Thanks a lot for any insight you could provide.
Comment 1 Moe Jette 2014-08-11 11:41:30 MDT
The more partitions that you create, the more difficult it will be to keep each of them fully utilized. So while I would generally recommend against the condo model, Slurm can support it.

Here is some guidance to make the best of it; feel free to ask follow-up questions.

You can submit jobs to multiple partitions. The job will get scheduled in whichever partition permits it to start earliest. Partitions with a higher Priority value get tested first.

While Slurm does not support per-user default partitions, you can use a job_submit plugin for that.

You can establish preemption rules on a per-partition or per-QOS basis.

You can create overlapping partitions. For example the nodes in partition part_PI_A can also be in partition normal.

You can assign nodes a "weight" so that nodes with lower weights get used before those with a higher weight within a partition.

============================================================

Here is a suggested configuration (making up node names)

JobSubmitPlugins=stanford   # described below
PreemptMode=requeue
PreemptType=preempt/partition_prio

PartitionName=normal    Nodes=tux[000-099] Weight=20 Default=YES
PartitionName=part_PI_A Nodes=tux[100-199] Weight=30 Default=NO AllowGroups=PI_A
PartitionName=part_PI_B Nodes=tux[200-299] Weight=30 Default=NO AllowGroups=PI_B
PartitionName=overflow  Nodes=tux[100-299] Weight=10 Default=NO AllowGroups=PI_A,PI_B

============================================================

What your job submit plugin would do is:
1. Read some configuration information about who should be able to access the various partitions.
2. If the user specifies a partition, just return.
3. For users in acct_PI_A, set the partition parameter to "part_PI_A,normal,overflow".
4. For users in acct_PI_B, set the partition parameter to "part_PI_B,normal,overflow".
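
If you would rather prototype this in Lua than write a site-specific C plugin, the stock job_submit/lua plugin (JobSubmitPlugins=lua, with a job_submit.lua script next to slurm.conf) can express the same logic. A rough, untested sketch using the account and partition names above:

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- step 2: if the user explicitly requested a partition, leave the job alone
   if job_desc.partition ~= nil then
      return slurm.SUCCESS
   end
   -- steps 3 and 4: owner partition first, then the shared pools
   -- (note: job_desc.account is only set if the user passed --account;
   -- a real plugin would also look up the user's default account)
   if job_desc.account == "acct_PI_A" then
      job_desc.partition = "part_PI_A,normal,overflow"
   elseif job_desc.account == "acct_PI_B" then
      job_desc.partition = "part_PI_B,normal,overflow"
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

Step 1 (reading the account-to-partition mapping from a file instead of hard-coding it) is left out of the sketch.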

============================================================

For more information, please see:
http://slurm.schedmd.com/preempt.html
http://slurm.schedmd.com/job_submit_plugins.html
http://slurm.schedmd.com/slurm.conf.html
Comment 2 Kilian Cavalotti 2014-08-12 04:52:38 MDT
Hi Moe,

Thanks a lot for the guidance, much appreciated. I have additional questions below.

(In reply to Moe Jette from comment #1)
> The more partitions that you create, the more difficult it will be to keep
> each of them fully utilized. So while I would generally recommend against
> the condo model, Slurm can support it.

Besides partitions, what would you recommend to better handle this "shared pool"/"owned nodes" scheme? 

> You can submit jobs to multiple partitions. The job will get scheduled in
> whichever partition permits it to start earliest. 

I did a quick test with 2 partitions, and it doesn't seem to work as I expected. It seems to work fine with sbatch, but not with srun. 

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test         up   infinite      2   idle sh-5-[33-34]
owned        up 7-00:00:00      1   idle sh-7-22

$ sbatch -p owned,test -n 16 --wrap="sleep 10000"
Submitted batch job 171344
$ sbatch -p owned,test -n 16 --wrap="sleep 10000"
Submitted batch job 171346
$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            171344     owned   sbatch   kilian  R       0:14      1 sh-7-22
            171346      test   sbatch   kilian  R       0:01      1 sh-5-33

But with srun, the 2nd job waits with (Resources):

$ srun -p owned,test -n 16 sleep 10000
sh-7-22:~$

and in another terminal:
$ srun -p owned,test -n 16 sleep 10000
srun: job 171348 queued and waiting for resources

$ squeue  -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            171347     owned    sleep   kilian  R       0:29      1 sh-7-22
            171348 owned,tes    sleep   kilian PD       0:00      1 (Resources)

Is that expected?


> Partitions with a higher
> Priority value get tested first.
>
> While Slurm does not support per-user default partitions, you can use a
> job_submit plugin for that.

What about setting a SLURM_PARTITION env variable per-user? SLURM_PARTITION=part_PI_A,normal would submit in part_PI_A if possible, then in normal, and could still be overridden by --partition option in the command-line, right? Preliminary tests seem to indicate that this could work as a per-user default partition, but I'm not sure about potential side-effects.

Thanks a lot for all the other pointers too, I'll look into those options.
Comment 3 Moe Jette 2014-08-12 05:13:47 MDT
See responses in-line below.

(In reply to Kilian Cavalotti from comment #2)
> Hi Moe,
> 
> Thanks a lot for the guidance, much appreciated. I have additional questions
> below.
> 
> (In reply to Moe Jette from comment #1)
> > The more partitions that you create, the more difficult it will be to keep
> > each of them fully utilized. So while I would generally recommend against
> > the condo model, Slurm can support it.
> 
> Besides partitions, what would you recommend to better handle this "shared
> pool"/"owned nodes" scheme? 

How about fair share scheduling?
That would permit you to set a target allocation of the entire system on a per account and user basis. For more information, see:
http://slurm.schedmd.com/priority_multifactor.html
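
Roughly speaking (the weight and share values below are purely illustrative), that means enabling the multifactor priority plugin in slurm.conf plus per-account shares in the accounting database:

# slurm.conf
PriorityType=priority/multifactor
PriorityWeightFairshare=100000

# accounting database: per-account target shares (values are made up)
sacctmgr modify account name=acct_PI_A set fairshare=40
sacctmgr modify account name=acct_PI_B set fairshare=40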


> > You can submit jobs to multiple partitions. The job will get scheduled in
> > whichever partition permits it to start earliest. 
> 
> I did a quick test with 2 partitions, and it doesn't seem to work as I
> expected. It seems to work fine with sbatch, but not with srun. 
> 
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> test         up   infinite      2   idle sh-5-[33-34]
> owned        up 7-00:00:00      1   idle sh-7-22
> 
> $ sbatch -p owned,test -n 16 --wrap="sleep 10000"
> Submitted batch job 171344
> $ sbatch -p owned,test -n 16 --wrap="sleep 10000"
> Submitted batch job 171346
> $ squeue -u kilian
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>             171344     owned   sbatch   kilian  R       0:14      1 sh-7-22
>             171346      test   sbatch   kilian  R       0:01      1 sh-5-33
> 
> But with srun, the 2nd job waits with (Resources):
> 
> $ srun -p owned,test -n 16 sleep 10000
> sh-7-22:~$
> 
> and in another terminal:
> $ srun -p owned,test -n 16 sleep 10000
> srun: job 171348 queued and waiting for resources
> 
> $ squeue  -u kilian
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>             171347     owned    sleep   kilian  R       0:29      1 sh-7-22
>             171348 owned,tes    sleep   kilian PD       0:00      1
> (Resources)
> 
> Is that expected?

No. I'll need to investigate and get back to you later on this.


> > Partitions with a higher
> > Priority value get tested first.
> >
> > While Slurm does not support per-user default partitions, you can use a
> > job_submit plugin for that.
> 
> What about setting a SLURM_PARTITION env variable per-user?
> SLURM_PARTITION=part_PI_A,normal would submit in part_PI_A if possible, then
> in normal, and could still be overridden by --partition option in the
> command-line, right? Preliminary tests seem to indicate that this could work
> as a per-user default partition, but I'm not sure about potential
> side-effects

That will work fine. The command line option does override environment variables.


> Thanks a lot for all the other pointers too, I'll look into those options.
Comment 4 Kilian Cavalotti 2014-08-12 06:05:39 MDT
> How about fair share scheduling?
> That would permit you to set a target allocation of the entire system on a
> per account and user basis. for more information, see:
> http://slurm.schedmd.com/priority_multifactor.html

Well, the thing is that we don't only want owners to get a privileged access to some compute resources, we want them to get exclusive access to *their* physical nodes. Because they can have different hardware configurations, such as memory amount, local storage options, and so on. That's why mapping users to partitions using AllowGroups seemed to make sense.

Would fair share allow us to prevent users from running on specific nodes? (we don't want the general population to run on owners' nodes)


> No. I'll need to investigate and get back to you later on this.

Thanks. Do you need me to open a specific ticket for this?

> That will work fine. The command line option does override environment
> variables.

Great!
Thanks again.
Comment 5 Moe Jette 2014-08-12 06:13:17 MDT
(In reply to Kilian Cavalotti from comment #4)
> > How about fair share scheduling?
> > That would permit you to set a target allocation of the entire system on a
> > per account and user basis. for more information, see:
> > http://slurm.schedmd.com/priority_multifactor.html
> 
> Well, the thing is that we don't only want owners to get a privileged access
> to some compute resources, we want them to get exclusive access to *their*
> physical nodes. Because they can have different hardware configurations,
> such as memory amount, local storage options, and so on. That's why mapping
> users to partitions using AllowGroups seemed to make sense.
> 
> Would fair share allow us to prevent users from running on specific nodes? (we don't
> want the general population to run on owners' nodes)  

No. The scheme we are working on with multiple partitions would be best to accomplish that.


> > No. I'll need to investigate and get back to you later on this.
> 
> Thanks. Do you need me to open a specific ticket for this?

No need for a new ticket. I was able to replicate this and have a fix for you. The change will be in version 14.03.7 (when released, probably next week) or you can use the patch here:
https://github.com/SchedMD/slurm/commit/e941364920a5f910144d2564ee14f91a61c1b3cb.patch
Comment 6 Kilian Cavalotti 2014-08-12 06:24:20 MDT
(In reply to Moe Jette from comment #5)
> No. The scheme we are working on with multiple partitions would be best to
> accomplish that.

Ok, so I have requirement 1. working with setting SLURM_PARTITION for users in owners accounts and your srun patch.

Now I'm gonna experiment with preemption for requirement 2. BTW, what happens if I mix nodes with different amounts of memory in the same partition?

> No need for a new ticket. I was able to replicate this and have a fix for
> you. The change will be in version 14.03.7 ( when released probably next
> week) or you can use the patch here:
> https://github.com/SchedMD/slurm/commit/
> e941364920a5f910144d2564ee14f91a61c1b3cb.patch

That was fast, thank you!
Comment 7 Moe Jette 2014-08-12 06:27:31 MDT
(In reply to Kilian Cavalotti from comment #6)
> (In reply to Moe Jette from comment #5)
> > No, The scheme we are working on with multiple partitions would be best to
> > accomplish that.
> 
> Ok, so I have requirement 1. working with setting SLURM_PARTITION for users
> in owners accounts and your srun patch.
> 
> Now I'm gonna experiment with preemption for requirement 2. BTW, what
> happens if I mix nodes with different amounts of memory in the same
> partition?

No problem. I would recommend using the node "Weight" configuration option so that nodes with smaller memory sizes are used first and the larger memory nodes tend to be saved for jobs that need it.
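
For example, with made-up node names and memory sizes, something like this in slurm.conf steers jobs to the small-memory nodes first:

# smaller-memory nodes get the lower weight and are allocated first
NodeName=tux[100-149] RealMemory=65536 Weight=10
# large-memory nodes are left for jobs that actually need them
NodeName=tux[150-199] RealMemory=262144 Weight=50
PartitionName=part_PI_A Nodes=tux[100-199] Default=NO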
Comment 8 Kilian Cavalotti 2014-08-12 08:03:20 MDT
I have one more question: I'm trying to submit to multiple partitions, with the overflow partition concept you mentioned in #c1.

I use sbatch -p part_PI_A,overflow,normal, but it looks like the job only gets to part_PI_A and overflow. Slurmctld logs the following:

 _valid_job_part: can't check multiple partitions with partition based associations

I'm not sure what it means, could you please give me some details?


I indeed have partition based associations, our current setup is, for each user:
   Cluster    Account       User  Partition     Share GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS GrpCPURunMins
---------- ---------- ---------- ---------- --------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- -------------
   cluster      accnt       user     normal         1                                                                                                                                          long,normal
   cluster      accnt       user        dev         1                                                                                                                                                  dev
   cluster      accnt       user                    1                                                                                                                                               normal

Thanks.
Comment 9 Moe Jette 2014-08-12 08:37:26 MDT
(In reply to Kilian Cavalotti from comment #8)
> I have one more question: I'm trying to submit to multiple partitions, with
> the overflow partition concept you mentioned in #c1.
> 
> I use sbatch -p part_PI_A,overflow,normal, but it looks like the job only
> gets to part_PI_A and overflow. Slurmctld logs the following:
> 
>  _valid_job_part: can't check multiple partitions with partition based
> associations
> 
> I'm not sure what it means, could you please give me some details?
> 
> 
> I indeed have partition based associations, our current setup is, for each
> user:
>    Cluster    Account       User  Partition     Share GrpJobs GrpNodes 
> GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs
> MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS
> GrpCPURunMins
> ---------- ---------- ---------- ---------- --------- ------- --------
> -------- ------- --------- ----------- ----------- ------- -------- --------
> --------- ----------- ----------- -------------------- ---------
> -------------
>    cluster      accnt       user     normal         1                       
> long,normal
>    cluster      accnt       user        dev         1                       
> dev
>    cluster      accnt       user                    1                       
> normal
> 
> Thanks.

Are you establishing different limits or fair-share for users depending upon the partition they run in?
If no, there is no reason to configure a separate association for each partition.
If yes, then you will need to configure an association for the overflow partition as well.

Given my limited understanding of your environment, I'm not sure that separate limits or fair-share by partition benefits you.
Comment 10 Kilian Cavalotti 2014-08-13 07:06:16 MDT
Hi Moe, 

(In reply to Moe Jette from comment #9)
> Are you establishing different limits or fair-share for users depending upon
> the partition they run in?

Yes. Here is some context:

- we have some general limits we want to impose on all users (MaxCPUsPU, MaxJobsPU, MaxWall...) so we have a "normal" QOS.

- we want users to be able to quickly get an interactive shell on a compute node for compiling, debugging, or testing code. To that end, we dedicated a couple of nodes, and created a partition (dev) for them. To get everyone a chance to get a shell when they need it, we also wanted to limit the number of jobs and tasks that a user can submit to those nodes, so we created a QOS (dev) that we mapped to that partition.   

- we have this MaxWall limit of 2 days in the normal QOS, but we also have some users that need to run jobs for a longer time. So we added a "long" QOS with a MaxWall of 7 days, so that users who want to use it explicitly need to request it. We also don't want long jobs to run on our special-purpose nodes (large memory and GPU nodes), so we only want this QOS to be usable on the "normal" partition.

So we allow:
- dev QOS on dev partition
- normal and long QOS on normal partition
- normal QOS on all (the other) partitions 
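
For reference, the QOS side of a scheme like this boils down to a handful of sacctmgr commands; a hedged sketch, with illustrative limit values rather than the site's real ones:

# "normal" exists by default; create the other two
sacctmgr add qos long
sacctmgr add qos dev
# illustrative limit values
sacctmgr modify qos normal set MaxWall=2-00:00:00 MaxJobsPerUser=64
sacctmgr modify qos long set MaxWall=7-00:00:00
sacctmgr modify qos dev set MaxWall=08:00:00 MaxJobsPerUser=2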

If you have other recommendations to achieve this kind of setup, I'm open to suggestions.

> If no, there is no reason to configure a separate association for each
> partition.
> If yes, then you will need to configure an association for the overflow
> partition as well.

Got it, I'll do that.
 
> Given my limited understanding of your environment, I'm not sure that
> separate limits or fair-share by partition benefits you.

I hope the explanation above makes some sense. If not, I can provide the conf files and database outputs if you want.
Comment 11 Moe Jette 2014-08-13 07:12:08 MDT
(In reply to Kilian Cavalotti from comment #10)
> Hi Moe, 
> 
> (In reply to Moe Jette from comment #9)
> > Are you establishing different limits or fair-share for users depending upon
> > the partition they run in?
> 
> Yes. Here is some context:
> 
> - we have some general limits we want to impose on all users (MaxCPUsPU,
> MaxJobsPU, MaxWall...) so we have a "normal" QOS.
> 
> - we want users to be able to quickly get an interactive shell on a compute
> node for compiling, debugging, or testing code. To that end, we dedicated a
> couple of nodes, and created a partition (dev) for them. To get everyone a
> chance to get a shell when they need it, we also wanted to limit the number
> of jobs and tasks that a user can submit to those nodes, so we created a QOS
> (dev) that we mapped to that partition.   
> 
> - we have this MaxWall limit of 2 days in the normal QOS, but we also have
> some users that need to run jobs for a longer time. So we added a "long" QOS
> with a MaxWall of 7 days, so that users who want to use it explicitly need
> to request it. We also don't want long jobs to run on our special-purpose
> nodes (large memory and GPU nodes), so we only want this QOS to be usable on
> the "normal" partition.
> 
> So we allow:
> - dev QOS on dev partition
> - normal and long QOS on normal partition
> - normal QOS on all (the other) partitions 
> 
> If you have other recommendations to achieve this kind of setup, I'm open to
> suggestions.

That seems like a reasonable approach.
The job_submit plugin that I previously referenced may be helpful to set QOS that match partition name if desired (i.e. dev)
Comment 12 Kilian Cavalotti 2014-08-13 11:00:45 MDT
(In reply to Moe Jette from comment #11)
> That seems like a reasonable approach.
> The job_submit plugin that I previously referenced may be helpful to set QOS
> that match partition name if desired (i.e. dev)

Well, it seems that having partition based associations will prevent submitting to multiple partitions, is this correct?

In my limited testing, if I have a QOS associated with each of the partitions and I submit a job to multiple partitions, I get "_valid_job_part: can't check multiple partitions with partition based associations" in the log, and "scontrol show job" seems to indicate that the job is only submitted to the first partition listed.

# sacctmgr list assoc user=kilian format=User,Partition,QOS
      User  Partition           QOS
---------- ---------- -------------
    kilian        gpu        normal
    kilian      owned        normal
    kilian     normal   long,normal
    kilian        dev           dev
    kilian  part_PI_A        normal

$ sbatch -p "part_PI_A,owned,normal" -n 16 -N1 --wrap="sleep 10000"
Submitted batch job 172082
$ sbatch -p "part_PI_A,owned,normal" -n 16 -N1 --wrap="sleep 10000"
Submitted batch job 172083
$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            172083 part_PI_A   sbatch   kilian PD       0:00      1 (None)
            172082 part_PI_A   sbatch   kilian  R       0:02      1 sh-7-22
$ scontrol show job 172083
JobId=172083 Name=sbatch
   UserId=kilian(215845) GroupId=ruthm(32264)
   Priority=10211 Nice=0 Account=ruthm QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2014-08-13T15:56:54 EligibleTime=2014-08-13T15:56:54
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> Partition=part_PI_A AllocNode:Sid=sh-ln02:28027
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/kilian
   StdErr=/home/kilian/slurm-172083.out
   StdIn=/dev/null
   StdOut=/home/kilian/slurm-172083.out



So unless there's a way to submit to multiple partitions with partition-based associations, I guess I'll have to remove those partition-based associations and develop a job_submit plugin instead.
Comment 13 Moe Jette 2014-08-13 12:07:08 MDT
(In reply to Kilian Cavalotti from comment #12)
> (In reply to Moe Jette from comment #11)
> > That seems like a reasonable approach.
> > The job_submit plugin that I previously referenced may be helpful to set QOS
> > that match partition name if desired (i.e. dev)
> 
> Well, it seems that having partition based associations will prevent
> submitting to multiple partitions, is this correct?

I was not expecting that, but have confirmed this behaviour. I'll check with the author to see if he has other suggestions, but a job_submit plugin is looking more attractive right now.


The relevant bit of code from src/slurmctld/job_mgr.c:

		while ((part_ptr_tmp = (struct part_record *)list_next(iter))) {
			/* FIXME: When dealing with multiple partitions we
			 * currently can't deal with partition based
			 * associations.
			 */
			memset(&assoc_rec, 0,
			       sizeof(slurmdb_association_rec_t));
			if (assoc_ptr) {
				assoc_rec.acct      = assoc_ptr->acct;
				assoc_rec.partition = part_ptr_tmp->name;
				assoc_rec.uid       = job_desc->user_id;

				assoc_mgr_fill_in_assoc(
					acct_db_conn, &assoc_rec,
					accounting_enforce, NULL);
			}

			if (assoc_ptr && assoc_rec.id != assoc_ptr->id) {
				info("_valid_job_part: can't check multiple "
				     "partitions with partition based "
				     "associations");
				rc = SLURM_ERROR;
Comment 14 Kilian Cavalotti 2014-08-13 12:25:08 MDT
> I was not expecting that, but have confirmed this behaviour. I'll check with
> the author to see if he has other suggestions, 

Aah, I see there's a FIXME in the code, so hopefully it can be fixed. :)

> but a job_submit plugin is
> looking more attractive right now.

I'm willing to explore that path too, but I kind of wanted to stay away from it as long as possible, in order to limit site-specific customizations (esp. hard-coding partition and account names in the plugin), so we can limit our specifics to the conf files and the database. 

So I'm definitely interested in what Danny could say about this.

Thanks again for all the info!
Comment 15 Danny Auble 2014-08-14 00:17:50 MDT
Kilian, so the bad news is this FIXME is quite involved and most likely wouldn't be worth the effort.  The good news is there are a couple of other (in my opinion) more attractive options than using partition based associations in this set up.  Hopefully one of these will work for you.

1. Don't use partition based associations at all.  Given your setup I don't see a reason they are needed (yet).  You already put most of the limits you need in a QOS, so it doesn't appear you need them for association based limits, which is good.  Having multiple associations for users would only prove useful to limit the QOS that can be used in partitions (it appears).  If this is the case, you can limit the QOS that is allowed in the partition with the AllowQOS partition option.  You can write a simple job submit plugin that will set the QOS appropriately if you don't want to force your users to do it every time.  This will allow you to only have the one association per user/account and probably give you better accounting, since right now you would get accounting based primarily on partition.

2. Only have a partition association for the dev partition, use the normal/non-partition association for all other partitions.  This will allow you to get the desired behaviour of multiple-partition submission as long as "dev" isn't one of the partitions. The FIXME code only comes into play when the default association isn't used.  This method will at least fix this scenario, but I am hoping we don't have to use this unless there is a reason I am forgetting/not understanding. 

Let me know if one of these options will work for you.  The partition based associations are sort of a pain to work with and I will rarely advise people to use them, so I would opt for option 1 if possible.  If these don't work out please let me know what they are missing and we can work on other options to get the partition associations out or at least minimize their use.
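
To illustrate option 1, the QOS-defaulting part of such a job_submit plugin can be just a few lines of job_submit/lua; a naive, untested sketch using the QOS and partition names from this ticket:

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- default the QOS to match the partition when the user did not request one
   -- (naive: assumes a single partition name, not a comma-separated list)
   if job_desc.qos == nil then
      if job_desc.partition == "dev" then
         job_desc.qos = "dev"
      else
         job_desc.qos = "normal"
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end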
Comment 16 Kilian Cavalotti 2014-08-14 06:19:39 MDT
> 1. Don't use partition based associations at all.  Given your setup I don't
> see a reason they are needed (yet).  You already put most of the limits you
> need in a QOS, so it doesn't appear you need them for
> association based limits, which is good.  Having multiple associations for
> users would only prove useful to limit the QOS that can be used in
> partitions (it appears).  

That's right, that's the only way we found to statically "map" QOSes to partitions  and not have users specify the QOS at submission time.

What we were looking for, really, is a way to set, for instance, MaxCPUsPU for a partition. Generally speaking, we tend to have different limits on each partition, so we were trying to find a way to set a default QOS per partition (so that users don't have to add --qos to their submission command), and limit the QOSes that could be used in each partition. 

> If this is the case you can limit the QOS that is
> allowed in the partition with the AllowQOS partition option. 

That would help, I missed this option, thanks.

> You can write
> a simple job submit plugin that will set the QOS appropriately if you don't
> want to force your users to do it every time.  

That's the part I wanted to avoid (see my previous comment) to limit site-specific customizations, but it looks like it would be the best way to achieve our goals anyway. So I'm gonna explore this.

> This will allow you to only
> have the one association per user/account and probably give you better
> accounting since right now you would get accounting based primarily on
> partition.

Right, it's a bit of a mess right now. :)
We basically have 3 associations per user in the DB, so reconciling all of them for global accounting takes extra effort.

> 2. Only have a partition association for the dev partition, use the
> normal/non-partition association for all other partitions.  This will allow
> you to get the desired behaviour of multiple-partition submission as long as
> "dev" isn't one of the partitions. The FIXME code only comes into play when
> the default association isn't used.  This method will at least fix this
> scenario, but I am hoping we don't have to use this unless there is a reason
> I am forgetting/not understanding. 

Well, in our current setup, we have:
1. "dev" QOS only on "dev" partition and nothing else
2. "long" QOS on "normal" partition but nowhere else
3. "normal" QOS on all partitions (incl. "normal) except "dev"

So maybe we can get 2. with the AllowQOS partition option. 

> Let me know if one of these options will work for you.  The partition based
> associations are sort of a pain to work with and I will rarely advise people
> to use them, so I would opt for option 1 if possible.  If these don't work
> out please let me know what they are missing and we can work on other
> options to get the partition associations out or at least minimize their use.

I'm gonna experiment a bit and see how it goes. I'll definitely let you know what we end up with.

Do you mind keeping this ticket open for a few more days, in case I have more questions?

Anyway, thanks a ton to both of you for your great feedback and advice!
Comment 17 Moe Jette 2014-08-14 06:47:42 MDT
(In reply to Kilian Cavalotti from comment #16)
> Do you mind keeping this ticket open for a few more days, in case I have
> more questions?

No problem.

> Anyway, thanks a ton to both of you for your great feedback and advice!

That's what we are here for.
Comment 18 Kilian Cavalotti 2014-08-14 12:56:25 MDT
I now have a weird issue I can't seem to diagnose. I modified my setup to set AllowGroups and {Allow,Deny}QOS on partitions instead of using partition-based associations. So I now have in my slurm.conf file:

  PartitionName=normal \
    Default=YES \
    AllowQos=normal,long \
        nodes=...
  
  PartitionName=dev \
    AllowQos=dev \
        nodes=...

  PartitionName=part_PI_A \
        nodes=...

  PartitionName=owners \
        nodes={all the nodes that are also in part_PI_X partitions}

And my associations are now one per user, in the form of:
   Account       User  Partition              QOS   Def QOS
---------- ---------- ---------- ---------------- ---------
      PI_A     kilian             dev,long,normal    normal

That works fine, except when I define SLURM_PARTITION for the user so that the job is submitted to multiple partitions. Sbatch jobs seem to be submitted to the default partition no matter what. Using --partition in the command line works as expected, though.


$ export SLURM_PARTITION=part_PI_A,owners,normal
$ sbatch -vvv -n 16 -N 1 --wrap="sleep 10000"
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `kilian'
sbatch: uid               : 215845
sbatch: gid               : 32264
sbatch: cwd               : /home/kilian
sbatch: ntasks            : 16 (set)
sbatch: nodes             : 1-1
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : default         <<<< Not what I expect
sbatch: profile           : `NotSet'

So the job doesn't make it to the part_PI_A partition at all. 

With --partition, it works fine:

$ sbatch -vvv -p part_PI_A,owners,normal -n 16 -N1 --wrap="sleep 10000"
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `kilian'
sbatch: uid               : 215845
sbatch: gid               : 32264
sbatch: cwd               : /home/kilian
sbatch: ntasks            : 16 (set)
sbatch: nodes             : 1-1
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : part_PI_A,owners,normal
sbatch: profile           : `NotSet'

With SLURM_PARTITION and srun, it works too:

$ export SLURM_PARTITION=part_PI_A,owners,normal
$ srun --pty bash
srun: job 173698 queued and waiting for resources
^Z
[1]+  Stopped                 srun --pty bash

$ scontrol show job  173698
JobId=173698 Name=bash
[...]
   Partition=part_PI_A,owners,normal AllocNode:Sid=sh-ln01:35736
[...]


Any idea what I may be missing?
Thanks.
Comment 19 Moe Jette 2014-08-15 04:33:16 MDT
(In reply to Kilian Cavalotti from comment #18)
> $ export SLURM_PARTITION=part_PI_A,owners,normal
> $ sbatch -vvv -n 16 -N 1 --wrap="sleep 10000"
> sbatch: defined options for program `sbatch'
> sbatch: ----------------- ---------------------
> sbatch: user              : `kilian'
> sbatch: uid               : 215845
> sbatch: gid               : 32264
> sbatch: cwd               : /home/kilian
> sbatch: ntasks            : 16 (set)
> sbatch: nodes             : 1-1
> sbatch: jobid             : 4294967294 (default)
> sbatch: partition         : default         <<<< Not what I expect
> sbatch: profile           : `NotSet'

Each of the resource allocation commands uses a different env var:
sbatch/opt.c:  {"SBATCH_PARTITION",     OPT_STRING,     &opt.partition,    
salloc/opt.c:  {"SALLOC_PARTITION",     OPT_STRING,     &opt.partition,     NULL          },
srun/libsrun/opt.c:{"SLURM_PARTITION",     OPT_STRING,     &opt.partition,     NULL             },

That offers a bit more flexibility than using a single env var.
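
So a per-user default covering all three commands would need all three variables set, for example in the user's shell profile (partition list taken from this ticket):

export SBATCH_PARTITION=part_PI_A,owners,normal   # sbatch
export SALLOC_PARTITION=part_PI_A,owners,normal   # salloc
export SLURM_PARTITION=part_PI_A,owners,normal    # srun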
Comment 20 Kilian Cavalotti 2014-08-15 05:29:22 MDT
> Each of the resource allocation commands uses a different env var:
> sbatch/opt.c:  {"SBATCH_PARTITION",     OPT_STRING,     &opt.partition,    
> salloc/opt.c:  {"SALLOC_PARTITION",     OPT_STRING,     &opt.partition,    
> NULL          },
> srun/libsrun/opt.c:{"SLURM_PARTITION",     OPT_STRING,     &opt.partition,  
> NULL             },
> 
> That offers a bit more flexibility than using a single env var.

Aaah, thanks! It makes a lot of sense. I was inclined to believe that SLURM_PARTITION would cover them all (as opposed to something like SRUN_PARTITION, for instance), but that's just because I didn't look at the salloc and sbatch man pages.

Thanks.
Comment 21 Kilian Cavalotti 2014-08-15 11:50:26 MDT
I'm getting there, things are looking good, but I have one more question.

I'm trying to implement my 2nd requirement, which is:
- having members of each owner account primarily run on their nodes, 
- allowing them to run on other owners' nodes when their own nodes are all busy,
- preempt those jobs when the rightful owner wants to run.

To give a concrete example:
- "part_PI_A", "part_PI_B" and "part_PI_C" are each 1 node partitions.
- "owners" is a 3 nodes overflow partitions (part_PI_{A,B,C})
- userA from PI_A submits 2 exclusive jobs with -p "part_PI_A,owners": the first one runs on part_PI_A, the second one runs on part_PI_B, which is what I want.
- userB from PI_B submits a job with -p "part_PI_B,owners". I'd like this job to preempt userA's 2nd job which runs on part_PI_B. But what I see instead is that this job runs on part_PI_C

I have: 
PreemptMode=suspend
PreemptType=preempt/partition_prio
and
PartitionName=part_PI_A Priority=1000
PartitionName=part_PI_B Priority=1000
PartitionName=part_PI_C Priority=1000
PartitionName=owners Priority=100


So I guess my question is: is there a way to preempt a job even if there are some other nodes available?
Comment 22 Moe Jette 2014-08-15 12:03:52 MDT
(In reply to Kilian Cavalotti from comment #21)
> I'm getting there, things are looking good, but I have one more question.
> 
> I'm trying to implement my 2nd requirement, which is:
> - having members of each owner account primarily run on their nodes, 
> - allowing them to run on other owners' nodes when their own nodes are all
> busy,
> - preempt those jobs when the rightful owner wants to run.
> 
> To give a concrete example:
> - "part_PI_A", "part_PI_B" and "part_PI_C" are each 1 node partitions.
> - "owners" is a 3 nodes overflow partitions (part_PI_{A,B,C})
> - userA from PI_A submits 2 exclusive jobs with -p "part_PI_A,owners": the
> first one runs on part_PI_A, the second one runs on part_PI_B, which is what
> I want.
> - userB from PI_B submits a job with -p "part_PI_B,owners". I'd like this
> job to preempt userA's 2nd job which runs on part_PI_B. But what I see
> instead is that this job runs on part_PI_C
> 
> I have: 
> PreemptMode=suspend
> PreemptType=preempt/partition_prio
> and
> PartitionName=part_PI_A Priority=1000
> PartitionName=part_PI_B Priority=1000
> PartitionName=part_PI_C Priority=1000
> PartitionName=owners Priority=100
> 
> 
> So I guess my question is: is there a way to preempt a job even if there are
> some other nodes available?


I see what you are saying.
Slurm will start jobs without preemption if possible, so that is why userB's job runs on part_PI_C rather than preempting userA's job in part_PI_B.
Off the top of my head I can't think of a way to do what you want without code changes, but let me give the matter more thought.
Comment 23 Moe Jette 2014-08-18 05:14:35 MDT
(In reply to Moe Jette from comment #22)
> (In reply to Kilian Cavalotti from comment #21)
> > I'm getting there, things are looking good, but I have one more question.
> > 
> > I'm trying to implement my 2nd requirement, which is:
> > - having members of each owner account primarily run on their nodes, 
> > - allowing them to run on other owners' nodes when their own nodes are all
> > busy,
> > - preempt those jobs when the rightful owner wants to run.
> > 
> > To give a concrete example:
> > - "part_PI_A", "part_PI_B" and "part_PI_C" are each 1 node partitions.
> > - "owners" is a 3 nodes overflow partitions (part_PI_{A,B,C})
> > - userA from PI_A submits 2 exclusive jobs with -p "part_PI_A,owners": the
> > first one runs on part_PI_A, the second one runs on part_PI_B, which is what
> > I want.
> > - userB from PI_B submits a job with -p "part_PI_B,owners". I'd like this
> > job to preempt userA's 2nd job which runs on part_PI_B. But what I see
> > instead is that this job runs on part_PI_C
> > 
> > I have: 
> > PreemptMode=suspend
> > PreemptType=preempt/partition_prio
> > and
> > PartitionName=part_PI_A Priority=1000
> > PartitionName=part_PI_B Priority=1000
> > PartitionName=part_PI_C Priority=1000
> > PartitionName=owners Priority=100
> > 
> > 
> > So I guess my question is: is there a way to preempt a job even if there are
> > some other nodes available?
> 
> 
> I see what you are saying.
> Slurm will start jobs without preemption if possible, so that is why userB's
> job runs on part_PI_C rather than preempting userA's job in part_PI_B.
> Off the top of my head I can't think of a way to do what you want without
> code changes, but let me give the matter more thought.

This will definitely require some code changes, but they should be relatively minor. I'll try to get you something soon.
Comment 24 Kilian Cavalotti 2014-08-18 05:20:19 MDT
> This will definitely require some code changes, but they should be
> relatively minor. I'll try to get you something soon.

That would be excellent, thanks!
Comment 25 Moe Jette 2014-08-18 06:07:38 MDT
Created attachment 1145 [details]
start job in highest priority partition by preemption rather than using lower prio partition

This patch is designed to start a job in the highest priority partition possible, even if it requires preempting other jobs, rather than using a lower priority partition. I have not tested this extensively yet, but it seems to work fine in the testing I have done. Let me know how this works for you.
Comment 26 Kilian Cavalotti 2014-08-18 06:33:23 MDT
(In reply to Moe Jette from comment #25)
> Created attachment 1145 [details]
> start job in highest priority partition by preemption rather than using
> lower prio partition
> 
> This patch is designed to start a job in the highest priority partition
> possible, even if it requires preempting other jobs, rather than using a
> lower priority partition. I have not tested this extensively yet, but it
> seems to work fine in the testing I have done. Let me know how this works
> for you.

Thank you! I'm gonna test this.
Just to be completely sure: I've seen the patch touches the backfill plugin. Does that mean it needs to be deployed on compute nodes too?
Comment 27 Moe Jette 2014-08-18 06:35:31 MDT
(In reply to Kilian Cavalotti from comment #26)
> Thank you! I'm gonna test this.
> Just to be completely sure: I've seen the patch touches the backfill plugin.
> Does that mean it needs to be deployed on compute nodes too?

You only need to update the head node(s) with this patch.
Comment 28 Kilian Cavalotti 2014-08-18 07:06:30 MDT
(In reply to Moe Jette from comment #27)
> (In reply to Kilian Cavalotti from comment #26)
> > Thank you! I'm gonna test this.
> > Just to be completely sure: I've seen the patch touches the backfill plugin.
> > Does that mean it needs to be deployed on compute nodes too?
> 
> You only need to update the head node(s) with this patch.

Thanks.
And it seems to work great! I just submitted 2 jobs as user A to part_PI_A,owners, so the first ran on part_PI_A and the 2nd on part_PI_B. Then I made user B submit a job, and while there was still room available in the "owners" partition, it indeed preempted user A's job in part_PI_B. So that looks awesome.

Thanks a lot!

Is it a behavior you feel should be the default or do you consider making it a configuration option? Just asking to understand what our best long-term option is, regarding configuration: I currently rebuilt a locally patched RPM, but wonder if this would make it to the next version, so I don't need to maintain that local RPM.
Comment 29 Moe Jette 2014-08-18 07:09:34 MDT
(In reply to Kilian Cavalotti from comment #28)
> Is it a behavior you feel should be the default or do you consider making it
> a configuration option? Just asking to understand what our best long-term
> option is, regarding configuration: I currently rebuilt a locally patched
> RPM, but wonder if this would make it to the next version, so I don't need
> to maintain that local RPM.

I plan to make this the default behaviour and include it in version 14.03.7 when released, probably in a week or so.
Comment 30 Kilian Cavalotti 2014-08-18 07:12:52 MDT
> I plan to make this the default behaviour and include it in version 14.03.7
> when released, probably in a week or so.

Very good! So I won't have to maintain that specific patch here, thanks.
Comment 31 Moe Jette 2014-08-26 07:19:12 MDT
I'm going to close this. Please re-open or open a new trouble ticket as needed.