Hi, I'm looking for some guidance on implementing the following setup for our HPC cluster. We're trying to implement a condo model, where users can purchase their own compute nodes and be guaranteed priority access to those, while still being able to run on the generally available nodes.

We currently have a pool of shared compute nodes (the "normal" partition) that anybody can run on. In addition to that, we're deploying new groups of nodes that individual PIs purchased. We created a partition for each of those groups of nodes. Each PI has a Slurm "account", and several users are associated with those accounts.

So we have the following partitions:
- normal
- part_PI_A
- part_PI_B

And the following users/accounts:
- user_A1 and user_A2 under account acct_PI_A
- user_B1 and user_B2 under account acct_PI_B

So we're looking for:
1. a way to let user_A1 and user_A2 run mainly in partition part_PI_A, but also in the normal partition if part_PI_A is full. Same thing for PI B. So I guess we're looking for a per-user default partition setting. Does that exist?
2. a way to let users in acct_PI_A run on part_PI_B if those nodes are idle, and to preempt their jobs when users in acct_PI_B submit jobs to run on their own partition. Is there a way to achieve this?

Thanks a lot for any insight you could provide.
The more partitions that you create, the more difficult it will be to keep each of them fully utilized. So while I would generally recommend against the condo model, Slurm can support it. Here is some guidance to make the best of it; feel free to ask follow-up questions.

- You can submit jobs to multiple partitions. The job will get scheduled in whichever partition permits it to start earliest. Partitions with a higher Priority value get tested first.
- While Slurm does not support per-user default partitions, you can use a job_submit plugin for that.
- You can establish preemption rules on a per-partition or per-QOS basis.
- You can create overlapping partitions. For example, the nodes in partition part_PI_A can also be in partition normal.
- You can assign nodes a "weight" so that nodes with lower weights get used before those with a higher weight within a partition.

============================================================
Here is a suggested configuration (making up node names):

JobSubmitPlugins=stanford   # described below
PreemptMode=requeue
PreemptType=preempt/partition_prio

PartitionName=normal    Nodes=tux[000-099] Weight=20 Default=YES
PartitionName=part_PI_A Nodes=tux[100-199] Weight=30 Default=NO AllowGroups=PI_A
PartitionName=part_PI_B Nodes=tux[200-299] Weight=30 Default=NO AllowGroups=PI_B
PartitionName=overflow  Nodes=tux[100-299] Weight=10 Default=NO AllowGroups=PI_A,PI_B
============================================================

What your job submit plugin would do is:
1. read some configuration information about who should be able to access the various partitions
2. if the user specifies some partition, just return
3. for users in acct_PI_A, set the partition parameter to "part_PI_A,normal,overflow"
4. for users in acct_PI_B, set the partition parameter to "part_PI_B,normal,overflow"

============================================================
For more information, please see:
http://slurm.schedmd.com/preempt.html
http://slurm.schedmd.com/job_submit_plugins.html
http://slurm.schedmd.com/slurm.conf.html
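The routing logic such a job_submit plugin would implement can be sketched as follows. This is only an illustration in Python (a real plugin would be written in C, or in Lua via job_submit/lua); the function name `route_job` and the account-to-partition table are made up, while the account and partition names come from this thread:

```python
# Illustrative sketch of the job_submit routing described above.
# NOT a real Slurm plugin: it just models the decision the plugin makes
# (honor an explicit partition request, else route by account).

ACCOUNT_PARTITIONS = {
    "acct_PI_A": "part_PI_A,normal,overflow",
    "acct_PI_B": "part_PI_B,normal,overflow",
}

def route_job(account, requested_partition=None):
    """Return the partition list a job should be submitted to."""
    if requested_partition:
        # Step 2 above: the user specified a partition, just return it.
        return requested_partition
    # Steps 3-4: fall back to the per-account list, or the cluster default.
    return ACCOUNT_PARTITIONS.get(account, "normal")
```

With this ordering, owned nodes are tried first, then the shared pool, then the low-weight overflow partition.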
Hi Moe,

Thanks a lot for the guidance, much appreciated. I have additional questions below.

(In reply to Moe Jette from comment #1)
> The more partitions that you create, the more difficult it will be to keep
> each of them fully utilized. So while I would generally recommend against
> the condo model, Slurm can support it.

Besides partitions, what would you recommend to better handle this "shared pool"/"owned nodes" scheme?

> You can submit jobs to multiple partitions. The job will get scheduled in
> whichever partition permits it to start earliest.

I did a quick test with 2 partitions, and it doesn't seem to work as I expected. It works fine with sbatch, but not with srun.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test         up   infinite      2   idle sh-5-[33-34]
owned        up 7-00:00:00      1   idle sh-7-22

$ sbatch -p owned,test -n 16 --wrap="sleep 10000"
Submitted batch job 171344
$ sbatch -p owned,test -n 16 --wrap="sleep 10000"
Submitted batch job 171346
$ squeue -u kilian
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 171344     owned   sbatch   kilian  R       0:14      1 sh-7-22
 171346      test   sbatch   kilian  R       0:01      1 sh-5-33

But with srun, the 2nd job waits with (Resources):

$ srun -p owned,test -n 16 sleep 10000
sh-7-22:~$

and in another terminal:

$ srun -p owned,test -n 16 sleep 10000
srun: job 171348 queued and waiting for resources

$ squeue -u kilian
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 171347     owned    sleep   kilian  R       0:29      1 sh-7-22
 171348 owned,tes    sleep   kilian PD       0:00      1 (Resources)

Is that expected?

> Partitions with a higher
> Priority value get tested first.
>
> While Slurm does not support per-user default partitions, you can use a
> job_submit plugin for that.

What about setting a SLURM_PARTITION env variable per-user? SLURM_PARTITION=part_PI_A,normal would submit to part_PI_A if possible, then to normal, and could still be overridden by the --partition option on the command line, right? Preliminary tests seem to indicate that this could work as a per-user default partition, but I'm not sure about potential side effects.

Thanks a lot for all the other pointers too, I'll look into those options.
See responses in-line below.

(In reply to Kilian Cavalotti from comment #2)
> Besides partitions, what would you recommend to better handle this "shared
> pool"/"owned nodes" scheme?

How about fair-share scheduling? That would permit you to set a target allocation of the entire system on a per-account and per-user basis. For more information, see:
http://slurm.schedmd.com/priority_multifactor.html

> I did a quick test with 2 partitions, and it doesn't seem to work as I
> expected. It seems to work fine with sbatch, but not with srun.
> [...]
> Is that expected?

No. I'll need to investigate and get back to you later on this.

> What about setting a SLURM_PARTITION env variable per-user?
> SLURM_PARTITION=part_PI_A,normal would submit in part_PI_A if possible, then
> in normal, and could still be overridden by --partition option in the
> command-line, right? Preliminary test seem to indicate that this could work
> as a per-user default partition, but I'm not sure about potential
> side-effects

That will work fine. The command line option does override environment variables.

> Thanks a lot for all the other pointers too, I'll look into those options.
> How about fair share scheduling?
> That would permit you to set a target allocation of the entire system on a
> per account and user basis. for more information, see:
> http://slurm.schedmd.com/priority_multifactor.html

Well, the thing is that we don't only want owners to get privileged access to some compute resources, we want them to get exclusive access to *their* physical nodes, because those can have different hardware configurations, such as memory amount, local storage options, and so on. That's why mapping users to partitions using AllowGroups seemed to make sense.

Would fair share allow us to prevent users from running on specific nodes? (We don't want the general population to run on owners' nodes.)

> No. I'll need to investigate and get back to you later on this.

Thanks. Do you need me to open a specific ticket for this?

> That will work fine. The command line option does override environment
> variables.

Great! Thanks again.
(In reply to Kilian Cavalotti from comment #4)
> Well, the thing is that we don't only want owners to get a privileged access
> to some compute resources, we want them to get exclusive access to *their*
> physical nodes. [...]
>
> Would fair share allow to prevent users to run on specific nodes? (we don't
> want the general population to run on owners' nodes)

No. The scheme we are working on with multiple partitions would be the best way to accomplish that.

> > No. I'll need to investigate and get back to you later on this.
>
> Thanks. Do you need me to open a specific ticket for this?

No need for a new ticket. I was able to replicate this and have a fix for you. The change will be in version 14.03.7 (when released, probably next week), or you can use the patch here:
https://github.com/SchedMD/slurm/commit/e941364920a5f910144d2564ee14f91a61c1b3cb.patch
(In reply to Moe Jette from comment #5)
> No, The scheme we are working on with multiple partitions would be best to
> accomplish that.

OK, so I have requirement 1 working by setting SLURM_PARTITION for users in owners accounts, plus your srun patch. Now I'm going to experiment with preemption for requirement 2.

By the way, what happens if I mix nodes with different amounts of memory in the same partition?

> No need for a new ticket. I was able to replicate this and have a fix for
> you. The change will be in version 14.03.7 ( when released probably next
> week) or you can use the patch here:
> https://github.com/SchedMD/slurm/commit/
> e941364920a5f910144d2564ee14f91a61c1b3cb.patch

That was fast, thank you!
(In reply to Kilian Cavalotti from comment #6)
> Now I'm gonna experiment with preemption for requirement 2. BTW, what
> happens if I mix nodes with different amounts of memory in the same
> partition?

No problem. I would recommend using the node "Weight" configuration option so that nodes with smaller memory sizes are used first and the larger memory nodes tend to be saved for jobs that need them.
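To illustrate the Weight recommendation, here is a hedged slurm.conf sketch (node names, memory sizes, and the partition name "big" are made up): lower-weight nodes are selected first within a partition, so the large-memory nodes stay free for jobs that actually need them.

```
# Hypothetical nodes: tux[300-349] have 64 GB, tux[350-359] have 256 GB.
NodeName=tux[300-349] RealMemory=64000  Weight=10
NodeName=tux[350-359] RealMemory=256000 Weight=50
# Both groups can live in the same partition; Slurm prefers Weight=10 nodes.
PartitionName=big Nodes=tux[300-359] Default=NO
```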
I have one more question: I'm trying to submit to multiple partitions, with the overflow partition concept you mentioned in #c1.

I use sbatch -p part_PI_A,overflow,normal, but it looks like the job only gets to part_PI_A and overflow. Slurmctld logs the following:

_valid_job_part: can't check multiple partitions with partition based associations

I'm not sure what it means, could you please give me some details?

I indeed have partition-based associations; our current setup is, for each user (empty limit columns trimmed):

   Cluster    Account       User  Partition     Share                  QOS   Def QOS
---------- ---------- ---------- ---------- --------- -------------------- ---------
   cluster      accnt       user     normal         1          long,normal
   cluster      accnt       user        dev         1                  dev
   cluster      accnt       user                    1               normal

Thanks.
(In reply to Kilian Cavalotti from comment #8)
> I use sbatch -p part_PI_A,overflow,normal, but it looks like the job only
> gets to part_PI_A and overflow. Slurmctld logs the following:
>
> _valid_job_part: can't check multiple partitions with partition based
> associations
>
> I'm not sure what it means, could you please give me some details?

Are you establishing different limits or fair-share for users depending upon the partition they run in?

If no, there is no reason to configure a separate association for each partition. If yes, then you will need to configure an association for the overflow partition as well.

Given my limited understanding of your environment, I'm not sure that separate limits or fair-share by partition benefit you.
Hi Moe,

(In reply to Moe Jette from comment #9)
> Are you establishing different limits or fair-share for users depending upon
> the partition they run in?

Yes. Here is some context:

- We have some general limits we want to impose on all users (MaxCPUsPU, MaxJobsPU, MaxWall...), so we have a "normal" QOS.

- We want users to be able to quickly get an interactive shell on a compute node for compiling, debugging, or testing code. To that end, we dedicated a couple of nodes and created a partition (dev) for them. To give everyone a chance to get a shell when they need it, we also wanted to limit the number of jobs and tasks that a user can submit to those nodes, so we created a QOS (dev) that we mapped to that partition.

- We have a MaxWall limit of 2 days in the normal QOS, but we also have some users who need to run jobs for a longer time. So we added a "long" QOS with a MaxWall of 7 days, so that users who want to use it explicitly need to request it. We also don't want long jobs to run on our special-purpose nodes (large-memory and GPU nodes), so we only want this QOS to be usable on the "normal" partition.

So we allow:
- dev QOS on the dev partition
- normal and long QOS on the normal partition
- normal QOS on all (the other) partitions

If you have other recommendations to achieve this kind of setup, I'm open to suggestions.

> If no, there is no reason to configure a separate association for each
> partition.
> If yes, they you will need to configure an association for the overflow
> partition as well.

Got it, I'll do that.

> Given my limited understanding of your environment, I'm not sure that
> separate limits or fair-share by partition benefits you.

I hope the explanation above makes some sense. If not, I can provide the conf files and database outputs if you want.
(In reply to Kilian Cavalotti from comment #10)
> So we allow:
> - dev QOS on dev partition
> - normal and long QOS on normal partition
> - normal QOS on all (the other) partitions
>
> If you have other recommendations to achieve this kind of setup, I'm open to
> suggestions.

That seems like a reasonable approach. The job_submit plugin that I previously referenced may be helpful to set a QOS that matches the partition name if desired (i.e. dev).
(In reply to Moe Jette from comment #11)
> That seems like a reasonable approach.
> The job_submit plugin that I previously referenced may be helpful to set QOS
> that match partition name if desired (i.e. dev)

Well, it seems that having partition-based associations prevents submitting to multiple partitions, is that correct?

In my limited testing, if I have a QOS associated with each of the partitions and I submit a job to multiple partitions, I get "_valid_job_part: can't check multiple partitions with partition based associations" in the log, and "scontrol show job" seems to indicate that the job is only submitted to the first partition listed.

# sacctmgr list assoc user=kilian format=User,Partition,QOS
      User  Partition            QOS
---------- ---------- --------------
    kilian        gpu         normal
    kilian      owned         normal
    kilian     normal    long,normal
    kilian        dev            dev
    kilian  part_PI_A         normal

$ sbatch -p "part_PI_A,owned,normal" -n 16 -N1 --wrap="sleep 10000"
Submitted batch job 172082
$ sbatch -p "part_PI_A,owned,normal" -n 16 -N1 --wrap="sleep 10000"
Submitted batch job 172083
$ squeue -u kilian
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 172083 part_PI_A   sbatch   kilian PD       0:00      1 (None)
 172082 part_PI_A   sbatch   kilian  R       0:02      1 sh-7-22

$ scontrol show job 172083
JobId=172083 Name=sbatch
   UserId=kilian(215845) GroupId=ruthm(32264)
   Priority=10211 Nice=0 Account=ruthm QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2014-08-13T15:56:54 EligibleTime=2014-08-13T15:56:54
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> Partition=part_PI_A AllocNode:Sid=sh-ln02:28027
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/kilian
   StdErr=/home/kilian/slurm-172083.out
   StdIn=/dev/null
   StdOut=/home/kilian/slurm-172083.out

So unless there's a way to submit to multiple partitions with partition-based associations, I guess I'll have to remove those partition-based associations and develop a job_submit plugin instead.
(In reply to Kilian Cavalotti from comment #12)
> Well, it seems that having partition based associations will prevent to
> submit to multiple partitions, is this correct?

I was not expecting that, but have confirmed this behaviour. I'll check with the author to see if he has other suggestions, but a job_submit plugin is looking more attractive right now.

The relevant bit of code from src/slurmctld/job_mgr.c:

	while ((part_ptr_tmp = (struct part_record *) list_next(iter))) {
		/* FIXME: When dealing with multiple partitions we
		 * currently can't deal with partition based
		 * associations. */
		memset(&assoc_rec, 0, sizeof(slurmdb_association_rec_t));
		if (assoc_ptr) {
			assoc_rec.acct      = assoc_ptr->acct;
			assoc_rec.partition = part_ptr_tmp->name;
			assoc_rec.uid       = job_desc->user_id;
			assoc_mgr_fill_in_assoc(
				acct_db_conn, &assoc_rec,
				accounting_enforce, NULL);
		}
		if (assoc_ptr && assoc_rec.id != assoc_ptr->id) {
			info("_valid_job_part: can't check multiple "
			     "partitions with partition based "
			     "associations");
			rc = SLURM_ERROR;
> I was not expecting that, but have confirmed this behaviour. I'll check with
> the author to see if he has other suggestions,

Aah, I see there's a FIXME in the code, so hopefully it can be fixed. :)

> but a job_submit plugin is
> looking more attractive right now.

I'm willing to explore that path too, but I kind of wanted to stay away from it as long as possible, in order to limit site-specific customizations (esp. hard-coding partition and account names in the plugin), so we can keep our specifics in the conf files and the database. So I'm definitely interested in what Danny could say about this.

Thanks again for all the info!
Kilian, the bad news is that this FIXME is quite involved and most likely wouldn't be worth the effort. The good news is that there are a couple of other (in my opinion) more attractive options than using partition-based associations in this setup. Hopefully one of these will work for you.

1. Don't use partition-based associations at all. Given your setup, I don't see a reason they are needed (yet). You already put most of the limits you need in a QOS, so it doesn't appear you need association-based limits; this is good. Having multiple associations per user would only prove useful to limit the QOS that can be used in partitions (it appears). If that is the case, you can limit the QOS allowed in a partition with the AllowQOS partition option. You can write a simple job_submit plugin that sets the QOS appropriately if you don't want to force your users to do it every time. This will let you keep just one association per user/account, and probably give you better accounting, since right now your accounting is based primarily on partition.

2. Only have a partition association for the dev partition, and use the normal/non-partition association for all other partitions. This will give you the desired multiple-partition submission behaviour as long as "dev" isn't one of the partitions, since the FIXME code only comes into play when the default association isn't used. This method will at least fix this scenario, but I am hoping we don't have to use it unless there is a reason I am forgetting/not understanding.

Let me know if one of these options will work for you. Partition-based associations are sort of a pain to work with and I rarely advise people to use them, so I would opt for option 1 if possible. If these don't work out, please let me know what they are missing and we can work on other options to get the partition associations out, or at least minimize their use.
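The QOS-defaulting step of the simple job_submit plugin suggested in option 1 can be sketched as follows. Again this is Python for illustration only (a real plugin would be C or Lua); the helper name `default_qos` and the mapping table are made up, while the partition and QOS names come from this thread:

```python
# Sketch of option 1's job_submit logic: derive a default QOS from the
# (first) requested partition so users need not pass --qos themselves.

DEFAULT_QOS = {
    "dev": "dev",   # the dev partition gets the dev QOS
    # every other partition falls back to the "normal" QOS
}

def default_qos(partition_spec, requested_qos=None):
    """Return the QOS for a job, honoring an explicit --qos request."""
    if requested_qos:
        return requested_qos
    first = partition_spec.split(",")[0]  # first partition in the list
    return DEFAULT_QOS.get(first, "normal")
```

Users who need the "long" QOS would still request it explicitly, and AllowQOS on each partition enforces where each QOS may run.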
> 1. Don't use partition based associations at all. Given your setup I don't
> see a reason they are needed (yet). [...] Having multiple associations for
> users would only prove useful to limit the QOS that can be used in
> partitions (it appears).

That's right, that's the only way we found to statically "map" QOSes to partitions without having users specify the QOS at submission time. What we were looking for, really, is a way to set, for instance, MaxCPUsPU for a partition. Generally speaking, we tend to have different limits on each partition, so we were trying to find a way to set a default QOS per partition (so that users don't have to add --qos to their submission command), and to limit the QOSes that could be used in each partition.

> If this is the case you can limit the QOS that is
> allowed in the partition with the AllowQOS partition option.

That would help, I missed this option, thanks.

> You can write
> a simple job submit plugin that will set the QOS appropriately if you don't
> want to force your users to do it every time.

That's the part I wanted to avoid (see my previous comment) to limit site-specific customizations, but it looks like it would be the best way to achieve our goals anyway. So I'm going to explore this.

> This will allow you to only
> have the one association per user/account and probably give you better
> accounting since right now you would get accounting based primarily on
> partition.

Right, it's a bit of a mess right now. :) We basically have 3 associations per user in the DB, so reconciling all of them for global accounting takes extra effort.

> 2. Only have a partition association for the dev partition, use the
> normal/non-partition association for all other partitions. This will allow
> you to get the desired behaviour of multiple-partition submission as long as
> "dev" isn't one of the partitions.

Well, in our current setup, we have:
1. "dev" QOS only on the "dev" partition and nothing else
2. "long" QOS on the "normal" partition but nowhere else
3. "normal" QOS on all partitions (incl. "normal") except "dev"

So maybe we can get 2 with the AllowQOS partition option.

> Let me know if one of these options will work for you. The partition based
> associations are sort of a pain to work with and I will rarely advise people
> to use them, so I would opt for option 1 if possible.

I'm going to experiment a bit and see how it goes. I'll definitely let you know what we end up with. Do you mind keeping this ticket open for a few more days, in case I have more questions?

Anyway, thanks a ton to both of you for your great feedback and advice!
(In reply to Kilian Cavalotti from comment #16)
> Do you mind keeping this ticket open for a few more days, in case I have
> more questions?

No problem.

> Anyway, thanks a ton to both of you for your great feedback and advice!

That's what we are here for.
I now have a weird issue I can't seem to diagnose. I modified my setup to set AllowGroups and {Allow,Deny}QOS on partitions instead of using partition-based associations. So I now have in my slurm.conf file:

PartitionName=normal \
   Default=YES \
   AllowQos=normal,long \
   nodes=...
PartitionName=dev \
   AllowQos=dev \
   nodes=...
PartitionName=part_PI_A \
   nodes=...
PartitionName=owners \
   nodes={all the nodes that are also in part_PI_X partitions}

And my associations are now one per user, in the form of:

   Account       User  Partition               QOS   Def QOS
---------- ---------- ---------- ----------------- ---------
      PI_A     kilian              dev,long,normal    normal

That works fine, except when I define SLURM_PARTITION for the user so that the job is submitted to multiple partitions: sbatch jobs seem to be submitted to the default partition no matter what. Using --partition on the command line works as expected, though.

$ export SLURM_PARTITION=part_PI_A,owners,normal
$ sbatch -vvv -n 16 -N 1 --wrap="sleep 10000"
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `kilian'
sbatch: uid               : 215845
sbatch: gid               : 32264
sbatch: cwd               : /home/kilian
sbatch: ntasks            : 16 (set)
sbatch: nodes             : 1-1
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : default   <<<< Not what I expect
sbatch: profile           : `NotSet'

So the job doesn't make it to the part_PI_A partition at all.

With --partition, it works fine:

$ sbatch -vvv -p part_PI_A,owners,normal -n 16 -N1 --wrap="sleep 10000"
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `kilian'
sbatch: uid               : 215845
sbatch: gid               : 32264
sbatch: cwd               : /home/kilian
sbatch: ntasks            : 16 (set)
sbatch: nodes             : 1-1
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : part_PI_A,owners,normal
sbatch: profile           : `NotSet'

With SLURM_PARTITION and srun, it works too:

$ export SLURM_PARTITION=part_PI_A,owners,normal
$ srun --pty bash
srun: job 173698 queued and waiting for resources
^Z
[1]+  Stopped                 srun --pty bash
$ scontrol show job 173698
JobId=173698 Name=bash
[...]
   Partition=part_PI_A,owners,normal AllocNode:Sid=sh-ln01:35736
[...]

Any idea what I may be missing? Thanks.
(In reply to Kilian Cavalotti from comment #18)
> $ export SLURM_PARTITION=part_PI_A,owners,normal
> $ sbatch -vvv -n 16 -N 1 --wrap="sleep 10000"
> [...]
> sbatch: partition         : default   <<<< Not what I expect

Each of the resource allocation commands uses a different env var:

sbatch/opt.c:      {"SBATCH_PARTITION", OPT_STRING, &opt.partition,
salloc/opt.c:      {"SALLOC_PARTITION", OPT_STRING, &opt.partition, NULL },
srun/libsrun/opt.c:{"SLURM_PARTITION", OPT_STRING, &opt.partition, NULL },

That offers a bit more flexibility than using a single env var.
> Each of the resource allocation commands uses a different env var:
> sbatch/opt.c:      {"SBATCH_PARTITION", OPT_STRING, &opt.partition,
> salloc/opt.c:      {"SALLOC_PARTITION", OPT_STRING, &opt.partition, NULL },
> srun/libsrun/opt.c:{"SLURM_PARTITION", OPT_STRING, &opt.partition, NULL },
>
> That offers a bit more flexibility than using a single env var.

Aaah, thanks! It makes a lot of sense, but I was inclined to believe that SLURM_PARTITION would cover them all (as opposed to something like SRUN_PARTITION, for instance); that's just because I didn't look at the salloc and sbatch man pages. Thanks.
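Putting the three variables together, a per-user default partition list could be set for all three submission commands in one shell-profile snippet (a sketch: the partition names follow this thread, and dropping it in, e.g., a per-group file under /etc/profile.d/ is just one way to deploy it):

```shell
# Per-user default partition list for sbatch, salloc and srun.
# Each command reads its own variable; an explicit --partition on the
# command line still overrides all of these.
export SBATCH_PARTITION=part_PI_A,owners,normal
export SALLOC_PARTITION=part_PI_A,owners,normal
export SLURM_PARTITION=part_PI_A,owners,normal
```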
I'm getting there, things are looking good, but I have one more question.

I'm trying to implement my 2nd requirement, which is:
- having members of each owner account primarily run on their own nodes,
- allowing them to run on other owners' nodes when their own nodes are all busy,
- preempting those jobs when the rightful owner wants to run.

To give a concrete example:
- "part_PI_A", "part_PI_B" and "part_PI_C" are each 1-node partitions.
- "owners" is a 3-node overflow partition (part_PI_{A,B,C}).
- userA from PI_A submits 2 exclusive jobs with -p "part_PI_A,owners": the first one runs on part_PI_A, the second one runs on part_PI_B, which is what I want.
- userB from PI_B submits a job with -p "part_PI_B,owners". I'd like this job to preempt userA's 2nd job, which runs on part_PI_B. But what I see instead is that this job runs on part_PI_C.

I have:

PreemptMode=suspend
PreemptType=preempt/partition_prio

and

PartitionName=part_PI_A Priority=1000
PartitionName=part_PI_B Priority=1000
PartitionName=part_PI_C Priority=1000
PartitionName=owners    Priority=100

So I guess my question is: is there a way to preempt a job even if there are some other nodes available?
(In reply to Kilian Cavalotti from comment #21)
> So I guess my question is: is there a way to preempt a job even if there
> are some other nodes available?

I see what you are saying. Slurm will start jobs without preemption if possible, which is why userB's job runs on part_PI_C rather than preempting userA's job in part_PI_B. Off the top of my head I can't think of a way to do what you want without code changes, but let me give the matter more thought.
(In reply to Moe Jette from comment #22)
> Off the top of my head I can't think of a way to do what you want without
> code changes, but let me give the matter more thought.

This will definitely require some code changes, but they should be relatively minor. I'll try to get you something soon.
> This will definitely require some code changes, but they should be
> relatively minor. I'll try to get you something soon.

That would be excellent, thanks!
Created attachment 1145 [details]
start job in highest priority partition by preemption rather than using lower prio partition

This patch is designed to start a job in the highest priority partition possible, even if it requires preempting other jobs, rather than using a lower priority partition. I have not tested this extensively yet, but it seems to work fine in the testing I have done. Let me know how this works for you.
(In reply to Moe Jette from comment #25)
> This patch is designed to start a job in the highest priority partition
> possible, even if it requires preempting other jobs, rather than using a
> lower priority partition.

Thank you! I'm going to test this. Just to be completely sure: I've seen that the patch touches the backfill plugin. Does that mean it needs to be deployed on compute nodes too?
(In reply to Kilian Cavalotti from comment #26)
> I've seen that the patch touches the backfill plugin. Does that mean it
> needs to be deployed on compute nodes too?

You only need to update the head node(s) with this patch.
(In reply to Moe Jette from comment #27)
> You only need to update the head node(s) with this patch.

Thanks. And it seems to work great!

I just submitted 2 jobs as user A to part_PI_A,owners, so the first ran on part_PI_A and the 2nd on part_PI_B. Then I made user B submit a job, and while there was still room available in the "owners" partition, it indeed preempted user A's job in part_PI_B. So that looks awesome. Thanks a lot!

Is it a behavior you feel should be the default, or do you consider making it a configuration option? Just asking to understand what our best long-term option is regarding configuration: I currently rebuilt a locally patched RPM, but wonder if this would make it into the next version, so I don't need to maintain that local RPM.
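For anyone reproducing this later, the verification run described above boils down to three submissions. A sketch under the assumptions of this thread (the partition names, the patched slurmctld, and that the first two submissions come from user A and the third from user B), wrapped in a hypothetical helper function so nothing runs until it is invoked on a real cluster:

```shell
# Hypothetical helper reproducing the preemption check described above.
# Run the first two sbatch calls as user A and the third as user B.
preempt_check() {
    # user A: first job lands on part_PI_A, second overflows to part_PI_B
    sbatch --exclusive -p part_PI_A,owners --wrap="sleep 10000"
    sbatch --exclusive -p part_PI_A,owners --wrap="sleep 10000"
    # user B: with the patch, this suspends user A's overflow job instead
    # of starting on the still-idle part_PI_C node
    sbatch --exclusive -p part_PI_B,owners --wrap="sleep 10000"
    # user A's second job should now show state "S" (suspended)
    squeue -o "%.18i %.12P %.8u %.2t"
}
```

With PreemptMode=suspend, the preempted job stays in the queue in suspended state and resumes once the owner's job finishes.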
(In reply to Kilian Cavalotti from comment #28)
> Is it a behavior you feel should be the default, or do you consider making
> it a configuration option?

I plan to make this the default behaviour and include it in version 14.03.7 when released, probably in a week or so.
> I plan to make this the default behaviour and include it in version 14.03.7
> when released, probably in a week or so.

Very good! So I won't have to maintain that specific patch here. Thanks.
I'm going to close this. Please re-open or open a new trouble ticket as needed.