Bug 4512 - Preparing for federation
Summary: Preparing for federation
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Federation
Version: 17.11.0
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-12-13 08:25 MST by NASA JSC Aerolab
Modified: 2018-02-27 09:03 MST

See Also:
Site: Johnson Space Center
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Current configuration of our existing cluster. (4.29 KB, text/plain)
2017-12-13 08:25 MST, NASA JSC Aerolab
job submit plugin (3.98 KB, text/plain)
2018-02-05 15:57 MST, NASA JSC Aerolab

Description NASA JSC Aerolab 2017-12-13 08:25:33 MST
Created attachment 5730 [details]
Current configuration of our existing cluster.

Hello,

We are about to install a second cluster and we want to utilize the new federation features.  I would like your help to prepare our existing cluster for federation and to make sure the use case I have in mind will work.  

I've read through the federation documentation and the SLUG 17 presentation.  To prepare our existing cluster, all I'm really doing is upgrading to 17.11.  We are currently running 16.05.10.  I don't really see anything else to do, but please let me know if I missed something.

One question comes to mind that is not clear to me.  Do I run slurmdbd on the 2nd cluster or is there only one slurmdbd?  I think it's just one slurmdbd (say on the original cluster) but please confirm.

Finally, here is the use case I have in mind.  First, some background.  We mostly run MPI jobs on our current cluster.  I'll attach our full slurm.conf but here are the basics.  

NodeName=r1i[0-3]n[0-15] Procs=12 Weight=1 Feature=wes
NodeName=r2i[0-3]n[0-15] Procs=12 Weight=1 Feature=wes
NodeName=r3i[0-3]n[0-15] Procs=12 Weight=2 Feature=wes
NodeName=r4i[0-3]n[0-17] Procs=16 Weight=3 Feature=san
NodeName=r4i[4-7]n[0-17] Procs=24 Weight=4 Feature=has
NodeName=r5i[0-3]n[0-17] Procs=24 Weight=5 Feature=bro

PartitionName=normal Nodes=r[1-2]i[0-3]n[0-15],r3i[0-2]n[0-15],r3i3n[0-7],r4i[0-7]n[0-17],r5i[0-3]n[0-17] Priority=10000 DefaultTime=04:00:00 MaxTime=08:00:00 State=UP OverSubscribe=Exclusive Default=YES
PartitionName=idle   Nodes=r[1-2]i[0-3]n[0-15],r3i[0-2]n[0-15],r3i3n[0-7],r4i[0-7]n[0-17],r5i[0-3]n[0-17] Priority=10    DefaultTime=04:00:00 MaxTime=08:00:00 State=UP OverSubscribe=Exclusive
PartitionName=long   Nodes=r[1-2]i[0-3]n[0-15],r3i[0-2]n[0-15],r3i3n[0-7],r4i[0-7]n[0-17],r5i[0-3]n[0-17] Priority=10000 DefaultTime=04:00:00 MaxTime=24:00:00 State=UP OverSubscribe=Exclusive AllowGroups=longque
PartitionName=debug  Nodes=r3i3n[8-13]                                                                    Priority=10000 DefaultTime=01:00:00 MaxTime=01:00:00 State=UP OverSubscribe=Exclusive MaxNodes=6 MinNodes=1


So we have 4 generations of processors, and since we want each job to run on a consistent processor type, almost all our jobs are submitted with this constraint:

#SBATCH --constraint=[bro|has|san|wes]

This works great to start the job on whichever processor type has enough free nodes.  The new cluster will initially have only "bro" (Broadwell) nodes.  The ideal situation for us would be to be able to submit jobs on either cluster and have them run on whichever is free.  This statement, under the Limitations section in the federation man page, worries me a little:

A federated job that fails due to resources (partition, node counts, etc.) on the local cluster will be rejected and won't be submitted to other sibling clusters even if it could run on them.

Will a job submitted on the new cluster with the above constraint fail since there are no has, san or wes nodes on that cluster?

As a slight extension to that, when we get new node processor types for the new cluster (say with the "ivy" feature), we'd still like to be able to submit jobs like this on either cluster:

#SBATCH --constraint=[ivy|bro|has|san|wes]

I'd appreciate your feedback on this.
Comment 1 Brian Christiansen 2017-12-13 14:14:35 MST
Hi.

Database question:
It's just one database. The database is used to create a federation which you add clusters to.
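
(For reference, a rough sketch of how a federation is created through sacctmgr once both clusters report to that single slurmdbd; "fed1" and the cluster names here are placeholders -- see https://slurm.schedmd.com/federation.html for the full procedure.)

sacctmgr add federation fed1 clusters=cluster1,cluster2
sacctmgr show federation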

Constraint question:
The current implementation assumes all systems in the federation are largely identical. We hope to address this in future versions.

One workaround is to use the -M/--clusters option to choose which cluster(s) the job may be submitted to. This functionality existed before federation. When the --clusters option is used, the submission command asks each cluster for the earliest time the job could start -- if it can run there at all -- and then submits the job to the cluster with the earliest start time. So in your case this could help choose the cluster that has the correct features.

For example, I have two clusters, c1 with feature f1 and c2 with feature f2.

If I submit to c1 then the job will fail.
c1$ sbatch -Cf2 --wrap="sleep 60" -N10 --exclusive
sbatch: error: Batch job submission failed: Requested node configuration is not available

If I submit to just c2 it will succeed.
c1$ sbatch -Mc2 -Cf2 --wrap="sleep 60" -N10 --exclusive
Submitted batch job 134420992 on cluster c2

If I say it can run on c1 or c2, c2 will be chosen.
c1$ sbatch -Mc1,c2 -Cf2 --wrap="sleep 60" -N10 --exclusive
sbatch: error: Problem with submit to cluster c1: Requested node configuration is not available
Submitted batch job 134420994 on cluster c2


Let us know if you have any other questions.

Thanks,
Brian
Comment 2 Brian Christiansen 2017-12-19 11:02:15 MST
Do you have any other questions regarding this subject?

If you haven't tried the pre-existing multi-cluster operations before, I would recommend playing with those first to see how far they get you toward your goal, and then seeing what federation can add on top of that.

Thanks,
Brian
Comment 3 NASA JSC Aerolab 2017-12-19 11:21:57 MST
No, you can close this.  Thanks.
Comment 4 Brian Christiansen 2017-12-19 11:28:55 MST
Thanks. Let us know if you have any other questions.
Comment 5 NASA JSC Aerolab 2017-12-19 11:35:32 MST
I'm sure we will once we start trying to actually use federation... :)
Comment 6 NASA JSC Aerolab 2018-01-19 10:34:31 MST
I'm to the point where I have our second cluster configured and can do some basic tests.  The original cluster I described is called l1.  The new cluster is europa.  Currently, europa only has "bro" nodes:

NodeName=r1i[0-1]n[0-35] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Procs=24 RealMemory=257594 Weight=5 Feature=fdr,bro,mem10,bigmem

So I tried doing some basic tests with -M.  Note that the login node on l1 is called service1.  The login node on europa is actually called europa.  

[dvicker@service1 slurm]% cat multi_cluster.csh 
#! /bin/csh 

#SBATCH --job-name=slurm_test
#SBATCH --time=1:00:00
#SBATCH --constraint=bro
#SBATCH -N 2
#SBATCH -M all

echo "hostnames" 
scontrol show hostnames
[dvicker@service1 slurm]% 


This returns instantly:

[dvicker@service1 slurm]% sbatch multi_cluster.csh 
Submitted batch job 127462 on cluster l1
[dvicker@service1 slurm]% 

But on Europa, it apparently also tries to submit to L1 first.  This takes ~30 seconds to time out on L1, and then the job is submitted to europa.

[dvicker@europa slurm]% sbatch multi_cluster.csh 
sbatch: error: Problem with submit to cluster l1: Unable to contact slurm controller (connect failure)
Submitted batch job 78 on cluster europa
[dvicker@europa slurm]%



This is because of the IP address of the hosts:

[root@service1 ~]# sacctmgr show clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
    europa   192.52.98.128         6817  8192         1                                                                                           normal           
        l1     10.148.2.14         6817  8192         1                                                                                           normal           
[root@service1 ~]# 

See Bug 4569 for more details on our networks, but the 192.52.98.0/24 subnet is ethernet that is common to the login nodes on both machines.  The 10.148.0.0/16 subnet is the IB subnet on L1 (not visible to Europa).  The 10.150.0.0/16 subnet is the IB subnet on Europa (not visible to L1).  So it looks like we need to change the control host of L1 to the ethernet address.  I'm not sure how they got set to what they were in the first place.  In our conf files we have the following:

slurmdbd.conf:
DbdAddr=10.148.2.14
DbdHost=service1

L1's slurm.conf:
ClusterName=L1
ControlMachine=service1
ControlAddr=10.148.2.14
BackupController=service0
BackupAddr=10.148.2.13
AccountingStorageHost=service1


Europa's slurm.conf:
ClusterName=europa
ControlMachine=europa
ControlAddr=10.150.0.2
#BackupController=service1 (We don't have the second login node configured yet)
#BackupAddr=10.150.0.3
AccountingStorageHost=192.52.98.29


So I'm not sure how europa got an ethernet IP for the ControlHost in the sacctmgr output in the first place.  I guess because Europa had to contact slurmdbd over ethernet?

I think we need an "sacctmgr modify" command to change the ControlHost for L1, right?  Is that safe to do live on L1 while it's in production?

Another question though.  Will using a mix of eth and IB be a problem?  I'm not concerned about the slurmdbd/slurmctld traffic.  That can all happen over ethernet as I'm guessing there is not a lot of traffic between those.  Our ethernet is 10 GbE anyway so we should be good regardless.  I'm more concerned about the slurmctld/slurmd communication.  I don't want the slurmd's on the compute nodes, which have IB only, trying to talk to slurmctld over ethernet.
Comment 7 Brian Christiansen 2018-01-19 11:30:42 MST
The quick and easy thing to do would be to set:

AccountingStorageHost=192.52.98.29

on L1 as well. This should cause the L1 controller to talk to slurmdbd over ethernet, and slurmdbd will then record the ethernet IP.

However, a better solution would be to move all of the Slurm traffic onto the ethernet and let the IB just be for jobs. This way the Slurm traffic isn't causing jitter for the jobs.

Is Slurm talking to the compute nodes over IB as well?
Do you have ethernet to the nodes?

Once things are set up, you could use "squeue -Mall" to verify that the communications are working without having to submit a job.
Comment 8 NASA JSC Aerolab 2018-01-19 12:52:57 MST
Setting AccountingStorageHost=192.52.98.29 in L1's conf file and restarting slurm worked.  Now "sacctmgr show clusters" displays the ethernet address for both clusters.  I can also now use -Mall commands from both clusters.  Thanks.  

No, there isn't an ethernet network that spans the entire cluster on an ICE machine - just IB.  So we have to use the IB for everything.  There is a dual-rail option (i.e. two independent IB networks to every node) that would allow you to segregate traffic.  But we don't have that.  How can I verify that the slurmd's on the compute nodes are talking to slurmctld over IB?
Comment 9 NASA JSC Aerolab 2018-01-19 13:15:02 MST
Also, I just checked the slurmdbd logs and I'm seeing messages like this:

[2018-01-10T09:25:03.165] slurmdbd version 17.11.2 started
[2018-01-10T09:25:21.347] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T09:40:42.258] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T09:46:35.634] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T11:10:39.459] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T17:26:40.230] Terminate signal (SIGINT or SIGTERM) received
[2018-01-10T19:24:25.378] Accounting storage MYSQL plugin loaded
[2018-01-10T19:24:25.466] slurmdbd version 17.11.2 started
[2018-01-10T19:24:31.396] DBD_CLUSTER_TRES: cluster not registered
[2018-01-10T19:51:18.162] Terminate signal (SIGINT or SIGTERM) received
[2018-01-10T19:57:33.500] Accounting storage MYSQL plugin loaded
[2018-01-10T19:57:33.511] slurmdbd version 17.11.2 started
[2018-01-10T19:57:42.533] DBD_CLUSTER_TRES: cluster not registered
[2018-01-18T09:06:54.361] DBD_JOB_COMPLETE: cluster not registered
[2018-01-19T13:25:17.755] DBD_JOB_COMPLETE: cluster not registered

Are the "cluster not registered" messages of concern?
Comment 10 Brian Christiansen 2018-01-19 14:16:02 MST
Can the compute nodes talk to the 192. controller addresses?
e.g.
From a compute node on L1, can it reach 192.52.98.29?

With the current configuration they would be talking to ControlAddr=10.148.2.14.


For the slurmdbd logs, it's an info message saying that it hasn't registered the cluster yet and then registers it in the database. But it was interesting that there were multiple of them. Are more showing up in the logs? I would expect there not to be any more.
Comment 11 NASA JSC Aerolab 2018-01-19 15:43:55 MST
(In reply to Brian Christiansen from comment #10)
> Can the compute nodes talk to the 192. controller addresses?
> e.g.
> From a compute node on L1, can it reach 192.52.98.29?

Technically, yes, but that would be very non-ideal.  We have a route set up on the compute nodes to talk to the 192.52.98.0/24 subnet but it has to NAT through the login node to do so.  We do this, mainly, to allow jobs to contact our license server.  We don't want significant traffic on that subnet.  

> 
> With the current configuration they would be talking to
> ControlAddr=10.148.2.14.

That's good and what I figured.  

> 
> 
> For the slurmdbd logs, it's an info message saying that it hasn't registered
> the cluster yet and then registers it in the database. But it was
> interesting that there were multiple of them. Are more showing up in the
> logs? I would expect there not to be any more.

No, I haven't seen any more.  But it is a little odd to me that there are multiples.  Here is the full set:

[root@service1 ~]# grep 'cluster not registered' /var/log/slurm/slurmdbd.log
[2017-01-18T09:45:56.428] DBD_STEP_START: cluster not registered
[2017-04-10T13:00:49.712] DBD_STEP_COMPLETE: cluster not registered
[2017-04-11T16:08:09.347] DBD_STEP_COMPLETE: cluster not registered
[2017-06-13T17:01:20.569] DBD_CLUSTER_TRES: cluster not registered
[2017-12-09T08:38:53.776] DBD_JOB_COMPLETE: cluster not registered
[2018-01-04T14:39:03.801] DBD_CLUSTER_TRES: cluster not registered
[2018-01-10T09:25:21.347] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T09:40:42.258] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T09:46:35.634] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T11:10:39.459] DBD_JOB_COMPLETE: cluster not registered
[2018-01-10T19:24:31.396] DBD_CLUSTER_TRES: cluster not registered
[2018-01-10T19:57:42.533] DBD_CLUSTER_TRES: cluster not registered
[2018-01-18T09:06:54.361] DBD_JOB_COMPLETE: cluster not registered
[2018-01-19T13:25:17.755] DBD_JOB_COMPLETE: cluster not registered
[root@service1 ~]#
Comment 12 Brian Christiansen 2018-01-19 18:08:14 MST
Ok. I haven't been able to reproduce the multiple messages in the dbd logs, but I wouldn't be concerned: they should only happen when restarting the controller, and they are just "info" messages.

I'll mark the bug as resolved again. Let us know if you have any other questions.

Thanks,
Brian
Comment 13 NASA JSC Aerolab 2018-01-23 13:41:03 MST
We haven't seen any more of those slurmdbd error messages.  

> Constraint question:
> The current implementation assumes all systems in the federation 
> are largely identical. We hope to address this in future versions.

Is there a target date in mind for this?  As I mentioned before, almost all our jobs use something like this:

#SBATCH --constraint=[bro|has|san|wes]

This does a great job of keeping the utilization high on our cluster and also takes the burden of selecting which node type is free off our users.  It would be hard for us to not use it now.  But this isn't going to play well with multi-cluster (or federation).  For example:

[dvicker@europa slurm]% cat multi_cluster.csh 
#! /bin/csh 

#SBATCH --job-name=slurm_test
#SBATCH --time=1:00:00
#SBATCH --constraint=[bro|has|san|wes]
#SBATCH -N 2
#SBATCH -M all

echo "hostnames" 
scontrol show hostnames
[dvicker@europa slurm]% sbatch multi_cluster.csh 
sbatch: error: Problem with submit to cluster europa: Invalid feature specification
Submitted batch job 128051 on cluster l1
[dvicker@europa slurm]% 


It would be great if the above would work on europa too.  In other words, it would be nice if europa would accept a constraint list as long as at least one of the feature types is present.  It looks like currently it will reject the job if all of the features aren't present.
Comment 14 NASA JSC Aerolab 2018-01-24 07:49:01 MST
Sorry, I failed to mention in my last post that europa currently has all "bro" nodes and that job should be capable of running on europa.
Comment 15 Brian Christiansen 2018-01-24 09:23:12 MST
(In reply to NASA JSC Aerolab from comment #13)
> We haven't seen any more of those slurmdbd error messages.
>
> > Constraint question:
> > The current implementation assumes all systems in the federation
> > are largely identical. We hope to address this in future versions.
>
> Is there a target date in mind for this?  As I mentioned before, almost all our
> jobs use something like this:

--snip--

There currently isn't a target date for implementing this. If you are interested in sponsoring the development you can get in touch with Jess (jess@schedmd.com).

> It would be great if the above would work on europa too.  In other words, it
> would be nice if europa would accept a constraint list as long as at least one
> of the feature types is present.  It looks like currently it will reject the
> job if all of the features aren't present.
What you could do is create a job_submit plugin for each cluster that strips off the invalid features. This would allow jobs to be submitted to a cluster even if the job requests features that cluster doesn't have.
Comment 16 NASA JSC Aerolab 2018-01-30 16:17:21 MST
We already use a lua job submit plugin to check that the user requests a specific processor type:

   local feature_count = 0
   if job_desc ~= nil and job_desc.features ~= nil then
      if string.match(job_desc.features, "wes") then feature_count=feature_count+1 end
      if string.match(job_desc.features, "san") then feature_count=feature_count+1 end
      if string.match(job_desc.features, "has") then feature_count=feature_count+1 end
      if string.match(job_desc.features, "bro") then feature_count=feature_count+1 end
   end

   if feature_count > 0 then
      slurm.log_info("Found %s valid cpu features",feature_count)
   else
      slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
      slurm.log_error("Found %s cpu features from %s",feature_count,submit_uid)
      -- See slurm/slurm_errno.h and src/common/slurm_errno.c
      -- for the list of error codes and messages.
      return 2002
   end

So can we extend this to remove features that aren't valid on a given cluster?  For example (there could be syntax errors in this, but you get the idea):

      if string.match(job_desc.features, "wes") then <regex to remove wes> end
      
Also, how will this work with multiple clusters?  The desired situation would be to submit a job with all feature types on either cluster, have the job submit plugin (different for each cluster) strip off the invalid features, and have the job run on whichever cluster is free.  In other words, if job_submit.lua does the right thing on each individual cluster, will "sbatch -Mall" work like we want it to?
Comment 17 NASA JSC Aerolab 2018-02-05 08:18:31 MST
Sorry to bug you again but I never heard back and I'd appreciate your input.
Comment 18 Brian Christiansen 2018-02-05 08:53:04 MST
Thanks for poking again. I had the email marked as unread and filed in a different folder. Just an FYI for the future: since the bug was marked as resolved, it didn't show up in my list of bugs. If you mark the bug as unresolved when you respond, it'll help make sure your responses don't get overlooked.

As far as the lua plugin goes, what I did to test it was to just modify the features of the job that came in. 

e.g.
function slurm_job_submit( job_desc, part_list, submit_uid )

	if (job_desc.features == nil) then
		return slurm.SUCCESS;
	end

	slurm.log_user("Requested features: " .. job_desc.features .. "\n");
	
        -- figure out the intersection of the requested features and the cluster's configured features.
        
        job_desc.features = "c1";

	slurm.log_user("Modified features: " .. job_desc.features .. "\n");
	return slurm.SUCCESS
end

sticking c1 in for cluster1 and c2 for cluster2. Obviously the job's features would really need to be the intersection of the job's requested features and the cluster's configured features.

e.g.
brian@lappy:~/slurm/federation2/c1$ sbatch -Mc1,c2 --wrap="sleep 300" -Cc1,c2
sbatch: Requested features: c1,c2
sbatch: Modified features: c1
Submitted batch job 208996 on cluster c1

brian@lappy:~/slurm/federation2/c1$ sbatch -Mc1,c2 --wrap="sleep 300" -Cc1,c2
sbatch: Requested features: c1,c2
sbatch: Modified features: c2
Submitted batch job 204105 on cluster c2

Does this make sense?
Comment 19 NASA JSC Aerolab 2018-02-05 11:55:10 MST
Good to know about the resolved/unresolved status.  Can I change that?  It sounds like I can.  

Yes, that makes sense.  But the "figure out the intersection of the requested features and the cluster's configured features" part is non-trivial.  As I mentioned above, we require users to specify the processor type in our jobs, so at least one of the processor types will be there.  But we end up with a wide variety of constraints.  Here is our current list of jobs as examples:

Job ID   Queue   Jobname              N:ppn Proc Wall  S Elap  Features  
-------- ------- -------------------- ----- ---- ----- - ----- --------------
131191   normal  200km_7.5kmps        20:12  240 08:00 R 06:37 [WES|san|has|bro]
131192   normal  140km_7.5kmps        20:12  240 08:00 R 06:26 [WES|san|has|bro]
131193   normal  Node1and3Dipole      12:24  288 08:00 R 03:55 BRO&MEM10 
131195   normal  m12a155_5sp_radeq    18:24  432 08:00 R 03:32 BRO|san   
131196   normal  cavity               12:16  192 08:00 R 03:13 SAN       
131197   long    127CH_Run_02_TSM_3p  56:20 1152 10:00 R 00:30 SAN|HAS|BRO
131198   normal  200km_6.0kmps        10:24  240 08:00 R 02:59 [BRO|has|san]
131203   normal  140km_4.5kmps        10:24  240 08:00 R 02:14 [san|HAS|bro]
131204   normal  cavity               12:16  192 08:00 R 00:30 SAN       
131205   normal  140km_6.0kmps        10:24  240 08:00 R 02:06 [san|HAS|bro]
131206   normal  cavity               12:16  192 08:00 R 00:30 SAN       
131207   normal  cavity               12:16  192 08:00 R 00:30 SAN       
131208   long    odpoLCscenarios       2:24   48 24:00 R 01:41 [HAS|wes] 
131209   long    odpoLCscenarios       2:24   48 24:00 R 01:41 [HAS|wes] 

The difficulty of just ripping out the invalid processor types is that I'll end up with invalid constraints.  For the first job above, I'd have to remove the appropriate | symbols too.  In the general case, I'd have to figure out if I should remove the leading or trailing |.  Any thoughts on how to go about this in a robust way?
Comment 20 NASA JSC Aerolab 2018-02-05 11:57:07 MST
Sorry - I also meant to mention that the capitals in the features listed are just an indication of which features were actually used for the job.  That's output from a script that processes the results of squeue (and other commands).  It's just a display thing - the actual job constraints are all lower case.
Comment 21 NASA JSC Aerolab 2018-02-05 15:55:31 MST
I have something that seems to be working.  I would appreciate your comments.  As a refresher, we currently have the following proc types:

L1: wes, san, has, bro
Europa: bro

So the Europa job submit plugin needs to strip off wes, san and has.  I've added this:


function remove_invalid_proc_types(features)

   local invalid_types = { "wes", "san", "has" }

   -- make a copy of the input features
   local pruned = features 

   --print("before:",features)

   -- Loop through the invalid types for this cluster and remove them
   -- Also need to try and clean up to keep a valid syntax
   local ntot = 0
   for k,t in pairs(invalid_types) do
      pruned,n = string.gsub(pruned,t,"") ;

      if ( n > 0) then
         -- Also try to clean up any other problems this creates
         -- Lua pattern special characters are: ( ) . % + - * ? [ ^ $
         -- The escape character is % (needed in the pattern only, not in
         -- the replacement string)
         pruned = string.gsub(pruned,"||","|") ;
         pruned = string.gsub(pruned,"%[|","[") ;
         pruned = string.gsub(pruned,"|%]","]") ;
         pruned = string.gsub(pruned,"^|","") ;
         pruned = string.gsub(pruned,"|$","") ;

         ntot = ntot + n
      end
   end

   --print("after:",pruned,n)
   if ( ntot > 0 ) then
      slurm.log_info("Changed features from '%s' to '%s'", features,pruned)
   end
   return pruned
end

This is called within slurm_job_submit:

        job_desc.features = remove_invalid_proc_types(job_desc.features)

So far this seems to work for all the combinations of features we tend to use.  I'm just a little worried that it won't catch everything and that we'll need to keep adding logic to handle constraint forms we haven't seen yet.
Comment 22 NASA JSC Aerolab 2018-02-05 15:57:39 MST
Created attachment 6078 [details]
job submit plugin
Comment 23 Brian Christiansen 2018-02-06 14:26:22 MST
Regarding changing the Status of the bug, just go to the bug's link and you can change the Status.

I like the way you did it. I had in mind validating "valid" features instead of "invalid" features: splitting the string out into a list, removing the invalid features, and then rebuilding the feature string. Your method seems simple enough.
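
For what it's worth, here is a minimal sketch of that list-based approach (a hypothetical helper, not part of the attached plugin; it assumes a plain "|"-separated OR list such as "[bro|has|san|wes]" and does not handle "&" or "*<count>" operators):

   -- Features configured on this cluster (placeholder set, e.g. europa).
   local valid_types = { bro = true }

   function keep_valid_proc_types(features)
      local kept = {}
      local had_brackets = string.match(features, "^%[.*%]$") ~= nil
      -- Strip the optional surrounding brackets, then walk the "|"-separated tokens.
      local list = string.gsub(features, "[%[%]]", "")
      for token in string.gmatch(list, "[^|]+") do
         if valid_types[token] then
            table.insert(kept, token)
         end
      end
      if #kept == 0 then
         -- Nothing valid for this cluster; leave the request alone and let
         -- the normal policy check reject it.
         return features
      end
      local result = table.concat(kept, "|")
      if had_brackets and #kept > 1 then
         result = "[" .. result .. "]"
      end
      return result
   end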

You may also want to catch the other possibilities of feature requests like &'s and *<numbers> -- just to be complete in case someone tries them.
https://slurm.schedmd.com/sbatch.html#OPT_constraint

Also, I would move the function call to after the check for nil. Otherwise you'll crash on "pruned" being nil in the gsub call.


diff --git a/job_submit.lua b/job_submit.lua
index 10bd981..cff5f0f 100644
--- a/job_submit.lua
+++ b/job_submit.lua
@@ -83,7 +83,9 @@ function slurm_job_submit(job_desc, part_list, submit_uid)
        --slurm.log_info("job_desc is a %s",type(job_desc))
        --slurm.log_info("job_features are %s",job_desc.features)
 
-       job_desc.features = remove_invalid_proc_types(job_desc.features)
+       if job_desc ~= nil and job_desc.features ~= nil then
+               job_desc.features = remove_invalid_proc_types(job_desc.features)
+       end
 
        local feature_count = 0
        if job_desc ~= nil and job_desc.features ~= nil then


I played with the scripts a little as well.

For example I put features f1,f2 on cluster1 and f2,f3 on cluster2.

c1$ sbatch --wrap="sleep 300" -C"f3" -Mc1,c2
sbatch: error: Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.
sbatch: error: Problem with submit to cluster c1: Access/permission denied
Submitted batch job 204123 on cluster c2

It ended up submitting on cluster2 since cluster1 doesn't have f3, but it did spit out the errors from cluster1. You may want to play around with the logging from the submit script. It might be a little tricky to suppress errors from one cluster and not the other -- maybe it'll be best to not print the errors at all? But even then you could get errors from the system if the job couldn't run on one cluster because of other resources -- like node count, if one system is bigger than the other. It could just be confusing to users to see errors even though the job submitted. Just thinking out loud.

Do you have a test system to play with? If not, one option is to do what we do and set up a simulated cluster. We run one cluster with multiple slurmds (all on different ports).
https://slurm.schedmd.com/programmer_guide.html#multiple_slurmd_support
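
(Roughly, the node definitions for that kind of setup look something like the sketch below in slurm.conf -- this assumes slurmd was built with --enable-multiple-slurmd, and the names and ports are placeholders; each slurmd also needs its own SlurmdLogFile/SlurmdSpoolDir/SlurmdPidFile, e.g. by using the "%n" substitution.)

NodeName=tux[1-4] NodeHostname=localhost Port=[17001-17004] Procs=4
PartitionName=debug Nodes=tux[1-4] Default=YES State=UP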

I put some scripts together that set up multi-cluster environments -- one using docker (setup.pl) and another using the multiple slurmd approach (setup_local.pl). You can look at these to help set up a test instance for trying the interactions out. FYI, the scripts aren't supported by SchedMD, so you are on your own in using them.
https://github.com/gaijin03/fedtest

Having a test environment will help you see how the interactions work before putting them into production.
Comment 24 NASA JSC Aerolab 2018-02-07 08:55:49 MST
Thanks for the tip on moving the function call.  I was trying to build logic into the function call itself but that's a better solution.  

I agree the errors could be a little confusing in a multi-cluster case where all valid options are eliminated for a given cluster.  But we really need to issue those messages for a job submitted to just that cluster.  I think we'll just have to educate our users on that.

Yes, I have set up a simulated cluster for just L1 before, using my workstation, which was very useful.  I hadn't tried to do this for both of the clusters.  Thanks for the scripts to help with that.
Comment 25 Brian Christiansen 2018-02-27 08:23:27 MST
Just following up. Do you need any more assistance on this?
Comment 26 NASA JSC Aerolab 2018-02-27 08:35:27 MST
No, I think we are good on this one. Thanks.
Comment 27 Brian Christiansen 2018-02-27 08:44:56 MST
Great. Let us know if anything comes up.
Comment 28 S Senator 2018-02-27 08:52:41 MST
Apologies for lurking on this bug, but we too are exploring federation and are not as far along. Specifically, we have concerns about creating dependencies on shared resources, in this case the shared database.

Have either of you explored or implemented a highly available or clustered database? Alternatively, have you altered your processes to take database backups more frequently so that recovery could happen within a reasonable period of time?

We are also starting an effort to compare highly available MySQL-compatible databases. Some local administrators are big proponents of Percona's solution and highly critical of the complexity of stock MySQL alternatives.
Comment 29 NASA JSC Aerolab 2018-02-27 09:03:57 MST
No, we haven't explored HA on the database yet.  We are just using a login node from one of the clusters as our slurmdbd host.  We do run a backup slurmctld on a second login node.  And we do back up the DB every night, so we are relying on manually failing over (restoring the DB) to another node in the event of a failure and dealing with the lost jobs for that day.