Bug 9832

Summary: How to configure slurmd on login nodes when running in Configless Slurm mode?
Product: Slurm
Reporter: Ole.H.Nielsen
Component: Configuration
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: cinek
Version: 20.02.5
Hardware: Linux
OS: Linux
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2020-09-16 06:48:21 MDT
The SLUG 2020 talk "Field Notes 4: From The Frontlines of Slurm Support" by Jason Booth recommends on p. 31 to run slurmd on all login nodes in Configless Slurm mode:

> We generally suggest that you run a slurmd to manage the
> configs on those nodes that run client commands, including
> submit or login nodes

Jason mentioned in the talk that one could configure a hidden partition and let login nodes be in that partition.  I guess this is required in order for slurmd to be able to start correctly and join the Configless cluster.

Question: Can you share information about how to configure such a hidden partition, while not allowing any users to submit jobs to it?

Thanks,
Ole
Comment 1 Marcin Stolarek 2020-09-16 07:26:30 MDT
Ole,

The simplest approach is just to add the node to the configuration file, with a line like:
>NodeName=submitHost ...
but don't add this node to any partition:
>PartitionName=partition Nodes=...
                               ^^^don't use 'submitHost' here


The parameter to hide a partition that Jason mentioned is Hidden[1], but it's technically not necessary to add the node to any partition.

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_Hidden
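For illustration, a minimal slurm.conf sketch of both variants (the node and partition names below are placeholders, not taken from this site's actual configuration):

# Variant 1: define the submit/login host as a node, but list it in no partition
NodeName=submitHost

# Variant 2: put it in a dedicated partition that is hidden and inactive,
# so users neither see it nor can submit jobs to it
PartitionName=logins Nodes=submitHost Hidden=YES State=INACTIVE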
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2020-09-16 09:25:39 MDT
(In reply to Marcin Stolarek from comment #1)
> Ole,
> 
> The simplest approach is just to add the node to the configuration file,
> with a line like:
> >NodeName=submitHost ...
> but don't add this node to any partition:
> >PartitionName=partition Nodes=...
>                                ^^^don't use 'submitHost' here
> 
> 
> The parameter to hide a partition that Jason mentioned is Hidden[1], but
> it's technically not necessary to add the node to any partition.

I see, that's a nice and simple solution!  I've just added our login nodes to slurm.conf with default parameters (who cares about the real hardware):

NodeName=login1,login2

and restarted slurmctld and all slurmds.  On the login nodes I installed the slurm-slurmd RPM and started the slurmd service.  Now the Configless directory is populated as desired:

# ssh login1 ls -l /run/slurm/conf/
total 28
-rw-r--r--. 1 root root   485 Sep 16 17:07 cgroup.conf
-rw-r--r--. 1 root root   123 Sep 16 17:07 gres.conf
-rw-r--r--. 1 root root 13550 Sep 16 17:07 slurm.conf
-rw-r--r--. 1 root root  1963 Sep 16 17:07 topology.conf
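
For reference, the per-login-node steps look roughly like this - a sketch assuming the stock slurmd.service unit shipped in the slurm-slurmd RPM (which reads SLURMD_OPTIONS from /etc/sysconfig/slurmd) and a placeholder controller hostname; the --conf-server option is only needed if no _slurmctld._tcp DNS SRV record is available:

# yum install slurm-slurmd
# echo 'SLURMD_OPTIONS="--conf-server slurmctld.example.com:6817"' >> /etc/sysconfig/slurmd
# systemctl enable --now slurmd

After that, /run/slurm/conf/ gets populated as shown above.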

Hopefully no one will be able to submit jobs to these login nodes :-)

I have added your method to my Slurm Wiki pages now:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#add-login-and-submit-nodes-to-slurm-conf

Thanks a lot,
Ole
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2020-09-16 12:13:20 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #2)
> I see, that's a nice and simple solution!  I've just added our login nodes
> to slurm.conf with default parameters (who cares about the real hardware):
> 
> NodeName=login1,login2

It turns out that the topology.conf file must also contain the login nodes now:

SwitchName=switch Nodes=login1,login2

If this is forgotten, the slurmctld.log will say:

error: WARNING: switches lack access to 6 nodes: login1,login2,...
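
One quick way to check what slurmctld has loaded after such a change is:

# scontrol show topology

which prints the switch/node layout read from topology.conf and should now list the login nodes under their switch.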
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2020-09-16 12:22:19 MDT
I'm also seeing a lot of down/up messages in slurmctld.log without having touched the login node (fjorm, in this example):

[2020-09-16T18:52:06.729] Node fjorm now responding
[2020-09-16T18:52:06.729] node fjorm returned to service
[2020-09-16T18:58:46.335] error: Nodes fjorm not responding, setting DOWN
[2020-09-16T19:25:26.558] Node fjorm now responding
[2020-09-16T19:25:26.558] node fjorm returned to service
[2020-09-16T19:35:13.003] error: Nodes fjorm not responding
[2020-09-16T19:35:26.478] error: Nodes fjorm not responding, setting DOWN
[2020-09-16T19:58:46.515] Node fjorm now responding
[2020-09-16T19:58:46.515] node fjorm returned to service
[2020-09-16T20:05:13.612] error: Nodes fjorm not responding
[2020-09-16T20:05:26.100] error: Nodes fjorm not responding, setting DOWN

I wonder if this could be a firewall issue?  What would be the requirements for slurmd to read the Configless files from slurmctld?  Must one open the firewall on the login node for all traffic from/to the slurmctld node?
Comment 5 Marcin Stolarek 2020-09-18 04:05:50 MDT
>It turns out that the topology.conf file must also contain the login nodes now
Yep - in your configuration (without RoutePlugin=route/topology) the warning should not have any serious consequences, but adding the node to topology.conf will silence it.

>I wonder if this could be a firewall issue?
It may be - in general every slurmd should be able to talk not only to slurmctld but also to other slurmd instances. Just to double-check: did you restart all the other slurmds once the submit host node was added?
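
As an illustration only (assuming the default SlurmdPort=6818 and SlurmctldPort=6817): the login node's firewall would at least have to accept the slurmd port from the slurmctld host and from the other slurmd nodes, and outgoing traffic to port 6817 on the controller must be allowed. With firewalld that could look roughly like:

# firewall-cmd --permanent --add-port=6818/tcp
# firewall-cmd --reload

preferably restricted by source address to the cluster networks rather than opened globally.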

cheers,
Marcin
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2020-09-18 04:22:52 MDT
(In reply to Marcin Stolarek from comment #5)
> >It turns out that the topology.conf file must also contain the login nodes now
> Yep - in your configuration (without RoutePlugin=route/topology) the warning
> should not have any serious consequences, but adding the node to
> topology.conf will silence it.

Yes, that seems to be the correct approach.

> >I wonder if this could be a firewall issue?
> It may be - in general every slurmd should be able to talk not only to
> slurmctld but also to other slurmd instances. Just to double-check: did you
> restart all the other slurmds once the submit host node was added?

Our login nodes live on a public network, whereas all compute nodes live on a private subnet.  We do not permit network traffic from the public network to the private subnet, so the login nodes can definitely not reach the compute slurmd nodes!

Is this a showstopper for slurmd on login nodes?  If so, I will remove the login nodes from slurm.conf again.

I rebooted the login node, and now it seems to be responding all the time.  Don't know what the issue was...

Thanks,
Ole
Comment 8 Marcin Stolarek 2020-09-21 04:08:09 MDT
Ole,

Unfortunately, this needs to be a somewhat detailed answer to give a complete view of what's happening behind the scenes.

>[..]We do not permit network traffic from the public network to the private subnet, so the login nodes can definitely not reach the compute slurmd nodes!
Does that mean you don't have any users of e.g. srun --pty or the salloc/srun combination? These require submit node -> slurmd communication too (just a heads-up).

Focusing on the topic of this bug report, things get really complicated because of the forwarding-tree communication used by Slurm. I cannot explain all the details briefly, but to give you some insight, let's focus on the example of accounting data gathering.

Many periodic activities in Slurm are performed from a so-called background thread. It regularly goes over the list of things that should happen at certain intervals, and if the last occurrence of e.g. accounting data gathering happened more than JobAcctGatherFrequency ago, it queues the appropriate RPC for all active slurmds in the slurm agent thread queue. The agent picks up this message and sends it to up to TreeWidth[1] nodes, and those forward it to the lower levels of the tree. If any level does not get a reply within MessageTimeout, it reissues the RPC directly to all missing descendants, and the agent finally marks all non-responding nodes down.

While this retry mechanism should work in your case, it will result in communication threads randomly failing or hanging (depending on the details of the specific RPC and the firewall configuration), which will very likely produce additional error messages on both the slurmctld and slurmd side. A few RPCs (like accounting gathering) are additionally treated as a PING - a delay in their processing results in an additional ping RPC being sent to the nodes every SlurmdTimeout/3, which won't happen in a standard healthy case.
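
For orientation, the settings involved are all slurm.conf parameters; the values below are just the documented defaults, not this site's actual configuration:

# Fan-out of the slurmctld/slurmd forwarding tree
TreeWidth=50
# Seconds to wait for a reply before the RPC is re-issued directly
MessageTimeout=10
# Interval of the accounting-gathering RPC (a successful one also counts as a ping)
JobAcctGatherFrequency=30
# A node is set DOWN after being unresponsive this long; extra pings go out every SlurmdTimeout/3
SlurmdTimeout=300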

Having said that, I would not recommend running a slurmd in an isolated network; however, you can give it a try and check how it works on the login node in your specific case. Looking at the slurm.conf you shared with us in other cases, and at the code, the error-prone behavior will probably happen only for accounting data gathering, healthcheck and ping RPCs, and the last one will happen quite rarely, since a successful accounting gathering (every 30s in your configuration) is treated as a successful ping too.

I hope this makes the situation clearer. Do you have any further questions?

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_TreeWidth
Comment 9 Marcin Stolarek 2020-09-30 04:24:32 MDT
Ole,

Did you further experiment with the setup?
Is there anything else I can help you with in the case?

cheers,
Marcin
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2020-09-30 04:38:43 MDT
Hi Marcin,

(In reply to Marcin Stolarek from comment #9)
> Ole,
> 
> Did you further experiment with the setup?
> Is there anything else I can help you with in the case?

I'm sorry for not replying.  I'm really busy with some new systems this week.

It is true that slurmd on our compute nodes and login nodes cannot communicate, since they are on separate networks.  I really prefer to have the compute nodes on an isolated network, and I have not seen any bad side effects for years.

Maybe the safe solution is to drop Configless Slurm on the login nodes and remove them from slurm.conf.  I will then need to update the /etc/slurm/* files manually, which we have scripts for.

Under these circumstances, do you agree that my login nodes should be removed from slurm.conf?

Thanks,
Ole
Comment 11 Marcin Stolarek 2020-09-30 05:32:21 MDT
>I'm sorry for not replying.
No issue - I just wanted to follow up and check that you got what you need from us.

>do you agree that my login nodes should be removed from slurm.conf?
I'd suggest removing the submit host from slurm.conf and using configless without slurmd. This requires an additional RPC to fetch the configuration for each command execution, but that RPC gets special treatment to reduce its impact on other activities.
If you notice any negative influence on end users, you can always create a local copy of slurm.conf to prevent it.
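
For completeness, configless operation without a local slurmd means the client commands locate the controller through a DNS SRV record (unless SLURM_CONF points at a local file). A sketch of such a record, with a placeholder hostname and the default slurmctld port:

_slurmctld._tcp 3600 IN SRV 10 10 6817 slurmctld.example.com

With that in place, the client commands on the login nodes fetch the configuration from slurmctld at startup without needing slurm.conf locally.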

cheers,
Marcin
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2020-09-30 07:31:05 MDT
Hi Marcin,

Thanks for the recommendations:

(In reply to Marcin Stolarek from comment #11)
> >do you agree that my login nodes should be removed from slurm.conf?
> I'd suggest removing the submit host from slurm.conf and using configless
> without slurmd. This requires an additional RPC to fetch the configuration
> for each command execution, but that RPC gets special treatment to reduce
> its impact on other activities.
> If you notice any negative influence on end users, you can always create a
> local copy of slurm.conf to prevent it.

Actually, I currently have to keep local copies of slurm.conf due to bug 9330, which affects the tools in the slurm-torque RPM package.

You may close this case now.

Best regards,
Ole