| Summary: | Slurmdbd with a second cluster | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN> |
| Component: | Configuration | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | Severity: | 2 - High Impact |
| Version: | 17.11.1 | Hardware: | Linux |
| OS: | Linux | Site: | Johnson Space Center |
| Attachments: | Current configuration files | | |
Description
NASA JSC Aerolab
2018-01-02 13:54:58 MST
Created attachment 5833: Current configuration files
Brian Christiansen:
I'm assuming you want both clusters connected to the single database on service1, correct? In that case you just need one slurmdbd -- the one on service1. The AccountingStorageHost=service1 setting in each cluster's slurm.conf tells the slurmctlds to contact the slurmdbd on service1. Have you tried starting the slurmctld on Europa? Are you using the same munge key for both clusters?

NASA JSC Aerolab:
Correct - we want both clusters to use a common database. Yes, I am using the same munge key. Yes, I have tried starting slurmctld, but this is what I get:

```
[2018-01-02T15:14:28.690] Job accounting information stored, but details not gathered
[2018-01-02T15:14:28.690] slurmctld version 17.11.1-2 started on cluster europa
[2018-01-02T15:14:28.694] job_submit.lua: initialized
[2018-01-02T15:14:30.700] error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
[2018-01-02T15:14:30.700] error: slurmdbd: Sending PersistInit msg: Connection timed out
[2018-01-02T15:14:30.700] error: Association database appears down, reading from state file.
[2018-01-02T15:14:30.700] error: Unable to get any information from the state file
[2018-01-02T15:14:30.700] fatal: slurmdbd and/or database must be up at slurmctld start time
```

That's what led me down the path of trying to get slurmdbd running on Europa. I figured maybe a slurmdbd running on Europa would communicate with the slurmdbd running on service1 (and the slurmctld on Europa would talk to the slurmdbd running on Europa).

Brian Christiansen:
Can you connect to service1 on port 6819 from europa? I'm wondering if the ports are open, e.g.:

```
telnet service1 6819
```

You should be able to run sacctmgr commands from europa as well (even with the slurmctld down). What does `sacctmgr show clusters` give you?

NASA JSC Aerolab:
Yep, ports are open:

```
[root@europa init.d]# telnet 192.52.98.29 6819
Trying 192.52.98.29...
Connected to 192.52.98.29.
Escape character is '^]'.
^]
telnet>
```

sacctmgr doesn't work either:

```
[root@europa init.d]# cd /software/x86_64/slurm/17.11.1/bin/
[root@europa bin]# ./sacctmgr show clusters
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
sacctmgr: error: Problem talking to the database: Connection timed out
[root@europa bin]#
```

On Europa, the DNS name "service1" isn't going to resolve to the correct host on L1. If I change it to the correct IP address, it works:

```
[root@europa bin]# vi ../etc/slurm.conf
[root@europa bin]# grep AccountingStorageHost ../etc/slurm.conf
AccountingStorageHost=192.52.98.29
[root@europa bin]# ./sacctmgr show clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
        l1     10.148.2.14         6817  8192         1                                                                                            normal
[root@europa bin]#
```

Any issue with using an IP for AccountingStorageHost?
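For reference, a minimal sketch of the single-slurmdbd layout described above. Only the hostnames, cluster name, port, and IP are taken from this ticket; everything else is an illustrative placeholder, not the site's actual configuration:

```
# slurmdbd.conf on service1 -- the only slurmdbd needed for both clusters
# (storage settings below are illustrative placeholders)
DbdHost=service1
DbdPort=6819
StorageType=accounting_storage/mysql
StorageHost=localhost

# slurm.conf on the europa cluster -- point accounting at service1's
# slurmdbd. Because "service1" does not resolve correctly from europa,
# the raw IP is used here; an /etc/hosts entry mapping service1 to
# 192.52.98.29 on the europa side would work just as well.
ClusterName=europa
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=192.52.98.29
AccountingStoragePort=6819
```

A second cluster is also typically registered in the accounting database with `sacctmgr add cluster europa` if it does not show up on its own.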
NASA JSC Aerolab:
Definite progress. I'm able to start slurmctld on Europa:

```
[root@europa init.d]# ./slurm start
starting slurmctld:                                        [  OK  ]
[root@europa init.d]#
```

And the logs seem reasonable:

```
[2018-01-02T16:03:34.832] Job accounting information stored, but details not gathered
[2018-01-02T16:03:34.832] slurmctld version 17.11.1-2 started on cluster europa
[2018-01-02T16:03:34.837] job_submit.lua: initialized
[2018-01-02T16:03:35.207] layouts: no layout to initialize
[2018-01-02T16:03:35.213] layouts: loading entities/relations information
[2018-01-02T16:03:35.213] error: Could not open node state file /software/x86_64/slurm/state2/node_state: No such file or directory
[2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Information may be lost!
[2018-01-02T16:03:35.213] No node state file (/software/x86_64/slurm/state2/node_state.old) to recover
[2018-01-02T16:03:35.213] error: Could not open job state file /software/x86_64/slurm/state2/job_state: No such file or directory
[2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Jobs may be lost!
[2018-01-02T16:03:35.213] No job state file (/software/x86_64/slurm/state2/job_state.old) to recover
[2018-01-02T16:03:35.213] cons_res: select_p_node_init
[2018-01-02T16:03:35.213] cons_res: preparing for 3 partitions
[2018-01-02T16:03:35.497] error: Could not open reservation state file /software/x86_64/slurm/state2/resv_state: No such file or directory
[2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Reservations may be lost
[2018-01-02T16:03:35.497] No reservation state file (/software/x86_64/slurm/state2/resv_state.old) to recover
[2018-01-02T16:03:35.497] error: Could not open trigger state file /software/x86_64/slurm/state2/trigger_state: No such file or directory
[2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Triggers may be lost!
[2018-01-02T16:03:35.497] No trigger state file (/software/x86_64/slurm/state2/trigger_state.old) to recover
[2018-01-02T16:03:35.497] _preserve_plugins: backup_controller not specified
[2018-01-02T16:03:35.497] Reinitializing job accounting state
[2018-01-02T16:03:35.497] Ending any jobs in accounting that were running when controller went down on
[2018-01-02T16:03:35.497] cons_res: select_p_reconfigure
[2018-01-02T16:03:35.497] cons_res: select_p_node_init
[2018-01-02T16:03:35.497] cons_res: preparing for 3 partitions
[2018-01-02T16:03:35.497] Running as primary controller
[2018-01-02T16:03:35.497] Registering slurmctld at port 6817 with slurmdbd.
[2018-01-02T16:03:35.718] error: No fed_mgr state file (/software/x86_64/slurm/state2/fed_mgr_state) to recover
[2018-01-02T16:03:35.799] No last decay (/software/x86_64/slurm/state2/priority_last_decay_ran) to recover
[2018-01-02T16:03:35.799] No parameter for mcs plugin, default values set
[2018-01-02T16:03:35.799] mcs: MCSParameters = (null). ondemand set.
[2018-01-02T16:04:00.078] agent msg_type=1001 ran for 22 seconds
[2018-01-02T16:04:35.831] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2018-01-02T16:05:40.130] agent msg_type=1001 ran for 22 seconds
```
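The "Could not open ... state file" errors above are expected on a first start with an empty state directory. One detail worth noting for a shared filesystem like this one: each cluster's slurmctld must have its own StateSaveLocation. A one-line sketch using the directory from the log (the path is the only detail taken from this ticket):

```
# slurm.conf on europa -- state directory private to this cluster;
# l1's slurmctld must point at a different directory, or the two
# controllers will overwrite each other's node/job state.
StateSaveLocation=/software/x86_64/slurm/state2
```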
However, back on service1, there are some error messages in the slurmdbd logs:

```
[2017-12-15T13:57:28.100] slurmdbd version 17.11.0 started
[2018-01-02T15:40:09.581] error: Invalid msg_size (1397966893) from connection 10(192.52.98.128) uid(-2)
[2018-01-02T16:00:34.641] error: Processing last message from connection 8(192.52.98.128) uid(33144)
[2018-01-02T16:00:34.681] error: Processing last message from connection 8(192.52.98.128) uid(33144)
```

Are those of concern?

Brian Christiansen:
Good to hear. No issues using an IP instead of a DNS name. I'm guessing those messages are from the telnet connection. I get similar messages when I telnet in as well, e.g.:

```
slurmdbd: debug2: Opened connection 8 from 127.0.0.1
slurmdbd: error: Could not read msg_size from connection 8(127.0.0.1) uid(-2)
slurmdbd: debug2: Closed connection 8 uid(-2)
```

If you run `sacctmgr show clusters` you should see both clusters now.

NASA JSC Aerolab:
Yep:

```
[root@service1 tmp]# sacctmgr show clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
    europa   192.52.98.128         6817  8192         1                                                                                            normal
        l1     10.148.2.14         6817  8192         1                                                                                            normal
[root@service1 tmp]#
```

Thanks for the help. You might keep this open a little longer - I'm guessing I'm going to need some help with a couple of other items as I keep going...

Brian Christiansen:
No problem. Glad it's working for you now. How about we close this for now? You can reopen it if a question/issue is related, or open new tickets for separate questions.

NASA JSC Aerolab:
Sure, that's fine. Thanks again.

Brian Christiansen:
Great. Just let us know if you have more questions.

Thanks,
Brian
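As a footnote on those harmless slurmdbd errors: any raw TCP client appears to produce them, since slurmdbd reads the first bytes of a new connection as a message length. A quick way to reproduce, assuming nc is available (only the IP and port are taken from this ticket):

```
# slurmdbd logs "Invalid msg_size" / "Could not read msg_size" and
# drops the connection -- the same effect as the earlier telnet test.
echo hello | nc 192.52.98.29 6819
```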