We've been using Slurm with our cluster for about a year. We are just installing a second cluster and I'm having issues getting things configured.

Existing cluster:
  name = L1
  OS = CentOS/RHEL6
  IB network = 10.148.0.0/16
  10g Ethernet network (service0 only)
  Head node = service1
    - ethernet address = 192.52.98.29
    - IB address = 10.148.2.14
  slurmctld and slurmdbd run on service1
  config dir = etc

New cluster:
  name = Europa
  OS = CentOS7
  IB network = 10.150.0.0/16
  10g Ethernet network (service0 only)
  Head node = service0 (aka europa)
    - ethernet address = 192.52.98.128
    - IB address = 10.150.0.2
  slurmctld to run on service0
  config dir = etc2

This is probably obvious, but the IB networks are independent between the two clusters (i.e. not connected in any way). The original conf files (for L1) use the IB addresses for everything. I'm trying to use the ethernet address for DbdAddr on the new cluster because that's the only network that can see the node running slurmdbd on the old cluster. I've attached the configuration files I'm trying to use.

Does slurmdbd need to run on the new cluster? I think it does, but it needs to connect to the DB running on the old cluster. That's how I've tried to configure the Europa (etc2) files.
The first problem I was running into is an issue even trying to start slurmdbd:

  [2018-01-02T14:15:26.328] error: plugin_load_from_file: dlopen(/software/x86_64/slurm/17.11.1/lib/slurm/accounting_storage_mysql.so): libmysqlclient_r.so.16: cannot open shared object file: No such file or directory
  [2018-01-02T14:15:26.328] error: Couldn't load specified plugin name for accounting_storage/mysql: Dlopen of plugin file failed
  [2018-01-02T14:15:26.328] error: cannot create accounting_storage context for accounting_storage/mysql
  [2018-01-02T14:15:26.328] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

I've got these installed:

  [root@europa ~]# rpm -qa | grep maria
  mariadb-5.5.52-1.el7.x86_64
  mariadb-libs-5.5.52-1.el7.x86_64
  mariadb-devel-5.5.52-1.el7.x86_64
  [root@europa ~]# ls -l /usr/lib64/mysql/
  total 3080
  lrwxrwxrwx 1 root root      17 Dec 29 00:41 libmysqlclient_r.so -> libmysqlclient.so
  lrwxrwxrwx 1 root root      19 Jan  2 14:23 libmysqlclient_r.so.16 -> libmysqlclient_r.so
  lrwxrwxrwx 1 root root      20 Dec 29 00:41 libmysqlclient.so -> libmysqlclient.so.18
  lrwxrwxrwx 1 root root      24 Dec 27 11:25 libmysqlclient.so.18 -> libmysqlclient.so.18.0.0
  -rwxr-xr-x 1 root root 3135736 Nov 14  2016 libmysqlclient.so.18.0.0
  -rwxr-xr-x 1 root root    6758 Nov 14  2016 mysql_config
  drwxr-xr-x 2 root root    4096 Dec 27 11:25 plugin
  [root@europa ~]#

I found a little via google on this but no good fix. The version of slurm I'm using has been recompiled from scratch on the (centos7) head node. I can solve this by:

  [root@europa ~]# cd /usr/lib64/mysql/
  [root@europa mysql]# ln -s libmysqlclient_r.so libmysqlclient_r.so.16

But that seems like kind of a hack. Is there a better way?

After that, I still can't connect properly:

  [2018-01-02T14:42:45.416] debug3: Trying to load plugin /software/x86_64/slurm/17.11.1/lib/slurm/auth_munge.so
  [2018-01-02T14:42:45.416] debug:  Munge authentication plugin loaded
  [2018-01-02T14:42:45.416] debug3: Success.
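As a slightly safer variant of the manual symlink, the compatibility link can be created only when it is actually missing. This is a sketch, not anything Slurm ships; LIBDIR is just a shell parameter defaulting to the MariaDB library directory shown above:

```shell
#!/bin/sh
# Create the legacy libmysqlclient_r.so.16 compatibility symlink only if
# it is missing and the unversioned library is present.
LIBDIR="${LIBDIR:-/usr/lib64/mysql}"

if [ ! -e "$LIBDIR/libmysqlclient_r.so.16" ] && [ -e "$LIBDIR/libmysqlclient_r.so" ]; then
    ln -s libmysqlclient_r.so "$LIBDIR/libmysqlclient_r.so.16"
fi
```

It is still a workaround: a plugin freshly built on CentOS 7 against mariadb-devel would normally link against libmysqlclient.so.18, so `ldd /software/x86_64/slurm/17.11.1/lib/slurm/accounting_storage_mysql.so` is worth running to see which sonames that build actually expects.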
  [2018-01-02T14:42:45.416] debug3: Trying to load plugin /software/x86_64/slurm/17.11.1/lib/slurm/accounting_storage_mysql.so
  [2018-01-02T14:42:45.420] debug2: mysql_connect() called for db slurm_acct_db
  [2018-01-02T14:42:45.420] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
  [2018-01-02T14:42:45.420] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.

The "local MySQL server" part is bothering me. Shouldn't it be trying to connect to the MySQL server on another host (192.52.98.29)?
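For what it's worth, error 2002 with a socket path means the client library defaulted to a local connection: when the storage host is unset or "localhost", MySQL/MariaDB clients use the unix socket rather than TCP. A quick, generic way to check whether a remote database port answers at all (3306 is the assumed MariaDB default; the host below is just this thread's example):

```shell
#!/bin/sh
# port_open HOST PORT -> prints "open" or "closed".
# Uses bash's /dev/tcp pseudo-device for a bare TCP connect test.
port_open() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# e.g. from europa: port_open 192.52.98.29 3306
```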
Created attachment 5833 [details] Current configuration files
I'm assuming you want both clusters connected to the single database on service1, correct? In that case you just need one slurmdbd: the one on service1. The AccountingStorageHost=service1 setting in each cluster's slurm.conf tells the slurmctlds to contact the slurmdbd on service1. Have you tried starting the slurmctld on Europa? Are you using the same munge key for both clusters?
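For reference, the accounting lines a second cluster's slurm.conf typically needs look something like this (a sketch built from the values in this thread; the attached configs may differ, and everything else is left at defaults):

```
# slurm.conf on Europa (etc2) - accounting-related lines only
ClusterName=europa
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=service1    # must be reachable/resolvable from europa
AccountingStoragePort=6819        # slurmdbd's default port
```

No DbdAddr or slurmdbd.conf is needed on Europa at all; those only matter on the node actually running slurmdbd.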
Correct - we want both clusters to use a common database. Yes, I am using the same munge key. Yes, I have tried starting slurmctld, but this is what I get:

  [2018-01-02T15:14:28.690] Job accounting information stored, but details not gathered
  [2018-01-02T15:14:28.690] slurmctld version 17.11.1-2 started on cluster europa
  [2018-01-02T15:14:28.694] job_submit.lua: initialized
  [2018-01-02T15:14:30.700] error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
  [2018-01-02T15:14:30.700] error: slurmdbd: Sending PersistInit msg: Connection timed out
  [2018-01-02T15:14:30.700] error: Association database appears down, reading from state file.
  [2018-01-02T15:14:30.700] error: Unable to get any information from the state file
  [2018-01-02T15:14:30.700] fatal: slurmdbd and/or database must be up at slurmctld start time

That's what led me down the path of trying to get slurmdbd running on Europa. I figured maybe slurmdbd running on Europa communicated with the slurmdbd running on service1 (and slurmctld on Europa talked with the slurmdbd running on Europa).
Can you connect to service1 on port 6819 from europa? I'm wondering if the ports are open? e.g.

  telnet service1 6819

You should be able to do sacctmgr commands from europa as well (even with the slurmctld down). What does:

  sacctmgr show clusters

give you?
Yep, ports are open:

  [root@europa init.d]# telnet 192.52.98.29 6819
  Trying 192.52.98.29...
  Connected to 192.52.98.29.
  Escape character is '^]'.
  ^]
  telnet>

sacctmgr doesn't work either:

  [root@europa init.d]# cd /software/x86_64/slurm/17.11.1/bin/
  [root@europa bin]# ./sacctmgr show clusters
  sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
  sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
  sacctmgr: error: Problem talking to the database: Connection timed out
  [root@europa bin]#

On Europa, the DNS for "service1" isn't going to resolve to the correct host on L1. If I change this to the correct IP address, it works:

  [root@europa bin]# vi ../etc/slurm.conf
  [root@europa bin]# grep AccountingStorageHost ../etc/slurm.conf
  AccountingStorageHost=192.52.98.29
  [root@europa bin]# ./sacctmgr show clusters
     Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
  ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
          l1     10.148.2.14         6817  8192         1                                                                                           normal
  [root@europa bin]#

Any issue with using an IP for AccountingStorageHost?
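Before putting any hostname in slurm.conf, it's easy to check what (if anything) it resolves to on the host in question; getent consults the same NSS sources (/etc/hosts, DNS, etc.) that the resolver would. A small sketch, not specific to Slurm:

```shell
#!/bin/sh
# resolve_first NAME -> print the first address NSS returns for NAME,
# or nothing if the name does not resolve on this host.
resolve_first() {
    getent hosts "$1" | awk '{print $1; exit}'
}

# e.g. on europa: resolve_first service1
# An empty result means slurm.conf needs the numeric address instead.
```

The other common fix, since the clusters' private networks don't overlap, would be adding a line for service1 (with its ethernet address) to /etc/hosts on europa.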
Definite progress. I'm able to start slurmctld on Europa:

  [root@europa init.d]# ./slurm start
  starting slurmctld: [ OK ]
  [root@europa init.d]#

And the logs seem reasonable:

  [2018-01-02T16:03:34.832] Job accounting information stored, but details not gathered
  [2018-01-02T16:03:34.832] slurmctld version 17.11.1-2 started on cluster europa
  [2018-01-02T16:03:34.837] job_submit.lua: initialized
  [2018-01-02T16:03:35.207] layouts: no layout to initialize
  [2018-01-02T16:03:35.213] layouts: loading entities/relations information
  [2018-01-02T16:03:35.213] error: Could not open node state file /software/x86_64/slurm/state2/node_state: No such file or directory
  [2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Information may be lost!
  [2018-01-02T16:03:35.213] No node state file (/software/x86_64/slurm/state2/node_state.old) to recover
  [2018-01-02T16:03:35.213] error: Could not open job state file /software/x86_64/slurm/state2/job_state: No such file or directory
  [2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Jobs may be lost!
  [2018-01-02T16:03:35.213] No job state file (/software/x86_64/slurm/state2/job_state.old) to recover
  [2018-01-02T16:03:35.213] cons_res: select_p_node_init
  [2018-01-02T16:03:35.213] cons_res: preparing for 3 partitions
  [2018-01-02T16:03:35.497] error: Could not open reservation state file /software/x86_64/slurm/state2/resv_state: No such file or directory
  [2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Reservations may be lost
  [2018-01-02T16:03:35.497] No reservation state file (/software/x86_64/slurm/state2/resv_state.old) to recover
  [2018-01-02T16:03:35.497] error: Could not open trigger state file /software/x86_64/slurm/state2/trigger_state: No such file or directory
  [2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Triggers may be lost!
  [2018-01-02T16:03:35.497] No trigger state file (/software/x86_64/slurm/state2/trigger_state.old) to recover
  [2018-01-02T16:03:35.497] _preserve_plugins: backup_controller not specified
  [2018-01-02T16:03:35.497] Reinitializing job accounting state
  [2018-01-02T16:03:35.497] Ending any jobs in accounting that were running when controller went down on
  [2018-01-02T16:03:35.497] cons_res: select_p_reconfigure
  [2018-01-02T16:03:35.497] cons_res: select_p_node_init
  [2018-01-02T16:03:35.497] cons_res: preparing for 3 partitions
  [2018-01-02T16:03:35.497] Running as primary controller
  [2018-01-02T16:03:35.497] Registering slurmctld at port 6817 with slurmdbd.
  [2018-01-02T16:03:35.718] error: No fed_mgr state file (/software/x86_64/slurm/state2/fed_mgr_state) to recover
  [2018-01-02T16:03:35.799] No last decay (/software/x86_64/slurm/state2/priority_last_decay_ran) to recover
  [2018-01-02T16:03:35.799] No parameter for mcs plugin, default values set
  [2018-01-02T16:03:35.799] mcs: MCSParameters = (null). ondemand set.
  [2018-01-02T16:04:00.078] agent msg_type=1001 ran for 22 seconds
  [2018-01-02T16:04:35.831] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
  [2018-01-02T16:05:40.130] agent msg_type=1001 ran for 22 seconds

However, back on service1, there are some error messages in the slurmdbd logs:

  [2017-12-15T13:57:28.100] slurmdbd version 17.11.0 started
  [2018-01-02T15:40:09.581] error: Invalid msg_size (1397966893) from connection 10(192.52.98.128) uid(-2)
  [2018-01-02T16:00:34.641] error: Processing last message from connection 8(192.52.98.128) uid(33144)
  [2018-01-02T16:00:34.681] error: Processing last message from connection 8(192.52.98.128) uid(33144)

Are those of concern?
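Side note on the "Could not open ... state file" errors above: those are expected on the very first start with an empty StateSaveLocation. The directory itself should exist with tight permissions and be writable by the SlurmUser so state actually gets saved for later restarts. A hedged sketch (the path is the one from the logs; the helper name is mine):

```shell
#!/bin/sh
# ensure_statedir DIR: create DIR (mode 0700) if it doesn't exist.
# The StateSaveLocation must be writable by the SlurmUser; chown it
# separately if slurmctld does not run as root.
ensure_statedir() {
    install -d -m 0700 "$1"
}

# e.g. on europa: ensure_statedir /software/x86_64/slurm/state2
```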
Good to hear. No issues using an IP instead of a DNS name. I'm guessing those messages are from the telnet connection. I get similar messages when I telnet in as well, e.g.

  slurmdbd: debug2: Opened connection 8 from 127.0.0.1
  slurmdbd: error: Could not read msg_size from connection 8(127.0.0.1) uid(-2)
  slurmdbd: debug2: Closed connection 8 uid(-2)

If you run "sacctmgr show clusters" you should see both clusters now.
Yep:

  [root@service1 tmp]# sacctmgr show clusters
     Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
  ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
      europa   192.52.98.128         6817  8192         1                                                                                           normal
          l1     10.148.2.14         6817  8192         1                                                                                           normal
  [root@service1 tmp]#

Thanks for the help. You might keep this open a little longer - I'm guessing I'm going to need some help with a couple other items as I keep going...
No problem. Glad it's working for you now. How about we close this for now and then you can reopen it if a question/issue is related or you can open new ones if they are separate questions?
Sure, that's fine. Thanks again.
Great. Just let us know if you have more questions. Thanks, Brian