| Summary: | Slurmdbd with a second cluster | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN> |
| Component: | Configuration | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | Severity: | 2 - High Impact |
| Version: | 17.11.1 | Hardware: | Linux |
| OS: | Linux | Site: | Johnson Space Center |
| Attachments: | Current configuration files | | |
Description
NASA JSC Aerolab
2018-01-02 13:54:58 MST
Created attachment 5833: Current configuration files
Brian Christiansen:
I'm assuming you want both clusters connected to the single database on service1, correct? In that case you just need one slurmdbd -- the one on service1. The AccountingStorageHost=service1 setting in each cluster's slurm.conf tells the slurmctlds to contact the slurmdbd on service1. Have you tried starting the slurmctld on Europa? Are you using the same munge key for both clusters?

NASA JSC Aerolab:
Correct - we want both clusters to use a common database. Yes, I am using the same munge key. Yes, I have tried starting slurmctld, but this is what I get:

```
[2018-01-02T15:14:28.690] Job accounting information stored, but details not gathered
[2018-01-02T15:14:28.690] slurmctld version 17.11.1-2 started on cluster europa
[2018-01-02T15:14:28.694] job_submit.lua: initialized
[2018-01-02T15:14:30.700] error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
[2018-01-02T15:14:30.700] error: slurmdbd: Sending PersistInit msg: Connection timed out
[2018-01-02T15:14:30.700] error: Association database appears down, reading from state file.
[2018-01-02T15:14:30.700] error: Unable to get any information from the state file
[2018-01-02T15:14:30.700] fatal: slurmdbd and/or database must be up at slurmctld start time
```

That's what led me down the path of trying to get slurmdbd running on Europa. I figured maybe a slurmdbd running on Europa would communicate with the slurmdbd running on service1 (and the slurmctld on Europa would talk to the slurmdbd running on Europa).

Brian Christiansen:
Can you connect to service1 on port 6819 from europa? I'm wondering if the ports are open, e.g.:

```
telnet service1 6819
```

You should be able to run sacctmgr commands from europa as well (even with the slurmctld down). What does `sacctmgr show clusters` give you?

NASA JSC Aerolab:
Yep, ports are open:

```
[root@europa init.d]# telnet 192.52.98.29 6819
Trying 192.52.98.29...
Connected to 192.52.98.29.
Escape character is '^]'.
^]
telnet>
```

sacctmgr doesn't work either:

```
[root@europa init.d]# cd /software/x86_64/slurm/17.11.1/bin/
[root@europa bin]# ./sacctmgr show clusters
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to service1:6819: Connection timed out
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
sacctmgr: error: Problem talking to the database: Connection timed out
[root@europa bin]#
```

On Europa, the DNS name "service1" isn't going to resolve to the correct host on L1. If I change it to the correct IP address, it works:

```
[root@europa bin]# vi ../etc/slurm.conf
[root@europa bin]# grep AccountingStorageHost ../etc/slurm.conf
AccountingStorageHost=192.52.98.29
[root@europa bin]# ./sacctmgr show clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
        l1     10.148.2.14         6817  8192         1                                                                                            normal
[root@europa bin]#
```

Any issue with using an IP for AccountingStorageHost?
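For reference, a minimal sketch of the single-slurmdbd layout described above. Only the hostnames, cluster name, port, and IP are taken from this ticket; everything else is an illustrative placeholder, not the site's actual configuration:

```
# slurmdbd.conf on service1 -- the only slurmdbd needed for both clusters
# (storage settings below are illustrative placeholders)
DbdHost=service1
DbdPort=6819
StorageType=accounting_storage/mysql
StorageHost=localhost

# slurm.conf on the europa cluster -- point accounting at service1's
# slurmdbd. Because "service1" does not resolve correctly from europa,
# the raw IP is used here; an /etc/hosts entry mapping service1 to
# 192.52.98.29 on the europa side would work just as well.
ClusterName=europa
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=192.52.98.29
AccountingStoragePort=6819
```

A second cluster is also typically registered in the accounting database with `sacctmgr add cluster europa` if it does not show up on its own.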
NASA JSC Aerolab:
Definite progress. I'm able to start slurmctld on Europa:

```
[root@europa init.d]# ./slurm start
starting slurmctld:                                        [  OK  ]
[root@europa init.d]#
```

And the logs seem reasonable:

```
[2018-01-02T16:03:34.832] Job accounting information stored, but details not gathered
[2018-01-02T16:03:34.832] slurmctld version 17.11.1-2 started on cluster europa
[2018-01-02T16:03:34.837] job_submit.lua: initialized
[2018-01-02T16:03:35.207] layouts: no layout to initialize
[2018-01-02T16:03:35.213] layouts: loading entities/relations information
[2018-01-02T16:03:35.213] error: Could not open node state file /software/x86_64/slurm/state2/node_state: No such file or directory
[2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Information may be lost!
[2018-01-02T16:03:35.213] No node state file (/software/x86_64/slurm/state2/node_state.old) to recover
[2018-01-02T16:03:35.213] error: Could not open job state file /software/x86_64/slurm/state2/job_state: No such file or directory
[2018-01-02T16:03:35.213] error: NOTE: Trying backup state save file. Jobs may be lost!
[2018-01-02T16:03:35.213] No job state file (/software/x86_64/slurm/state2/job_state.old) to recover
[2018-01-02T16:03:35.213] cons_res: select_p_node_init
[2018-01-02T16:03:35.213] cons_res: preparing for 3 partitions
[2018-01-02T16:03:35.497] error: Could not open reservation state file /software/x86_64/slurm/state2/resv_state: No such file or directory
[2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Reservations may be lost
[2018-01-02T16:03:35.497] No reservation state file (/software/x86_64/slurm/state2/resv_state.old) to recover
[2018-01-02T16:03:35.497] error: Could not open trigger state file /software/x86_64/slurm/state2/trigger_state: No such file or directory
[2018-01-02T16:03:35.497] error: NOTE: Trying backup state save file. Triggers may be lost!
[2018-01-02T16:03:35.497] No trigger state file (/software/x86_64/slurm/state2/trigger_state.old) to recover
[2018-01-02T16:03:35.497] _preserve_plugins: backup_controller not specified
[2018-01-02T16:03:35.497] Reinitializing job accounting state
[2018-01-02T16:03:35.497] Ending any jobs in accounting that were running when controller went down on
[2018-01-02T16:03:35.497] cons_res: select_p_reconfigure
[2018-01-02T16:03:35.497] cons_res: select_p_node_init
[2018-01-02T16:03:35.497] cons_res: preparing for 3 partitions
[2018-01-02T16:03:35.497] Running as primary controller
[2018-01-02T16:03:35.497] Registering slurmctld at port 6817 with slurmdbd.
[2018-01-02T16:03:35.718] error: No fed_mgr state file (/software/x86_64/slurm/state2/fed_mgr_state) to recover
[2018-01-02T16:03:35.799] No last decay (/software/x86_64/slurm/state2/priority_last_decay_ran) to recover
[2018-01-02T16:03:35.799] No parameter for mcs plugin, default values set
[2018-01-02T16:03:35.799] mcs: MCSParameters = (null). ondemand set.
[2018-01-02T16:04:00.078] agent msg_type=1001 ran for 22 seconds
[2018-01-02T16:04:35.831] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2018-01-02T16:05:40.130] agent msg_type=1001 ran for 22 seconds
```
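The "Could not open ... state file" errors above are expected on a first start with an empty state directory. One detail worth noting for a shared filesystem like this one: each cluster's slurmctld must have its own StateSaveLocation. A one-line sketch using the directory from the log (the path is the only detail taken from this ticket):

```
# slurm.conf on europa -- state directory private to this cluster;
# l1's slurmctld must point at a different directory, or the two
# controllers will overwrite each other's node/job state.
StateSaveLocation=/software/x86_64/slurm/state2
```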
However, back on service1, there are some error messages in the slurmdbd logs:

```
[2017-12-15T13:57:28.100] slurmdbd version 17.11.0 started
[2018-01-02T15:40:09.581] error: Invalid msg_size (1397966893) from connection 10(192.52.98.128) uid(-2)
[2018-01-02T16:00:34.641] error: Processing last message from connection 8(192.52.98.128) uid(33144)
[2018-01-02T16:00:34.681] error: Processing last message from connection 8(192.52.98.128) uid(33144)
```

Are those of concern?

Brian Christiansen:
Good to hear. No issues using an IP instead of a DNS name. I'm guessing those messages are from the telnet connection. I get similar messages when I telnet in as well, e.g.:

```
slurmdbd: debug2: Opened connection 8 from 127.0.0.1
slurmdbd: error: Could not read msg_size from connection 8(127.0.0.1) uid(-2)
slurmdbd: debug2: Closed connection 8 uid(-2)
```

If you run `sacctmgr show clusters` you should see both clusters now.

NASA JSC Aerolab:
Yep:

```
[root@service1 tmp]# sacctmgr show clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
    europa   192.52.98.128         6817  8192         1                                                                                            normal
        l1     10.148.2.14         6817  8192         1                                                                                            normal
[root@service1 tmp]#
```

Thanks for the help. You might keep this open a little longer - I'm guessing I'm going to need some help with a couple of other items as I keep going...

Brian Christiansen:
No problem. Glad it's working for you now. How about we close this for now? You can reopen it if a question/issue is related, or open new tickets for separate questions.

NASA JSC Aerolab:
Sure, that's fine. Thanks again.

Brian Christiansen:
Great. Just let us know if you have more questions.

Thanks,
Brian
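As a footnote on those harmless slurmdbd errors: any raw TCP client appears to produce them, since slurmdbd reads the first bytes of a new connection as a message length. A quick way to reproduce, assuming nc is available (only the IP and port are taken from this ticket):

```
# slurmdbd logs "Invalid msg_size" / "Could not read msg_size" and
# drops the connection -- the same effect as the earlier telnet test.
echo hello | nc 192.52.98.29 6819
```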