Stuck on this issue for a while, hoping someone can provide some clues:

slurmctld: error: Error binding slurm stream socket: Invalid argument

I quickly traced the error message to slurm_init_msg_engine() and backtracked from there. I added the debug code below (output also shown), which dumps the port and address bytes starting at __ss_pad1 in the sockaddr_storage structure.

port bytes = 26 161
address bytes = 0 0 0 0 0 0 0 0 0 0 0 0

I'm wondering if the address is being constructed correctly. slurm_setup_addr() is calling slurm_set_addr(&s_addr, port, NULL); and not getting the real host IP. Does this seem correct? I'm not an expert on TCP/IP programming.

Thanks,

JB

P.S. Also working on a 21.08.1 port, but still some build errors to resolve there.

--- src/common/slurm_protocol_api.c.orig	2021-07-01 22:28:38 UTC
+++ src/common/slurm_protocol_api.c
@@ -2866,6 +2866,7 @@ extern void slurm_setup_addr(slurm_addr_t *sin, uint16
 {
 	static slurm_addr_t s_addr = { 0 };
 
+	// Useless with unconditional memcpy() below
 	memset(sin, 0, sizeof(*sin));
 
 	if (slurm_addr_is_unspec(&s_addr)) {
@@ -2880,7 +2881,7 @@ extern void slurm_setup_addr(slurm_addr_t *sin, uint16
 		else
 			var = "NoInAddrAny";
 
-		if (xstrcasestr(slurm_conf.comm_params, var)) {
+		if ( xstrcasestr(slurm_conf.comm_params, var)) {
 			char host[MAXHOSTNAMELEN];
 
 			if (!gethostname(host, MAXHOSTNAMELEN)) {
@@ -2889,13 +2890,20 @@ extern void slurm_setup_addr(slurm_addr_t *sin, uint16
 				fatal("%s: Can't get hostname or addr: %m", __func__);
 		} else {
+			fprintf(stderr, "Using NULL host for slurm_set_addr().\n");
 			slurm_set_addr(&s_addr, port, NULL);
 		}
 	}
 
+	fprintf(stderr, "Back from slurm_set_addr\n");
+	for (int c = 0; c < 14; ++c)
+		fprintf(stderr, "%u ",
+			(unsigned char)s_addr.__ss_pad1[c]);
+	putc('\n', stderr);
 	memcpy(sin, &s_addr, sizeof(*sin));
 	slurm_set_port(sin, port);
 	log_flag(NET, "%s: update address to %pA", __func__, sin);
+	fprintf(stderr, "slurm_setup_addr(): ss_len = %u\n", sin->ss_len);
 }
 
 /*
--- src/common/slurm_protocol_socket.c.orig	2021-07-01 22:28:38 UTC
+++ src/common/slurm_protocol_socket.c
@@ -120,6 +120,9 @@ static void _sock_bind_wild(int sockfd)
 
 	slurm_setup_addr(&sin, RANDOM_USER_PORT);
 
+	fprintf(stderr, "sockfd = %d sin = %p\n", sockfd, (void *)&sin);
+	fprintf(stderr, "ss_len = %u ss_family = %u __ss_align = %lu\n",
+		sin.ss_len, sin.ss_family, sin.__ss_align);
 	for (retry=0; retry < PORT_RETRIES ; retry++) {
 		rc = bind(sockfd, (struct sockaddr *) &sin, sizeof(sin));
 		if (rc >= 0)
@@ -422,6 +425,15 @@ extern int slurm_init_msg_engine(slurm_addr_t *addr)
 		goto error;
 	}
 
+	fprintf(stderr, "slurm_init_msg_engine()...\n");
+	fprintf(stderr, "fd = %d addr = %p\n", fd, addr);
+	fprintf(stderr, "ss_len = %u ss_family = %u __ss_align = %lu\n",
+		addr->ss_len, addr->ss_family, addr->__ss_align);
+	fprintf(stderr, "sizeof(*addr) = %zu\n", sizeof(*addr));
+	// fprintf(stderr, "port = %u\n", (unsigned short)(addr->__ss_pad1[1] + (addr->__ss_pad1[0] << 8)));
+	for (int c = 0; c < 14; ++c)
+		fprintf(stderr, "%u ", (unsigned char)addr->__ss_pad1[c]);
+	putc('\n', stderr);
 	rc = bind(fd, (struct sockaddr const *) addr, sizeof(*addr));
 	if (rc < 0) {
 		error("Error binding slurm stream socket: %m");
@@ -669,6 +681,7 @@ extern void slurm_set_addr(slurm_addr_t *addr, uint16_
 	memcpy(addr, ai_ptr->ai_addr, ai_ptr->ai_addrlen);
 	log_flag(NET, "%s: update addr. addr='%pA'", __func__, addr);
 	freeaddrinfo(ai_start);
+	fprintf(stderr, "slurm_set_addr(): ss_len = %u\n", addr->ss_len);
 }
 
 extern void slurm_pack_slurm_addr(slurm_addr_t *addr, Buf buffer)

<<<ROOT@coral.acadix>>> /usr/ports/wip/slurm-wlm-devel 1203 # slurmctld -D
slurmctld: Stack size set to 536870912
slurmctld: slurmctld version 20.11.8 started on cluster beastie
slurmctld: No memory enforcing mechanism configured.
slurm_set_addr(): ss_len = 16
slurmctld: error: Could not open node state file /var/spool/slurm/ctld/node_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/var/spool/slurm/ctld/node_state.old) to recover
slurmctld: error: Could not open job state file /var/spool/slurm/ctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/var/spool/slurm/ctld/job_state.old) to recover
slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
slurmctld: error: Could not open reservation state file /var/spool/slurm/ctld/resv_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
slurmctld: No reservation state file (/var/spool/slurm/ctld/resv_state.old) to recover
slurmctld: error: Could not open trigger state file /var/spool/slurm/ctld/trigger_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/var/spool/slurm/ctld/trigger_state.old) to recover
slurmctld: read_slurm_conf: backup_controller not specified
slurmctld: Reinitializing job accounting state
slurmctld: select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
Using NULL host for slurm_set_addr().
slurm_set_addr(): ss_len = 16
Back from slurm_set_addr
26 161 0 0 0 0 0 0 0 0 0 0 0 0
slurm_setup_addr(): ss_len = 16
slurm_init_msg_engine()...
fd = 3 addr = 0x7fffdfdfbd98
ss_len = 16 ss_family = 2 __ss_align = 0
sizeof(*addr) = 128
26 161 0 0 0 0 0 0 0 0 0 0 0 0
slurmctld: error: Error binding slurm stream socket: Invalid argument
slurmctld: fatal: slurm_init_msg_engine_port error Invalid argument
I think I found the problem. From the FreeBSD bind(2) man page:

ERRORS
    The bind() system call will fail if:
    ...
    [EINVAL]    The addrlen argument is not a valid length for the address family.
    ...

Based on this, I think using bind(fd, addr, sizeof(*addr)) is incorrect and it should be bind(fd, addr, addr->ss_len).

sizeof(*addr) is sizeof(struct sockaddr_storage), which is *NOT* sensitive to the address family in use. It includes padding that is not used for IPv4 addresses, and it evaluates to 128 on FreeBSD, whereas ss_len is 16 for the IPv4 address in the output above.

When I change the addrlen argument to addr->ss_len, slurmctld starts successfully.

I think every bind() call in the SLURM sources will have to be checked for this.
The program below prints 16 and 128 on both FreeBSD and CentOS. It seems the Linux bind() tolerates a mismatch between addr->ss_len and the addrlen passed as the third argument, but FreeBSD's does not.

The Linux bind(2) man page is a little more vague about addrlen, not defining what "wrong" means:

    EINVAL The addrlen is wrong, or the socket was not in the AF_UNIX family.

#include <stdio.h>
#include <sysexits.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
    printf("sockaddr: %zu sockaddr_storage: %zu\n",
           sizeof(struct sockaddr), sizeof(struct sockaddr_storage));
    return EX_OK;
}