Ticket 18664 - slurmd doesn't start in the training nodes in Scaleout
Summary: slurmd doesn't start in the training nodes in Scaleout
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scaleout
Version: 23.11.1
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-01-12 11:18 MST by Serguei Mokhov
Modified: 2024-02-02 13:07 MST
CC List: 2 users

See Also:
Site: Concordia University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: AlmaLinux
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Serguei Mokhov 2024-01-12 11:18:36 MST
This is a fork of ticket 18620 for this issue:

From https://bugs.schedmd.com/show_bug.cgi?id=18620#c15

The Training cluster is up, but
has a few issues.

[root@mgmtnode ~]# slurmd -V
slurm 23.11.1


- It seems slurmd failed to start on each of node[00-09]
- it appears the cluster added pam_slurm_adopt but ssh still lets users through
- I see idle cloud[xxx] nodes?? I did not do a cloud build
- I can't ssh to the 'db' node to check it, but I can ping it

Otherwise I can freely ssh between the nodes, etc. Here are
some diagnostic excerpts:

[root@mgmtnode ~]# sinfo -la
Fri Jan 12 12:06:49 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
cloud        up   infinite 1-infinite   no       NO        all   1025       idle~             cloud[0000-1024]
debug*       up   infinite 1-infinite   no       NO        all     10    unknown*             node[00-09]

[root@mgmtnode ~]# ssh db
ssh: connect to host db port 22: Connection refused
[root@mgmtnode ~]# ping db
PING db(db (2001:db8:1:1::1:3)) 56 data bytes
64 bytes from db (2001:db8:1:1::1:3): icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from db (2001:db8:1:1::1:3): icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from db (2001:db8:1:1::1:3): icmp_seq=3 ttl=64 time=0.035 ms
64 bytes from db (2001:db8:1:1::1:3): icmp_seq=4 ttl=64 time=0.027 ms
^C
--- db ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3098ms
rtt min/avg/max/mdev = 0.027/0.036/0.045/0.009 ms

slurmctld is fine:

[root@mgmtnode ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/slurmctld.service.d
           └─cluster.conf
           /etc/systemd/system/slurmctld.service.d
           └─local.conf
   Active: active (running) since Fri 2024-01-12 04:15:03 EST; 7h ago
  Process: 173 ExecStartPost=/usr/local/bin/slurmctld.startup2.sh (code=exited, status=0/SUCCESS)
  Process: 113 ExecStartPre=/usr/local/bin/slurmctld.startup.sh (code=exited, status=0/SUCCESS)
  Process: 112 ExecStartPre=/usr/bin/chown slurm:slurm /var/log/slurmctld.log (code=exited, status=0/SUCCESS)
  Process: 111 ExecStartPre=/usr/bin/touch /var/log/slurmctld.log (code=exited, status=0/SUCCESS)
  Process: 110 ExecStartPre=/usr/bin/chmod -R 0770 /auth (code=exited, status=0/SUCCESS)
  Process: 107 ExecStartPre=/usr/bin/chown -R slurm:slurm /auth /etc/slurm/ (code=exited, status=0/SUCCESS)
 Main PID: 166 (slurmctld)
    Tasks: 36
   Memory: 46.2M
   CGroup: /docker.slice/docker-2b1c74d8bc42dee8dbe55d6a796b56bb3d4500586ca08fe0c94bd56f92196cc8.scope/system.slice/slurmctld.service
           ├─166 /usr/local/sbin/slurmctld --systemd
           └─167 slurmctld: slurmscriptd

Jan 12 11:52:19 mgmtnode slurmctld[166]: slurmctld: debug:  Spawning registration agent for node[00-09] 10 hosts
Jan 12 11:52:31 mgmtnode slurmctld[166]: slurmctld: debug:  Spawning registration agent for node[00-09] 10 hosts
Jan 12 11:52:31 mgmtnode slurmctld[166]: slurmctld: error: Nodes node[00-09] not responding
Jan 12 11:52:39 mgmtnode slurmctld[166]: slurmctld: debug:  Requesting control from backup controller mgmtnode2
Jan 12 11:52:39 mgmtnode slurmctld[166]: slurmctld: debug:  backup controller mgmtnode2 responding
Jan 12 11:52:43 mgmtnode slurmctld[166]: slurmctld: error: Nodes node[00-09] not responding
Jan 12 11:52:43 mgmtnode slurmctld[166]: slurmctld: debug:  Spawning registration agent for node[00-09] 10 hosts
Jan 12 11:52:52 mgmtnode slurmctld[166]: slurmctld: debug:  sched: Running job scheduler for full queue.
Jan 12 11:52:55 mgmtnode slurmctld[166]: slurmctld: debug:  Spawning registration agent for node[00-09] 10 hosts
Jan 12 11:52:55 mgmtnode slurmctld[166]: slurmctld: error: Nodes node[00-09] not responding

but slurmd fails even after a restart. All the nodes I spot-checked look like this:

[root@node07 ~]# systemctl list-units
  UNIT                                   LOAD   ACTIVE SUB       DESCRIPTION                                          
  -.mount                                loaded active mounted   Root Mount                                           
  dev-log.mount                          loaded active mounted   /dev/log                                             
  dev-mqueue.mount                       loaded active mounted   POSIX Message Queue File System                      
  etc-hostname.mount                     loaded active mounted   /etc/hostname                                        
  etc-hosts.mount                        loaded active mounted   /etc/hosts                                           
  etc-resolv.conf.mount                  loaded active mounted   /etc/resolv.conf                                     
  etc-slurm.mount                        loaded active mounted   /etc/slurm                                           
  etc-ssh.mount                          loaded active mounted   /etc/ssh                                             
  home.mount                             loaded active mounted   /home                                                
  proc-acpi.mount                        loaded active mounted   /proc/acpi                                           
  proc-bus.mount                         loaded active mounted   /proc/bus                                            
  proc-fs.mount                          loaded active mounted   /proc/fs                                             
  proc-irq.mount                         loaded active mounted   /proc/irq                                            
  proc-kcore.mount                       loaded active mounted   /proc/kcore                                          
  proc-keys.mount                        loaded active mounted   /proc/keys                                           
  proc-scsi.mount                        loaded active mounted   /proc/scsi                                           
  proc-sysrq\x2dtrigger.mount            loaded active mounted   /proc/sysrq-trigger                                  
  proc-timer_list.mount                  loaded active mounted   /proc/timer_list                                     
  root.mount                             loaded active mounted   root.mount                                           
  run-lock.mount                         loaded active mounted   /run/lock                                            
  srv-containers.mount                   loaded active mounted   /srv/containers                                      
● sys-fs-fuse-connections.mount          masked active mounted   sys-fs-fuse-connections.mount                        
  sys-fs-fuse.mount                      loaded active mounted   /sys/fs/fuse                                         
  tmp.mount                              loaded active mounted   Temporary Directory (/tmp)                           
  usr-local-src.mount                    loaded active mounted   /usr/local/src                                       
  usr-share-zoneinfo-UTC.mount           loaded active mounted   /usr/share/zoneinfo/UTC                              
  var-lib-journal.mount                  loaded active mounted   /var/lib/journal                                     
  var-spool-mail.mount                   loaded active mounted   /var/spool/mail                                      
  systemd-ask-password-console.path      loaded active waiting   Dispatch Password Requests to Console Directory Watch
  systemd-ask-password-wall.path         loaded active waiting   Forward Password Requests to Wall Directory Watch    
  init.scope                             loaded active running   System and Service Manager                           
  crond.service                          loaded active running   Command Scheduler                                    
  dbus.service                           loaded active running   D-Bus System Message Bus                             
  dracut-shutdown.service                loaded active exited    Restore /run/initramfs on shutdown                   
  ldconfig.service                       loaded active exited    Rebuild Dynamic Linker Cache                         
  munge.service                          loaded active running   MUNGE authentication service                         
  selinux-autorelabel-mark.service       loaded active exited    Mark the need to relabel after reboot                
● slurmd.service                         loaded failed failed    Slurm node daemon                                    
  sshd.service                           loaded active running   OpenSSH server daemon                                
  systemd-journal-catalog-update.service loaded active exited    Rebuild Journal Catalog                              
  systemd-journal-flush.service          loaded active exited    Flush Journal to Persistent Storage                  
  systemd-journald.service               loaded active running   Journal Service                                      
  systemd-sysusers.service               loaded active exited    Create System Users                                  
  systemd-tmpfiles-setup.service         loaded active exited    Create Volatile Files and Directories                
  systemd-update-done.service            loaded active exited    Update is Completed                                  
  systemd-update-utmp.service            loaded active exited    Update UTMP about System Boot/Shutdown               
  systemd-user-sessions.service          loaded active exited    Permit User Sessions                                 
  -.slice                                loaded active active    Root Slice                                           
[root@node07 ~]# systemctl slurmd status
Unknown operation slurmd.
[root@node07 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/slurmd.service.d
           └─cluster.conf
           /etc/systemd/system/slurmd.service.d
           └─local.conf
   Active: failed (Result: exit-code) since Fri 2024-01-12 04:14:29 EST; 7h ago
  Process: 121 ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
  Process: 114 ExecStartPre=/usr/local/bin/slurmd.startup.sh (code=exited, status=0/SUCCESS)
  Process: 113 ExecStartPre=/usr/bin/chown slurm:slurm /var/log/slurmd.log (code=exited, status=0/SUCCESS)
  Process: 112 ExecStartPre=/usr/bin/touch /var/log/slurmd.log (code=exited, status=0/SUCCESS)
 Main PID: 121 (code=exited, status=1/FAILURE)

Jan 12 04:14:29 node07 slurmd[121]: slurmd: error: ProctrackType 1 specified more than once, latest value used
Jan 12 04:14:29 node07 slurmd[121]: slurmd: debug:  Log file re-opened
Jan 12 04:14:29 node07 slurmd[121]: slurmd: debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
Jan 12 04:14:29 node07 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/
Jan 12 04:14:29 node07 slurmd[121]: 7:freezer:/
Jan 12 04:14:29 node07 slurmd[121]: 6:perf_event:/
Jan 12 04:14:29 node07 slurmd[121]: 5:net_cls,net_prio:/
Jan 12 04:14:29 node07 slurmd[121]: 1:name=systemd:/
Jan 12 04:14:29 node07 slurmd[121]: 0::/docker.slice/docker-07a88b0251dce9e294f193e08a3c5a8821a24e089f927876045b1542a43e1023.scope/init.scope
[root@node07 ~]# systemctl restart slurmd
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
[root@node07 ~]# journalctl -xe
-- The system journal process has started up, opened the journal
-- files for writing and is now ready to process requests.
Jan 12 04:14:29 node07 systemd-journald[70]: Runtime journal (/run/log/journal/040e07a4495d48b1925d320d1aa78b73) is 8.0M, max 4.0G, 3.9G free.
-- Subject: Disk space used by the journal
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Runtime journal (/run/log/journal/040e07a4495d48b1925d320d1aa78b73) is currently using 8.0M.
-- Maximum allowed usage is set to 4.0G.
-- Leaving at least 4.0G free (of currently available 20.2G of disk space).
-- Enforced usage limit is thus 4.0G, of which 3.9G are still available.
-- 
-- The limits controlling how much disk space is used by the journal may
-- be configured with SystemMaxUse=, SystemKeepFree=, SystemMaxFileSize=,
-- RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize= settings in
-- /etc/systemd/journald.conf. See journald.conf(5) for details.
Jan 12 04:14:29 node07 systemd-journald[70]: Runtime journal (/run/log/journal/040e07a4495d48b1925d320d1aa78b73) is 8.0M, max 4.0G, 3.9G free.
-- Subject: Disk space used by the journal
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Runtime journal (/run/log/journal/040e07a4495d48b1925d320d1aa78b73) is currently using 8.0M.
-- Maximum allowed usage is set to 4.0G.
-- Leaving at least 4.0G free (of currently available 20.2G of disk space).
-- Enforced usage limit is thus 4.0G, of which 3.9G are still available.
-- 
-- The limits controlling how much disk space is used by the journal may
-- be configured with SystemMaxUse=, SystemKeepFree=, SystemMaxFileSize=,
-- RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize= settings in
-- /etc/systemd/journald.conf. See journald.conf(5) for details.
Jan 12 04:14:29 node07 systemd-tmpfiles[76]: [/usr/local/lib/tmpfiles.d/munge.conf:1] Line references path below legacy directory /var/run/, updating /var/run/munge → /run/munge; please update the tmpfiles.d/ drop-in file accordingly.
Jan 12 04:14:29 node07 slurmd[121]: slurmd: error: ProctrackType 1 specified more than once, latest value used
Jan 12 04:14:29 node07 slurmd[121]: slurmd: debug:  Log file re-opened
Jan 12 04:14:29 node07 slurmd[121]: slurmd: debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
Jan 12 04:14:29 node07 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/
Jan 12 04:14:29 node07 slurmd[121]: 7:freezer:/
Jan 12 04:14:29 node07 slurmd[121]: 6:perf_event:/
Jan 12 04:14:29 node07 slurmd[121]: 5:net_cls,net_prio:/
Jan 12 04:14:29 node07 slurmd[121]: 1:name=systemd:/
Jan 12 04:14:29 node07 slurmd[121]: 0::/docker.slice/docker-07a88b0251dce9e294f193e08a3c5a8821a24e089f927876045b1542a43e1023.scope/init.scope
Jan 12 11:51:24 node07 slurmd[258]: slurmd: error: ProctrackType 1 specified more than once, latest value used
Jan 12 11:51:24 node07 slurmd[258]: slurmd: debug:  Log file re-opened
Jan 12 11:51:24 node07 slurmd[258]: slurmd: debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
Jan 12 11:51:24 node07 slurmd[258]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/
Jan 12 11:51:24 node07 slurmd[258]: 7:freezer:/
Jan 12 11:51:24 node07 slurmd[258]: 6:perf_event:/
Jan 12 11:51:24 node07 slurmd[258]: 5:net_cls,net_prio:/
Jan 12 11:51:24 node07 slurmd[258]: 1:name=systemd:/
Jan 12 11:51:24 node07 slurmd[258]: 0::/docker.slice/docker-07a88b0251dce9e294f193e08a3c5a8821a24e089f927876045b1542a43e1023.scope/init.scope

pam_slurm_adopt is not working for user wilma (BTW, it does work in our real el7 and el9 clusters; just reporting this with the test cluster I just built):

[root@node09 /]# uname -a
Linux node09 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 7 14:54:22 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@node09 /]# ssh wilma@node01
wilma@node01's password: 
Access denied by pam_slurm_adopt: you have no active jobs on this node
[wilma@node01 ~]$ netstat -pnat
(No info could be read for "-p": geteuid()=1013 but you should be root.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.11:40891        0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -                   
tcp6       0      0 2001:db8:1:1::5:11:22   2001:db8:1:1::5:1:58838 ESTABLISHED -                   
[wilma@node01 ~]$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    0      0        0 eth0
10.11.0.0       0.0.0.0         255.255.0.0     U     0      0        0 eth0


The main issue for me now is to get slurmd working on node[00-09]. What could be the cause? Thanks!
Comment 1 Nate Rini 2024-01-12 11:27:31 MST
(In reply to Serguei Mokhov from comment #0)
> This is a fork of a ticket from bug ID 18620 for this issue:

Thanks. Having 1 issue per ticket really helps us keep track of all of the issues and avoid a lot of confusion.

> - It seems slurmd failed to start on each node[00-09]

The relevant log was in comment#0:
> Jan 12 04:14:29 node07 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/

As noted here:
> https://slurm.schedmd.com/cgroups.html

The hybrid cgroup mode is not supported by Slurm and has been removed in more recent kernels. Please configure cgroups v1 or v2 per the above link on the host system to resolve this error.
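
A quick way to confirm which mode a node is actually running (generic commands, nothing specific to this deployment):

> # generic checks; adjust if your setup differs
> mount | grep cgroup         # a single cgroup2 mount means v2-only; extra "cgroup" (v1) mounts mean legacy/hybrid
> stat -fc %T /sys/fs/cgroup  # prints cgroup2fs on a unified (v2) host, tmpfs on a legacy/hybrid one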


> - it appears the cluster added slurm_pam_adopt but ssh still allows through

One of the training labs includes adding pam_deny to /etc/pam.d/sshd2 to enforce pam_slurm_adopt. This needs to be done manually if you wish to see it enforced.
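
For reference, an illustrative account stack for that (the exact lines used in the lab may differ) looks like:

> # illustrative only; not necessarily the lab's exact content
> account    sufficient   pam_slurm_adopt.so
> account    required     pam_deny.so

With pam_deny.so placed after it, any ssh session that pam_slurm_adopt cannot associate with a job on the node is rejected instead of falling through to the rest of the stack.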

> - I see idle cloud[xxx] nodes?? I did not do a cloud build

They are included in the default config to avoid having to maintain multiple versions of it. If the cluster was not built in cloud mode, they can be removed from slurm.conf or safely ignored.
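
If you do want to strip them out, look for the cloud node and partition definitions in slurm.conf, along these lines (illustrative; the generated config will differ):

> # illustrative only - actual attributes come from the scaleout build
> NodeName=cloud[0000-1024] State=CLOUD
> PartitionName=cloud Nodes=cloud[0000-1024] State=UP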

> - I can't ssh to the 'db' note and check it but can ping it

Try calling `make HOST=db bash` instead. The container image is provided by the mysql project and does not include sshd.
Comment 2 Serguei Mokhov 2024-01-12 12:01:45 MST
(In reply to Nate Rini from comment #1)

> > - It seems slurmd failed to start on each node[00-09]
> 
> The relevant log was in comment#0:
> > Jan 12 04:14:29 node07 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/
> 
> As noted here:
> > https://slurm.schedmd.com/cgroups.html
> 
> The hybrid cgroup mode is not supported by Slurm and has been removed in
> more recent kernels. Please configure cgroups v1 or v2 per the above link on
> the host system to resolve this error.

Oh, that's the parent system's cgroups, not the containers'...
Let me see how I can fix that...


> > - it appears the cluster added slurm_pam_adopt but ssh still allows through
> 
> One of the training labs includes adding the pam_deny to /etc/pam.d/sshd2 to
> enforce pam_slurm_adopt. This will need to be done manually if you wish to
> see it enforced.
> 
> > - I see idle cloud[xxx] nodes?? I did not do a cloud build
> 
> they are included in the default config to avoid having multiple versions.
> if not built with cloud mode, then they can be removed from slurm.conf or
> safely ignored.
> 
> > - I can't ssh to the 'db' note and check it but can ping it
> 
> Try calling `make HOST=db bash` instead. The container image is provided the
> the mysql project and does not include sshd.

Alright, thanks.
Comment 3 Serguei Mokhov 2024-01-12 12:21:32 MST
(In reply to Serguei Mokhov from comment #2)
> (In reply to Nate Rini from comment #1)
> 
> > > - It seems slurmd failed to start on each node[00-09]
> > 
> > The relevant log was in comment#0:
> > > Jan 12 04:14:29 node07 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 9:devices:/
> > 
> > As noted here:
> > > https://slurm.schedmd.com/cgroups.html
> > 
> > The hybrid cgroup mode is not supported by Slurm and has been removed in
> > more recent kernels. Please configure cgroups v1 or v2 per the above link on
> > the host system to resolve this error.
> 
> Oh, that's the parent system's cgroups not in the containers...
> Let me see how I can fix that...

I don't think the host system has hybrid or v1 cgroups:


# cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	0	837	1
cpu	0	837	1
cpuacct	0	837	1
blkio	0	837	1
memory	0	837	1
devices	9	53	1
freezer	7	1	1
net_cls	5	1	1
perf_event	6	1	1
net_prio	5	1	1
hugetlb	0	837	1
pids	0	837	1
rdma	0	837	1
misc	0	837	1

# ll /sys/fs/cgroup/ 
total 0
-r--r--r--  1 root root 0 Dec 18 19:22 cgroup.controllers
-rw-r--r--  1 root root 0 Jan 12 14:06 cgroup.max.depth
-rw-r--r--  1 root root 0 Jan 12 14:06 cgroup.max.descendants
-rw-r--r--  1 root root 0 Dec 18 19:22 cgroup.procs
-r--r--r--  1 root root 0 Jan 12 14:06 cgroup.stat
-rw-r--r--  1 root root 0 Jan 12 10:25 cgroup.subtree_control
-rw-r--r--  1 root root 0 Jan 12 14:06 cgroup.threads
-r--r--r--  1 root root 0 Jan  8 23:41 cpuset.cpus.effective
-r--r--r--  1 root root 0 Jan  8 23:41 cpuset.mems.effective
-r--r--r--  1 root root 0 Jan 12 14:06 cpu.stat
drwxr-xr-x  2 root root 0 Jan  8 23:47 dev-hugepages.mount
drwxr-xr-x  2 root root 0 Jan  8 23:47 dev-mqueue.mount
drwxr-xr-x 24 root root 0 Jan 12 10:08 docker.slice
drwxr-xr-x  2 root root 0 Dec 18 19:22 init.scope
-r--r--r--  1 root root 0 Jan 12 14:06 io.stat
drwxr-xr-x  2 root root 0 Jan  8 23:47 machine.slice
-r--r--r--  1 root root 0 Jan 12 14:06 memory.numa_stat
--w-------  1 root root 0 Jan 12 14:06 memory.reclaim
-r--r--r--  1 root root 0 Jan 12 14:06 memory.stat
-r--r--r--  1 root root 0 Jan 12 14:06 misc.capacity
drwxr-xr-x  2 root root 0 Jan  8 23:47 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x  2 root root 0 Jan  8 23:47 sys-fs-fuse-connections.mount
drwxr-xr-x  2 root root 0 Jan  8 23:47 sys-kernel-config.mount
drwxr-xr-x  2 root root 0 Jan  8 23:47 sys-kernel-debug.mount
drwxr-xr-x  2 root root 0 Jan  8 23:47 sys-kernel-tracing.mount
drwxr-xr-x 48 root root 0 Jan 12 13:52 system.slice
drwxr-xr-x  3 root root 0 Jan 12 11:44 user.slice

[root@filth docker-scale-out]# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

[root@filth docker-scale-out]# stat -f /sys/fs/cgroup
  File: "/sys/fs/cgroup"
    ID: 0        Namelen: 255     Type: cgroup2fs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 0          Free: 0          Available: 0
Inodes: Total: 0          Free: 0

How does it detect "hybrid"?
Comment 4 Nate Rini 2024-01-12 12:34:43 MST
(In reply to Serguei Mokhov from comment #3)
> How does it detect "hybrid"?

Call:
> cat /proc/self/cgroup
Comment 5 Serguei Mokhov 2024-01-12 14:46:56 MST
(In reply to Nate Rini from comment #4)
> (In reply to Serguei Mokhov from comment #3)
> > How does it detect "hybrid"?
> 
> Call:
> > cat /proc/self/cgroup

# cat /proc/self/cgroup
9:devices:/
7:freezer:/
6:perf_event:/
5:net_cls,net_prio:/
1:name=systemd:/
0::/user.slice/user-11929.slice/session-377.scope

What here does tell me it is hybrid?
Comment 6 Nate Rini 2024-01-12 15:59:23 MST
(In reply to Serguei Mokhov from comment #5)
> (In reply to Nate Rini from comment #4)
> > (In reply to Serguei Mokhov from comment #3)
> > > How does it detect "hybrid"?
> > 
> > Call:
> > > cat /proc/self/cgroup
> 
> # cat /proc/self/cgroup
> 9:devices:/
> 7:freezer:/
> 6:perf_event:/
> 5:net_cls,net_prio:/
> 1:name=systemd:/
> 0::/user.slice/user-11929.slice/session-377.scope
> 
> What here does tell me it is hybrid?

Cgroup v2 looks like this:
> srun cat /proc/self/cgroup
> 0::/system.slice/slurmstepd.scope/job_1330/step_0/user/task_0

In your comment#5 output, each numbered entry (e.g. "9:devices:/") is a separately mounted cgroup v1 hierarchy, while the "0::" entry is the v2 unified hierarchy. Having both kinds active at once is what slurmd reports as hybrid mode.

Please try following the suggestions below:

(In reply to Ben Glines from bug#18359 comment#8)
> We only support legacy mode (cgroup v1) and unified mode (cgroup v2), and
> not hybrid setups [3] as you have noticed.
> 
> To disable hybrid mode and only enable cgroup v2 (this is the mode that we
> recommend), you'll need to add "cgroup_no_v1=all" to your kernel command
> line. Depending on your setup, this can be added to GRUB_CMDLINE_LINUX="" in
> /etc/default/grub. Run `sudo update-grub` after making the change, and then
> reboot.
> 
> Check the cgroup mount points to ensure that only cgroup v2 is enabled. You
> should see something like the following:
> > $ mount | grep cgroup
> > cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
> 
> Let me know if you have any questions about this.
> 
> [1] https://slurm.schedmd.com/slurm.conf.html#OPT_task/affinity
> [2] https://slurm.schedmd.com/cgroups.html#task
> [3] https://slurm.schedmd.com/cgroups.html#overview
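
One note for your setup: AlmaLinux 9 has no `update-grub`. A rough equivalent there (adjust to your boot configuration) is:

> # apply the kernel argument to all installed kernels, then reboot
> grubby --update-kernel=ALL --args="cgroup_no_v1=all"
> reboot
> # afterwards, confirm the argument took effect
> cat /proc/cmdline

or edit GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the config with `grub2-mkconfig -o /boot/grub2/grub.cfg` before rebooting.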
Comment 7 Serguei Mokhov 2024-01-12 16:46:44 MST
(In reply to Nate Rini from comment #6)
> (In reply to Serguei Mokhov from comment #5)
> > (In reply to Nate Rini from comment #4)
> > > (In reply to Serguei Mokhov from comment #3)
> > > > How does it detect "hybrid"?
> > > 
> > > Call:
> > > > cat /proc/self/cgroup
> > 
> > # cat /proc/self/cgroup
> > 9:devices:/
> > 7:freezer:/
> > 6:perf_event:/
> > 5:net_cls,net_prio:/
> > 1:name=systemd:/
> > 0::/user.slice/user-11929.slice/session-377.scope
> > 
> > What here does tell me it is hybrid?
> 
> Cgroup v2 looks like this:
> > srun cat /proc/self/cgroup
> > 0::/system.slice/slurmstepd.scope/job_1330/step_0/user/task_0

Well I have 0::/ in mine too :)
 
> Please try following the suggestions below:

That's the thing...

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

Looks like I have nothing else mounted

# uname -a
Linux filth 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 7 14:54:22 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

"RHEL 9, by default, mounts and uses cgroups-v2."
 
We did not change that in Alma. And AlmaLinux 9 defaults to using cgroup v2 as well.

I'll try GRUB_CMDLINE_LINUX maybe next time I have the opportunity
to reboot the server, but the above analysis so far suggests I don't
have v1..


> (In reply to Ben Glines from bug#18359 comment#8)
> > We only support legacy mode (cgroup v1) and unified mode (cgroup v2), and
> > not hybrid setups [3] as you have noticed.
> > 
> > To disable hybrid mode and only enable cgroup v2 (this is the mode that we
> > recommend), you'll need to add "cgroup_no_v1=all" to your kernel command
> > line. Depending on your setup, this can be added to GRUB_CMDLINE_LINUX="" in
> > /etc/default/grub. Run `sudo update-grub` after making the change, and then
> > reboot.
> > 
> > Check the cgroup mount points to ensure that only cgroup v2 is enabled. You
> > should see something like the following:
> > > $ mount | grep cgroup
> > > cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
> > 
> > Let me know if you have any questions about this.
> > 
> > [1] https://slurm.schedmd.com/slurm.conf.html#OPT_task/affinity
> > [2] https://slurm.schedmd.com/cgroups.html#task
> > [3] https://slurm.schedmd.com/cgroups.html#overview
Comment 8 Serguei Mokhov 2024-01-12 17:19:35 MST
So... I've added cgroup_no_v1=all to my GRUB_CMDLINE_LINUX and rebooted.
The result is virtually the same:

# cat /proc/self/cgroup 
14:freezer:/
9:perf_event:/
8:net_cls,net_prio:/
7:devices:/
1:name=systemd:/
0::/user.slice/user-11929.slice/session-1.scope
# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
# make HOST=node01 bash
docker compose exec node01 /bin/bash
[root@node01 /]# systemctl status slurmd 
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/slurmd.service.d
           └─cluster.conf
           /etc/systemd/system/slurmd.service.d
           └─local.conf
   Active: failed (Result: exit-code) since Fri 2024-01-12 19:10:20 EST; 6min ago
  Process: 121 ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
  Process: 114 ExecStartPre=/usr/local/bin/slurmd.startup.sh (code=exited, status=0/SUCCESS)
  Process: 113 ExecStartPre=/usr/bin/chown slurm:slurm /var/log/slurmd.log (code=exited, status=0/SUCCESS)
  Process: 112 ExecStartPre=/usr/bin/touch /var/log/slurmd.log (code=exited, status=0/SUCCESS)
 Main PID: 121 (code=exited, status=1/FAILURE)

Jan 12 19:10:20 node01 slurmd[121]: slurmd: error: ProctrackType 1 specified more than once, latest value used
Jan 12 19:10:20 node01 slurmd[121]: slurmd: debug:  Log file re-opened
Jan 12 19:10:20 node01 slurmd[121]: slurmd: debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
Jan 12 19:10:20 node01 slurmd[121]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 14:freezer:/
Jan 12 19:10:20 node01 slurmd[121]: 9:perf_event:/
Jan 12 19:10:20 node01 slurmd[121]: 8:net_cls,net_prio:/
Jan 12 19:10:20 node01 slurmd[121]: 7:devices:/
Jan 12 19:10:20 node01 slurmd[121]: 1:name=systemd:/
Jan 12 19:10:20 node01 slurmd[121]: 0::/docker.slice/docker-4b68f66a75745866e007e7e21c61bac490d250d59925f6eb30246abf2dc22e3c.scope/init.scope
[root@node01 /]# cat /proc/self/cgroup
14:freezer:/
9:perf_event:/
8:net_cls,net_prio:/
7:devices:/
1:name=systemd:/
0::/docker.slice/docker-4b68f66a75745866e007e7e21c61bac490d250d59925f6eb30246abf2dc22e3c.scope/init.scope
[root@node01 /]# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (ro,relatime)
cgroup2 on /sys/fs/cgroup/docker.slice type cgroup2 (rw,nosuid,nodev,noexec,relatime)
[root@node01 /]#
Comment 9 Nate Rini 2024-01-24 15:38:54 MST
(In reply to Serguei Mokhov from comment #8)
> So... I've added cgroup_no_v1=all to my GRUB_CMDLINE_LINUX and rebooted.
> The result is virtually the same:

Is this after a reboot?
Comment 10 Nate Rini 2024-01-24 15:43:45 MST
(In reply to Nate Rini from comment #9)
> (In reply to Serguei Mokhov from comment #8)
> > So... I've added cgroup_no_v1=all to my GRUB_CMDLINE_LINUX and rebooted.
> > The result is virtually the same:
> 
> Is this after a reboot?

Please try the suggestions here too: https://slurm.schedmd.com/faq.html#cgroupv2

The output in comment#7 is clearly a hybrid cgroup mount setup.
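
Since the compute nodes here are Docker containers, it is also worth checking what the container engine itself reports (generic check, not specific to this deployment):

> docker info | grep -i cgroup

The "Cgroup Driver" / "Cgroup Version" lines show what mode Docker is operating in; a runtime still working in v1 mode can keep v1 hierarchies active, and those then show up in /proc/self/cgroup inside the containers even when /sys/fs/cgroup on the host only has cgroup2 mounted.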
Comment 11 Nate Rini 2024-01-31 09:26:58 MST
Any chance to try the suggestion in comment#10?
Comment 12 Nate Rini 2024-02-02 13:07:04 MST
It's been more than a week since comment#10. Please respond when convenient, and the ticket will automatically re-open.