Hi guys,

I have a new Slurm 17.02.1 cluster running on CentOS 7.3. I have set up the head node, but I can't seem to start the slurmd daemons on the clients. They complain about cgroups, and I don't see any additional cgroup-specific configuration that I need to do to fix this. Can you please help?

-- from the logs:
..
Apr 16 16:02:19 amber301 systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
Apr 16 16:02:19 amber301 systemd: Starting Slurm node daemon...
Apr 16 16:02:19 amber301 slurmd[5457]: error: You are using cons_res or gang scheduling with Fastschedule=0 and node configuration differs from hardware. The node configuration used will be what is in the slurm.conf because of the bitmaps the slurmctld must create before the slurmd registers.#012 CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=12:12(hw) ThreadsPerCore=1:2(hw)
Apr 16 16:02:19 amber301 slurmd[5457]: Message aggregation disabled
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 0 is device number 0
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 1 is device number 1
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 2 is device number 2
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 3 is device number 3
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 4 is device number 4
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 5 is device number 5
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 6 is device number 6
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 7 is device number 7
Apr 16 16:02:19 amber301 slurmd[5457]: Resource spec: Reserved system memory limit not configured for this node
Apr 16 16:02:19 amber301 slurmd[5457]: error: cgroup namespace 'freezer' not mounted. aborting
Apr 16 16:02:19 amber301 slurmd[5457]: error: unable to create freezer cgroup namespace
Apr 16 16:02:19 amber301 slurmd[5457]: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
Apr 16 16:02:19 amber301 slurmd[5457]: error: cannot create proctrack context for proctrack/cgroup
Apr 16 16:02:19 amber301 systemd: slurmd.service: control process exited, code=exited status=1
Apr 16 16:02:19 amber301 slurmd[5457]: error: slurmd initialization failed
Apr 16 16:02:19 amber301 systemd: Failed to start Slurm node daemon.
Apr 16 16:02:19 amber301 systemd: Unit slurmd.service entered failed state.
Apr 16 16:02:19 amber301 systemd: slurmd.service failed.
..

[root@amber301 log]# rpm -qa | grep -i slurm
slurm-contribs-17.02.1-2.el7.centos.x86_64
slurm-perlapi-17.02.1-2.el7.centos.x86_64
slurm-devel-17.02.1-2.el7.centos.x86_64
slurm-torque-17.02.1-2.el7.centos.x86_64
slurm-slurmdbd-17.02.1-2.el7.centos.x86_64
slurm-pam_slurm-17.02.1-2.el7.centos.x86_64
slurm-openlava-17.02.1-2.el7.centos.x86_64
slurm-17.02.1-2.el7.centos.x86_64
slurm-munge-17.02.1-2.el7.centos.x86_64
slurm-plugins-17.02.1-2.el7.centos.x86_64
slurm-sql-17.02.1-2.el7.centos.x86_64

Thanks,
-Simran
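(Aside on the hardware-mismatch warning in the log above: with FastSchedule=0 and select/cons_res, slurmd compares the NodeName definition in slurm.conf against the hardware it detects, and here they differ (CPUs=24 vs 48, ThreadsPerCore=1 vs 2). If the node definition is meant to match the hardware, one way to reconcile them is to run slurmd in its "print configuration" mode on a compute node and copy the reported line into slurm.conf. The output below is illustrative, not captured from this system:

--
[root@amber301 ~]# slurmd -C
NodeName=amber301 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=128656
--

Keeping CPUs=24 / ThreadsPerCore=1 is also a valid choice if the intent is to schedule only physical cores and ignore hyperthreads; in that case setting FastSchedule=1 silences the warning by telling Slurm to trust the values in slurm.conf rather than the detected hardware.)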
Also, here is my slurm.conf:
--
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP
--

Thanks,
-Simran
I also see this on the client:
--
[root@amber301 log]# ls -l /sys/fs/cgroup/freezer
total 0
-rw-r--r-- 1 root root 0 Apr 14 11:58 cgroup.clone_children
--w--w--w- 1 root root 0 Apr 14 11:58 cgroup.event_control
-rw-r--r-- 1 root root 0 Apr 14 11:58 cgroup.procs
-r--r--r-- 1 root root 0 Apr 14 11:58 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Apr 14 11:58 notify_on_release
-rw-r--r-- 1 root root 0 Apr 14 11:58 release_agent
-rw-r--r-- 1 root root 0 Apr 14 11:58 tasks
--

Regards,
-Simran
Figured it out. It looks like this was because I did not have the cgroup.conf and cgroup_allowed_devices_file.conf files configured. After putting these two files in place I am now able to start slurmd on the clients.
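(Note: cgroup.conf and the allowed-devices file are read by slurmd, so they need to exist on every compute node in the same directory as slurm.conf, and slurmd has to be restarted to pick them up. A sketch, assuming the nodes are amber301-314 and the config directory is /etc/slurm as in the files shown below:

--
[root@amber600 ~]# for n in $(seq 301 314); do scp /etc/slurm/cgroup.conf /etc/slurm/cgroup_allowed_devices_file.conf amber$n:/etc/slurm/; done
[root@amber600 ~]# for n in $(seq 301 314); do ssh amber$n systemctl restart slurmd; done
--
)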
This is how I am setting up my cgroup.conf and slurm.conf. Can you please let me know if this makes sense or if I am missing something here? I have not used cgroups with Slurm before, so I would like to make sure:
--
[root@amber600 slurm]# cat cgroup.conf | grep -v '#'
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

[root@amber600 slurm]# cat cgroup_allowed_devices_file.conf | grep -v '#'
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

[root@amber600 slurm]# cat slurm.conf | grep -v '#'
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP
--

Thanks,
-Simran
Glad to hear you were able to work this out over the weekend. I'm responding to some of the config entries below.

(In reply to Simran from comment #4)
> This is how I am setting up my cgroup.conf and slurm.conf. Can you please
> let me know if this makes sense or if I am missing something here? I have
> not used cgroups with Slurm before, so I would like to make sure:
>
> --
> [root@amber600 slurm]# cat cgroup.conf | grep -v '#'
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"

You don't need this set if ConstrainDevices is not enabled.

> ConstrainCores=no

If you're managing the cluster based on CPU allocations, then we usually recommend enabling this; it forces jobs to run only on the CPUs explicitly assigned to them. Otherwise they may run on whichever cores the Linux kernel happens to schedule them on, and they can use more than their share of the CPUs.

> TaskAffinity=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=no

I'd encourage you to enable this - otherwise a job that has run out of RAMSpace will automatically start using swap space instead, which will usually slow it down considerably. I personally prefer to have jobs die sooner for exceeding their memory allocation rather than limp along in swap.

> ConstrainDevices=no
> AllowedRamSpace=100
> AllowedSwapSpace=0
> MaxRAMPercent=100
> MaxSwapPercent=100
> MinRAMSpace=30
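Putting those suggestions together, a cgroup.conf along these lines would be a reasonable starting point (just a sketch; the percentages and MinRAMSpace are unchanged from your file and should be tuned for your site):

--
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
--

With ConstrainSwapSpace=yes and AllowedSwapSpace=0, a job that exceeds its memory allocation is killed rather than allowed to spill into swap, which is the behaviour described above.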
Marking this as resolved/infogiven. Please reopen if there is anything further I can address.

cheers,
- Tim
Hi,

Initially I had the same issue as the one reported here by @Simran. After adding "cgroup.conf" and "cgroup_allowed_devices_file.conf" that issue was resolved.

I have one server which I want to use as both the controller node and the compute node. However, I am facing a new issue:
--------------------------------
(base) saleh@compute-node-22:~$ sinfo -Nel
Wed Mar 13 12:14:38 2024
NODELIST        NODES    PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute-node-22     1 single-node* down    56 2:14:2      1        0      1 (null)   Not responding
(base) saleh@compute-node-22:~$
--------

Basically the STATE is "down" and the compute node (I have only 1 node) is not working.

Here are the configs that I used:
-------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/slurm.conf
# Slurm configuration file for single-node setup

# Control machine (Slurm controller node)
ControlMachine=compute-node-22

# Compute node definition
NodeName=compute-node-22 CPUs=56 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN

# Partition configuration (single partition with all resources)
PartitionName=single-node Nodes=compute-node-22 Default=YES MaxTime=INFINITE State=UP

ClusterName=MyCluster

# Proctrack and Task plugin configuration
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
(base) saleh@compute-node-22:~$
------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/slurmd.conf
# Check Slurm compute node configuration
# Slurmd Configuration File
Name=compute-node-22
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=3
(base) saleh@compute-node-22:~$
----------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/cgroup.conf
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
(base) saleh@compute-node-22:~$
---------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
(base) saleh@compute-node-22:~$
------------------------------------
(base) saleh@compute-node-22:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
(base) saleh@compute-node-22:~$
-------------------------------
(base) saleh@compute-node-22:~$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          56
On-line CPU(s) list:             0-55
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:                        1
CPU MHz:                         1200.549
CPU max MHz:                     2400.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4799.97
Virtualization:                  VT-x
L1d cache:                       896 KiB
L1i cache:                       896 KiB
L2 cache:                        7 MiB
L3 cache:                        70 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
(base) saleh@compute-node-22:~$
------------------------------------------------------------------
(base) saleh@compute-node-22:~$ sudo systemctl status slurmctld
[sudo] password for saleh:
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-13 12:00:47 UTC; 27min ago
       Docs: man:slurmctld(8)
    Process: 3695708 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 3695710 (slurmctld)
      Tasks: 7
     Memory: 4.0M
     CGroup: /system.slice/slurmctld.service
             └─3695710 /usr/sbin/slurmctld

Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Down nodes: compute-node-22
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered JobId=4 Assoc=0
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered JobId=5 Assoc=0
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered information about 2 jobs
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered state of 0 reservations
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: _preserve_plugins: backup_controller not specified
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Running as primary controller
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: No parameter for mcs plugin, default values set
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: mcs: MCSParameters = (null). ondemand set.
Mar 13 12:01:47 compute-node-22 slurmctld[3695710]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
(base) saleh@compute-node-22:~$
-----------------------------------------------------
(base) saleh@compute-node-22:~$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-13 12:00:53 UTC; 28min ago
       Docs: man:slurmd(8)
    Process: 3695728 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 3695731 (slurmd)
      Tasks: 1
     Memory: 1.7M
     CGroup: /system.slice/slurmd.service
             └─3695731 /usr/sbin/slurmd

Mar 13 12:00:53 compute-node-22 systemd[1]: Starting Slurm node daemon...
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695728]: Message aggregation disabled
Mar 13 12:00:53 compute-node-22 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: slurmd version 19.05.5 started
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: slurmd started on Wed, 13 Mar 2024 12:00:53 +0000
Mar 13 12:00:53 compute-node-22 systemd[1]: Started Slurm node daemon.
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: CPUs=56 Boards=1 Sockets=2 Cores=14 Threads=2 Memory=1031768 TmpDisk=749589 Uptime=4224271 CPUSpecList=(null) FeaturesAvail=(null) Feature>
(base) saleh@compute-node-22:~$
------------------------------------------------

That was the info. Please let me know how I can resolve this issue.

Best regards,
Saleh
(In reply to Saleh from comment #7)
> Initially I had the same issue as the one reported here by @Simran. After
> adding "cgroup.conf" and "cgroup_allowed_devices_file.conf" that issue was
> resolved.
> [...]
> Please let me know how I can resolve this issue.

Here is the info about the version:
-------------------------------------
(base) saleh@compute-node-22:~$ sinfo -V
slurm-wlm 19.05.5
(base) saleh@compute-node-22:~$
-------------------------------------
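(A likely next step, since slurmctld logged "Down nodes: compute-node-22" at startup and the slurm.conf shown above does not set ReturnToService (which defaults to 0): a node that was marked down while slurmd was not responding stays down until an administrator resumes it. A sketch of how to check the recorded reason and return the node to service:

-------------------------------------
(base) saleh@compute-node-22:~$ scontrol show node compute-node-22 | grep -i reason
(base) saleh@compute-node-22:~$ sudo scontrol update NodeName=compute-node-22 State=RESUME
(base) saleh@compute-node-22:~$ sinfo -Nel
-------------------------------------

If the node immediately drops back to "down / Not responding", the controller may not be able to reach slurmd under the name compute-node-22; in that case, checking hostname resolution and the slurmd log configured above (/var/log/slurmd.log) would be the next place to look.)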