Hi guys,

I have a new Slurm 17.02.1 cluster running on CentOS 7.3. I have set up the head node, but I can't seem to start the slurmd daemons on the clients. They complain about cgroups, and I don't see any additional cgroup-specific configuration that I need to do to fix this. Can you please help?

-- from the logs:
..
Apr 16 16:02:19 amber301 systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
Apr 16 16:02:19 amber301 systemd: Starting Slurm node daemon...
Apr 16 16:02:19 amber301 slurmd[5457]: error: You are using cons_res or gang scheduling with Fastschedule=0 and node configuration differs from hardware. The node configuration used will be what is in the slurm.conf because of the bitmaps the slurmctld must create before the slurmd registers.#012 CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=12:12(hw) ThreadsPerCore=1:2(hw)
Apr 16 16:02:19 amber301 slurmd[5457]: Message aggregation disabled
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 0 is device number 0
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 1 is device number 1
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 2 is device number 2
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 3 is device number 3
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 4 is device number 4
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 5 is device number 5
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 6 is device number 6
Apr 16 16:02:19 amber301 slurmd[5457]: gpu 7 is device number 7
Apr 16 16:02:19 amber301 slurmd[5457]: Resource spec: Reserved system memory limit not configured for this node
Apr 16 16:02:19 amber301 slurmd[5457]: error: cgroup namespace 'freezer' not mounted. aborting
Apr 16 16:02:19 amber301 slurmd[5457]: error: unable to create freezer cgroup namespace
Apr 16 16:02:19 amber301 slurmd[5457]: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
Apr 16 16:02:19 amber301 slurmd[5457]: error: cannot create proctrack context for proctrack/cgroup
Apr 16 16:02:19 amber301 systemd: slurmd.service: control process exited, code=exited status=1
Apr 16 16:02:19 amber301 slurmd[5457]: error: slurmd initialization failed
Apr 16 16:02:19 amber301 systemd: Failed to start Slurm node daemon.
Apr 16 16:02:19 amber301 systemd: Unit slurmd.service entered failed state.
Apr 16 16:02:19 amber301 systemd: slurmd.service failed.
..

[root@amber301 log]# rpm -qa | grep -i slurm
slurm-contribs-17.02.1-2.el7.centos.x86_64
slurm-perlapi-17.02.1-2.el7.centos.x86_64
slurm-devel-17.02.1-2.el7.centos.x86_64
slurm-torque-17.02.1-2.el7.centos.x86_64
slurm-slurmdbd-17.02.1-2.el7.centos.x86_64
slurm-pam_slurm-17.02.1-2.el7.centos.x86_64
slurm-openlava-17.02.1-2.el7.centos.x86_64
slurm-17.02.1-2.el7.centos.x86_64
slurm-munge-17.02.1-2.el7.centos.x86_64
slurm-plugins-17.02.1-2.el7.centos.x86_64
slurm-sql-17.02.1-2.el7.centos.x86_64

Thanks,
-Simran
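(Aside on the hardware-mismatch warning in the log above: with FastSchedule=0 and select/cons_res, slurmd compares the NodeName definition in slurm.conf against the hardware it detects, and here they differ (CPUs=24 vs 48, ThreadsPerCore=1 vs 2). If the node definition is meant to match the hardware, one way to reconcile them is to run slurmd in its "print configuration" mode on a compute node and copy the reported line into slurm.conf. The output below is illustrative, not captured from this system:

--
[root@amber301 ~]# slurmd -C
NodeName=amber301 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=128656
--

Keeping CPUs=24 / ThreadsPerCore=1 is also a valid choice if the intent is to schedule only physical cores and ignore hyperthreads; in that case setting FastSchedule=1 silences the warning by telling Slurm to trust the values in slurm.conf rather than the detected hardware.)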
Also, here is my slurm.conf:
--
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP
--

Thanks,
-Simran
I also see this on the client:
--
[root@amber301 log]# ls -l /sys/fs/cgroup/freezer
total 0
-rw-r--r-- 1 root root 0 Apr 14 11:58 cgroup.clone_children
--w--w--w- 1 root root 0 Apr 14 11:58 cgroup.event_control
-rw-r--r-- 1 root root 0 Apr 14 11:58 cgroup.procs
-r--r--r-- 1 root root 0 Apr 14 11:58 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Apr 14 11:58 notify_on_release
-rw-r--r-- 1 root root 0 Apr 14 11:58 release_agent
-rw-r--r-- 1 root root 0 Apr 14 11:58 tasks
--

Regards,
-Simran
Figured it out. It looks like this was because I did not have the cgroup.conf and cgroup_allowed_devices_file.conf files configured. After putting these two files in place I am now able to start slurmd on the clients.
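(Note: cgroup.conf and the allowed-devices file are read by slurmd, so they need to exist on every compute node in the same directory as slurm.conf, and slurmd has to be restarted to pick them up. A sketch, assuming the nodes are amber301-314 and the config directory is /etc/slurm as in the files shown below:

--
[root@amber600 ~]# for n in $(seq 301 314); do scp /etc/slurm/cgroup.conf /etc/slurm/cgroup_allowed_devices_file.conf amber$n:/etc/slurm/; done
[root@amber600 ~]# for n in $(seq 301 314); do ssh amber$n systemctl restart slurmd; done
--
)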
This is how I am setting up my cgroup.conf and slurm.conf. Can you please let me know if this makes sense or if I am missing something here? I have not used cgroups with Slurm before, so I would like to make sure:
--
[root@amber600 slurm]# cat cgroup.conf | grep -v '#'
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

[root@amber600 slurm]# cat cgroup_allowed_devices_file.conf | grep -v '#'
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

[root@amber600 slurm]# cat slurm.conf | grep -v '#'
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP
--

Thanks,
-Simran
Glad to hear you were able to work this out over the weekend. I'm responding to some of the config entries below.

(In reply to Simran from comment #4)
> This is how I am setting up my cgroup.conf and slurm.conf. Can you please
> let me know if this makes sense or if I am missing something here? I have
> not used cgroups with Slurm before, so I would like to make sure:
>
> --
> [root@amber600 slurm]# cat cgroup.conf | grep -v '#'
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"

You don't need this set if ConstrainDevices is not enabled.

> ConstrainCores=no

If you're managing the cluster based on CPU allocations, then we usually recommend enabling this; it forces jobs to run only on the CPUs explicitly assigned to them. Otherwise they may run on whichever cores the Linux kernel happens to schedule them on, and they can use more than their share of the CPUs.

> TaskAffinity=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=no

I'd encourage you to enable this - otherwise a job that has run out of RAMSpace will automatically start using swap space instead, which will usually slow it down considerably. I personally prefer to have jobs die sooner for exceeding their memory allocation rather than limp along in swap.

> ConstrainDevices=no
> AllowedRamSpace=100
> AllowedSwapSpace=0
> MaxRAMPercent=100
> MaxSwapPercent=100
> MinRAMSpace=30
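Putting those suggestions together, a cgroup.conf along these lines would be a reasonable starting point (just a sketch; the percentages and MinRAMSpace are unchanged from your file and should be tuned for your site):

--
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
--

With ConstrainSwapSpace=yes and AllowedSwapSpace=0, a job that exceeds its memory allocation is killed rather than allowed to spill into swap, which is the behaviour described above.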
Marking this as resolved/infogiven. Please reopen if there is anything further I can address.

cheers,
- Tim
Hi,

Initially I had the same issue as the one reported here by @Simran. After adding "cgroup.conf" and "cgroup_allowed_devices_file.conf" that issue was resolved.

I have one server which I want to use as both the controller node and the compute node. However, I am facing a new issue:
--------------------------------
(base) saleh@compute-node-22:~$ sinfo -Nel
Wed Mar 13 12:14:38 2024
NODELIST        NODES    PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute-node-22     1 single-node* down    56 2:14:2      1        0      1 (null)   Not responding
(base) saleh@compute-node-22:~$
--------

Basically the STATE is "down" and the compute node (I have only 1 node) is not working.

Here are the configs that I used:
-------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/slurm.conf
# Slurm configuration file for single-node setup

# Control machine (Slurm controller node)
ControlMachine=compute-node-22

# Compute node definition
NodeName=compute-node-22 CPUs=56 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN

# Partition configuration (single partition with all resources)
PartitionName=single-node Nodes=compute-node-22 Default=YES MaxTime=INFINITE State=UP

ClusterName=MyCluster

# Proctrack and Task plugin configuration
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
(base) saleh@compute-node-22:~$
------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/slurmd.conf
# Check Slurm compute node configuration
# Slurmd Configuration File
Name=compute-node-22
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=3
(base) saleh@compute-node-22:~$
----------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/cgroup.conf
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
(base) saleh@compute-node-22:~$
---------------------------------
(base) saleh@compute-node-22:~$ cat /etc/slurm-llnl/cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
(base) saleh@compute-node-22:~$
------------------------------------
(base) saleh@compute-node-22:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
(base) saleh@compute-node-22:~$
-------------------------------
(base) saleh@compute-node-22:~$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          56
On-line CPU(s) list:             0-55
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:                        1
CPU MHz:                         1200.549
CPU max MHz:                     2400.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4799.97
Virtualization:                  VT-x
L1d cache:                       896 KiB
L1i cache:                       896 KiB
L2 cache:                        7 MiB
L3 cache:                        70 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
(base) saleh@compute-node-22:~$
------------------------------------------------------------------
(base) saleh@compute-node-22:~$ sudo systemctl status slurmctld
[sudo] password for saleh:
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-13 12:00:47 UTC; 27min ago
       Docs: man:slurmctld(8)
    Process: 3695708 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 3695710 (slurmctld)
      Tasks: 7
     Memory: 4.0M
     CGroup: /system.slice/slurmctld.service
             └─3695710 /usr/sbin/slurmctld

Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Down nodes: compute-node-22
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered JobId=4 Assoc=0
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered JobId=5 Assoc=0
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered information about 2 jobs
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Recovered state of 0 reservations
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: _preserve_plugins: backup_controller not specified
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: Running as primary controller
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: No parameter for mcs plugin, default values set
Mar 13 12:00:47 compute-node-22 slurmctld[3695710]: mcs: MCSParameters = (null). ondemand set.
Mar 13 12:01:47 compute-node-22 slurmctld[3695710]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
(base) saleh@compute-node-22:~$
-----------------------------------------------------
(base) saleh@compute-node-22:~$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-13 12:00:53 UTC; 28min ago
       Docs: man:slurmd(8)
    Process: 3695728 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 3695731 (slurmd)
      Tasks: 1
     Memory: 1.7M
     CGroup: /system.slice/slurmd.service
             └─3695731 /usr/sbin/slurmd

Mar 13 12:00:53 compute-node-22 systemd[1]: Starting Slurm node daemon...
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695728]: Message aggregation disabled
Mar 13 12:00:53 compute-node-22 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: slurmd version 19.05.5 started
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: slurmd started on Wed, 13 Mar 2024 12:00:53 +0000
Mar 13 12:00:53 compute-node-22 systemd[1]: Started Slurm node daemon.
Mar 13 12:00:53 compute-node-22 slurmd-compute-node-22[3695731]: CPUs=56 Boards=1 Sockets=2 Cores=14 Threads=2 Memory=1031768 TmpDisk=749589 Uptime=4224271 CPUSpecList=(null) FeaturesAvail=(null) Feature>
(base) saleh@compute-node-22:~$
------------------------------------------------

That was the info. Please let me know how I can resolve this issue.

Best regards,
Saleh
(In reply to Saleh from comment #7)
> Initially I had the same issue as the one reported here by @Simran. After
> adding "cgroup.conf" and "cgroup_allowed_devices_file.conf" that issue was
> resolved.
> [...]
> Please let me know how I can resolve this issue.

Here is the info about the version:
-------------------------------------
(base) saleh@compute-node-22:~$ sinfo -V
slurm-wlm 19.05.5
(base) saleh@compute-node-22:~$
-------------------------------------
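(A likely next step, since slurmctld logged "Down nodes: compute-node-22" at startup and the slurm.conf shown above does not set ReturnToService (which defaults to 0): a node that was marked down while slurmd was not responding stays down until an administrator resumes it. A sketch of how to check the recorded reason and return the node to service:

-------------------------------------
(base) saleh@compute-node-22:~$ scontrol show node compute-node-22 | grep -i reason
(base) saleh@compute-node-22:~$ sudo scontrol update NodeName=compute-node-22 State=RESUME
(base) saleh@compute-node-22:~$ sinfo -Nel
-------------------------------------

If the node immediately drops back to "down / Not responding", the controller may not be able to reach slurmd under the name compute-node-22; in that case, checking hostname resolution and the slurmd log configured above (/var/log/slurmd.log) would be the next place to look.)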