Ticket 14578 - slurmd fails when cgroup.conf is missing
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.2
Hardware: Linux Linux
Importance: --- C - Contributions
Assignee: Felip Moll
Reported: 2022-07-20 08:54 MDT by Gennaro Oliva
Modified: 2023-02-09 08:52 MST
Version Fixed: 23.02.0-0pre1


Attachments
patch (391 bytes, text/plain)
2022-07-20 08:54 MDT, Gennaro Oliva
Description Gennaro Oliva 2022-07-20 08:54:11 MDT
Created attachment 25930
patch

slurmd fails to start if cgroup.conf is missing, even when no cgroup plugin is specified in slurm.conf.

This is the output of slurmd -D with debug5 enabled:

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cgroup_v2.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cgroup v2 plugin type:cgroup/v2 version:0x160502
slurmd: error: cannot read (null)/user.slice/user-0.slice/session-10.scope/cgroup.controllers: No such file or directory
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed


This is the slurm.conf file:

ClusterName=cluster
SlurmctldHost=slurmctld
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageHost=slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/linux
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=slurmd CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP

Attached is a quick workaround I added to the Debian package to solve the issue. The patch sets the default cgroup basedir, making slurmd behave as if cgroup.conf existed and were empty.
Comment 3 Felip Moll 2023-02-09 08:52:13 MST
Hi Gennaro,

Setting the defaults is the correct approach, but we wanted to do it for all the default settings, not just one. So we added the following commits, which fix the issue.

|\  
| * 2228ca18a7 Set cgroup.conf defaults even without cgroup.conf
| * e369902e4c Sort variables
| * 741578cd66 Fix function signature
|/

These commits are applied to master and will be in the next 23.02 release, but we decided not to include them in 22.05.
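
For readers following along, the "defaults first, file second" idea can be sketched as below. This is only an illustration and not the code from the commits: the struct, the function names, and the tiny key=value parser are hypothetical, and the real Slurm configuration parser is considerably more involved. CgroupMountpoint and ConstrainCores are real cgroup.conf options, and the defaults used here (/sys/fs/cgroup and no) are assumed from the cgroup.conf documentation. The "(null)" path prefix in the log above is consistent with a path setting that was never given a default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the parsed cgroup.conf settings. */
typedef struct {
	char mountpoint[256];  /* CgroupMountpoint */
	bool constrain_cores;  /* ConstrainCores */
	bool from_file;        /* true if a config file was found and read */
} example_cgroup_conf_t;

/* Step 1: always populate every setting with its built-in default. */
static void example_set_defaults(example_cgroup_conf_t *conf)
{
	assert(conf != NULL);
	memset(conf, 0, sizeof(*conf));
	/* Assumed default mountpoint; see the lead-in. */
	snprintf(conf->mountpoint, sizeof(conf->mountpoint), "%s",
		 "/sys/fs/cgroup");
	conf->constrain_cores = false;
}

/*
 * Step 2: overlay values from the file, if there is one.
 * A missing file is not an error: the defaults from step 1 stay in place,
 * which is the same result as reading an empty file.
 */
static int example_load_conf(example_cgroup_conf_t *conf, const char *path)
{
	char line[512], key[64], val[256];
	FILE *fp;

	example_set_defaults(conf);

	fp = fopen(path, "r");
	if (!fp)
		return 0; /* no file: keep the defaults */

	conf->from_file = true;
	while (fgets(line, sizeof(line), fp)) {
		/* Simplified parsing: only plain "Key=Value" lines are honored. */
		if (sscanf(line, " %63[^= \t]=%255s", key, val) != 2)
			continue;
		if (!strcmp(key, "CgroupMountpoint"))
			snprintf(conf->mountpoint, sizeof(conf->mountpoint),
				 "%s", val);
		else if (!strcmp(key, "ConstrainCores"))
			conf->constrain_cores = !strcmp(val, "yes");
	}
	fclose(fp);
	return 0;
}
```

With this shape, calling example_load_conf() on a path that does not exist still leaves the mountpoint set to the default, so later code never builds a path from an unset string.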

There's more work going on to make the cgroup.conf file optional. We're tracking that in an internal bug, but it will probably be for 23.11.

Having said that, I appreciate your contribution, and I'm closing the bug.

Thanks a lot!