Ticket 15314 - slurmd can't find memory cgroup controller in spite of it being enabled
Summary: slurmd can't find memory cgroup controller in spite of it being enabled
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 22.05.5
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-10-28 22:44 MDT by foufou33
Modified: 2022-11-03 00:45 MDT

See Also:
Site: -Other-
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
patch to add debug output (1.16 KB, patch)
2022-10-28 22:44 MDT, foufou33
more log (4.84 KB, text/x-log)
2022-10-28 22:45 MDT, foufou33
patch (387 bytes, patch)
2022-11-03 00:45 MDT, foufou33

Description foufou33 2022-10-28 22:44:22 MDT
Created attachment 27517 [details]
patch to add debug output

I was experimenting with Slurm 22.05.5.1 and cgroup v2 when I noticed slurmd reporting that the memory cgroup controller is not enabled:

[2022-10-28T23:22:09.434] error: Controller memory is not enabled!
[2022-10-28T23:22:09.434] Resource spec: Reserved abstract CPU IDs: 60-63
[2022-10-28T23:22:09.434] Resource spec: Reserved machine CPU IDs: 30-31,62-63
[2022-10-28T23:22:09.434] error: memory cgroup controller is not available.
[2022-10-28T23:22:09.434] error: Resource spec: unable to initialize system memory cgroup
[2022-10-28T23:22:09.434] error: Resource spec: system cgroup memory limit disabled
[2022-10-28T23:22:09.442] cred/munge: init: Munge credential signature plugin loaded
 

# cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory
#

The controller is listed as enabled, so I added a few debug statements to _get_controllers (src/plugins/cgroup/v2/cgroup_v2.c; see the attached patch).


Partial output:

[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (cpu) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: trying controller: (memory
)
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (freezer)
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (freezer) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (cpuset)


slurmd was comparing 'memory\n' (as read from /sys/fs/cgroup/cgroup.controllers) with 'memory' (as stored in ctl_names), so the match failed.

The trailing \n in the contents of /sys/fs/cgroup/cgroup.controllers should be stripped or ignored.
Comment 1 foufou33 2022-10-28 22:45:09 MDT
Created attachment 27518 [details]
more log

The rest of the log generated by my debug patch.
Comment 2 foufou33 2022-11-03 00:45:18 MDT
Created attachment 27566 [details]
patch

Adding '\n' to the strtok delimiters seems to fix the problem.