Bug 2443 - slurmd does not start when built in hardened environment
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Contributions
Version: 19.05.3
Hardware: Linux Linux
Importance: --- 5 - Enhancement
Assignee: Unassigned Developer
Duplicates: 2373 2440 8438
Reported: 2016-02-12 07:36 MST by Nenad Vukicevic
Modified: 2020-02-13 08:26 MST

Attachments
slurm elf weak symbols for full relro patch (6.94 KB, patch)
2017-11-11 21:50 MST, Philip Kovacs

Description Nenad Vukicevic 2016-02-12 07:36:24 MST
While building on Fedora 23 I ran into the following error on slurmd start, once I had created RPMs to install (which had a separate issue of its own):

----
Feb 09 18:55:34 dev slurmd[6700]: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/select_cons_res.so): /usr/lib64/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
Feb 09 18:55:34 dev slurmd[6700]: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
Feb 09 18:55:34 dev slurmd[6700]: fatal: Can't find plugin for select/cons_res
----

After some checking it turned out that Fedora 23 started hardening packages by default.

/usr/lib/rpm/redhat/macros
---
%_hardening_cflags      -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
# we don't escape symbols '~', '"', etc. so be careful when changing this
%_hardening_ldflags     -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

# Harden packages by default for Fedora 23:
# https://fedorahosted.org/fesco/ticket/1384 (accepted on 2014-02-11)
%_hardened_build        1
%_hardened_cflags       %{?_hardened_build:%{_hardening_cflags}}
%_hardened_ldflags      %{?_hardened_build:%{_hardening_ldflags}}
---

I think that with this option, all of a plugin's symbols must be resolved at dlopen time, which is not the case for the plugin above.
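
To illustrate the difference, here is a toy Python model of immediate versus lazy binding. It is only a sketch under stated assumptions: symbol tables are plain sets, the real work is done by the dynamic linker, and every symbol name other than powercap_get_cluster_current_cap is hypothetical.

```python
# Toy model only: symbol tables are plain sets and "binding" is a set lookup.
# The export lists below are illustrative assumptions, not real Slurm exports.

class UndefinedSymbol(Exception):
    pass


def load_plugin(host_exports, plugin_needs, bind_now):
    """Simulate dlopen() of a plugin into a host program.

    bind_now=True  models -Wl,-z,now: every needed symbol is checked at load.
    bind_now=False models -Wl,-z,lazy: a function is checked on first call.
    """
    if bind_now:
        missing = sorted(plugin_needs - host_exports)
        if missing:
            raise UndefinedSymbol(f"undefined symbol: {missing[0]}")

    def call(symbol):
        # Lazy binding defers the lookup until the function is actually used.
        if symbol not in host_exports:
            raise UndefinedSymbol(f"undefined symbol: {symbol}")
        return f"called {symbol}"

    return call


SLURMCTLD = {"powercap_get_cluster_current_cap", "hypothetical_common_symbol"}
SLURMD = {"hypothetical_common_symbol"}
SELECT_CONS_RES = {"powercap_get_cluster_current_cap", "hypothetical_common_symbol"}
```

In this model the plugin loads into SLURMD only with bind_now=False, and it keeps working as long as the missing function is never called there.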

I patched the slurm.spec file by adding the following lines at the beginning of the file (there is probably a more correct/elegant way to do this), based on https://fedoraproject.org/wiki/Changes/Harden_All_Packages:

%undefine _hardened_build
%global _hardened_cflags "-Wl,-z,lazy"
%global _hardened_ldflags "-Wl,-z,lazy"

The above worked: slurmd was able to start and I was able to run some programs.  I think that slurmd does not use procedures that have unresolved references, or that slurmd should not try to load those plugins.  On the other hand, slurmctld has the symbol in question defined.
Comment 1 Tim Wickberg 2016-02-19 08:42:42 MST
Updating status flags. I need to test this a bit further internally but this looks like the best approach so far to handle issues around -z,now.
Comment 2 Tim Wickberg 2016-02-19 08:43:44 MST
*** Bug 2373 has been marked as a duplicate of this bug. ***
Comment 3 Tim Wickberg 2016-02-19 08:45:03 MST
*** Bug 2440 has been marked as a duplicate of this bug. ***
Comment 4 Adam Huffman 2016-02-22 08:05:44 MST
Just to add that, as I pointed out in the other bugs, I'm able to build without disabling the hardened build. In my case, I export the CFLAGS and LDFLAGS in the spec file.
Comment 5 Doug Jacobsen 2017-02-15 16:24:48 MST
FYI, I'm seeing this with slurm-17.02.0-0rc1 on Fedora 25.  I haven't looked deeply into it yet, but a vanilla rpmbuild -tb <tarball> on a vanilla Fedora 25 can start slurmctld but not slurmd for this same reason.

[root@dmjdev slurm]# /usr/sbin/slurmd -Dvvv
slurmd: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/select_cons_res.so): /usr/lib64/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
slurmd: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
slurmd: fatal: Can't find plugin for select/cons_res
[root@dmjdev slurm]#
Comment 6 Philip Kovacs 2017-11-11 21:50:56 MST
Created attachment 5545 [details]
slurm elf weak symbols for full relro patch
Comment 7 Philip Kovacs 2017-11-11 21:52:02 MST
I've developed a patch that allows slurm to operate when built with full relro flags: -Wl,-z,relro,-z,now.  These changes allow slurm to meet the hardening standards of certain distros, e.g. Fedora.  The full relro build allows the GOT sections of the ELF binaries to be marked read-only and thus makes the software more secure.  

The idea is to mark as "weak" those plugin symbols which might not be resolved in every context in which the plugin operates.  Since attribute tagging varies from compiler to compiler, I opted to use __GNUC__ and __ELF__ guards around the declarations, so the code is compiled only for gcc on ELF systems.
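
As a rough illustration of why weak tagging helps, here is a toy Python model (not the patch itself, and not the real dynamic linker). It assumes the usual ELF behavior that an undefined weak symbol resolves to address 0 instead of failing the load; the host export lists are illustrative assumptions.

```python
# Toy model only; real binding is done by the dynamic linker.
# None stands in for "address 0", which is what an undefined weak symbol gets.

def bind_now(host_exports, strong_needs, weak_needs):
    """Resolve all plugin symbols at load time, as -z now requires."""
    missing = sorted(strong_needs - host_exports)
    if missing:
        # A strong reference that cannot be resolved fails the whole load.
        raise RuntimeError(f"undefined symbol: {missing[0]}")
    table = {name: f"<addr of {name}>" for name in strong_needs}
    for name in weak_needs:
        # A weak reference the host does not provide is left as None.
        table[name] = f"<addr of {name}>" if name in host_exports else None
    return table


SYMBOL = "powercap_get_cluster_current_cap"
SLURMCTLD = {SYMBOL}   # assumed to define the symbol
SLURMD = set()         # assumed not to define it
```

Note that calling through a None entry would still crash, so in this model the approach is only safe when the host never reaches the code path that uses the weak symbol.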

Each plugin in the patch has a comment of the form:

TEST_CASE: sacct vs slurmctld

meaning that if a full relro build is compiled without the patch, you can expect an immediate dlopen failure in one of the listed programs when using that plugin.  Both programs should operate normally with the patch.  Other programs may be involved -- the test cases are minimal examples only.

The patch is benign in the sense that it introduces no functional changes to the code and merely tags as weak the problematic functions.  There already is similar code for __APPLE__ builds in which certain data is tagged "weak_import". 

I want to mention also that plugins named *cray*, *bluegene*, *alps*, *cncu* are not included in this patch.  I am not able to test on cray or bluegene systems.  

The patch is clean against the slurm-17.02 and slurm-17.11 branches as of today.

Phil
Comment 8 Sergey Meirovich 2018-07-24 13:42:36 MDT
For us on 17.02.10 that patch is needed to get ANSYS v19.0 to work. Not sure why.
Comment 9 Sergey Meirovich 2018-07-24 13:49:45 MDT
Forgot to add. We are on stock RHEL6 - so in our case that is not related to hardening of OS/Slurm itself.
Comment 10 sofya 2018-11-27 17:00:48 MST
Hello. 

We ran into the same bug with slurm 17.11.7 and Ansys 19 on CentOS Linux release 7.3.1611 (Core).

Should we apply this patch or is it better to upgrade to slurm-18.08.3?

Thank you,
Sofya
Comment 11 Philip Kovacs 2018-11-27 17:46:18 MST
The fundamental architectural problem w.r.t. hardening is that you cannot have program A load plugin B if B uses symbols in A.  You need to move those symbols into a third-party area, library C, and then link plugin B to C so that there are no unresolved symbols when it loads into A.  
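
A toy Python sketch of that A/B/C arrangement follows. It is a model only: "LIBCOMMON" is a hypothetical stand-in for library C and is not a real Slurm library name, and the export sets are illustrative assumptions.

```python
# Toy model only. HOSTS plays the role of program A, PLUGIN_B_NEEDS is the
# set of symbols plugin B references, and LIBCOMMON is hypothetical library C.

def unresolved_at_load(plugin_needs, host_exports, linked_libs):
    """Return the symbols a plugin cannot bind at load time in a given host."""
    available = set(host_exports)
    for exports in linked_libs:
        available |= exports
    return sorted(plugin_needs - available)


PLUGIN_B_NEEDS = {"powercap_get_cluster_current_cap"}
HOSTS = {
    "slurmctld": {"powercap_get_cluster_current_cap"},
    "slurmd": set(),
}

# Before: the symbol lives in one program, so B loads cleanly only there.
before = {
    host: unresolved_at_load(PLUGIN_B_NEEDS, exports, [])
    for host, exports in HOSTS.items()
}

# After: the symbol moves into library C and B links against C, so the
# host program no longer matters.
LIBCOMMON = {"powercap_get_cluster_current_cap"}
after = {
    host: unresolved_at_load(PLUGIN_B_NEEDS, set(), [LIBCOMMON])
    for host in HOSTS
}
```

In this model, before lists the symbol as unresolved for slurmd, while after is empty for both hosts.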

I know the slurm devs recognize the problem as I have brought it to their attention on countless other occasions (bugs), but they have not addressed it yet to my knowledge.  Pretty sure 18.x has the same problem.   The fix is to configure and compile slurm with minimal hardening and lazy linkage.  slurm predates many of the modern hardening techniques we use commonly on fedora and rhel and centos.   

This patch was my attempt to allow slurm to be fully hardened by marking as weak the symbols in program A that are required when plugin B loads.   Do not count on this patch working beyond the version I wrote it for.  The proper way to approach this is to rebuild slurm with lesser hardening as I mentioned, until the slurm devs prioritize hardening.
Comment 12 Tim Wickberg 2018-11-27 17:57:49 MST
(In reply to sofya from comment #10)
> Hello. 
> 
> We ran into the same bug with slurm 17.11.7 and Ansys 19 on CentOS Linux
> release 7.3.1611 (Core).
> 
> Should we apply this patch or is it better to upgrade to slurm-18.08.3?

Sofya - can you file a separate issue? I'm having a hard time seeing how this would show up specifically with that combination of CentOS alongside Ansys, and that'd be better tackled as a separate support issue until we're sure it's related.
Comment 13 Tim Wickberg 2018-11-27 18:14:25 MST
(In reply to Philip Kovacs from comment #11)
> I know the slurm devs recognize the problem as I have brought it to their
> attention on countless other occasions (bugs), but they have not addressed
> it yet to my knowledge.  Pretty sure 18.x has the same problem.   The fix is
> to configure and compile slurm with minimal hardening and lazy linkage. 
> slurm predates many of the modern hardening techniques we use commonly on
> fedora and rhel and centos.   

To be clear - "hardening" in this context refers to enabling a set of restrictive linker flags.

My understanding is that this is viewed as "safer" by some security folks insofar as the behavior that is now being disallowed is not a common pattern in most applications, and that behavior can make certain classes of system exploit simpler.

But Slurm isn't most applications, and lazy linking is something Slurm has always relied on within our plugin infrastructure. Use of lazy linking, as Slurm prefers, is not inherently "unsafe" itself, despite protestations from those security folks trying to force this into different packaging systems.

The patch Philip has provided works around this by tagging a number of symbols as weak (thus avoiding some of these complications), but I do not believe it will work on 18.08, and would need to be updated there. It's not something I'm looking to apply upstream at the moment. In my view it's a band-aid around the build environment trying to force overly restrictive linker options upon us, and not a permanent fix.

> This patch was my attempt to allow slurm to be fully hardened by marking as
> weak the symbols in program A that are required when plugin B loads.  
> Do not count on this patch working beyond the version I wrote it for.  The
> proper way to approach this is to rebuild slurm with lesser hardening as I
> mentioned, until the slurm devs prioritize hardening.

Correct. We do not recommend the use of these "hardening" options at this time.
Comment 14 sofya 2018-11-28 11:58:20 MST
Hi Tim,

I opened a separate bug here:

https://bugs.schedmd.com/show_bug.cgi?id=6112

Can you please take a look? It's high impact for us.

Thank you,
Sofya
Comment 15 John Donners 2018-12-19 00:56:43 MST
Hello Sofya,

I can't see or comment on #6112, so I comment here. The problem with Ansys 19.0 is their use of LD_BIND_NOW=1 in combination with srun. You can change the ansys190 script as follows:

```
diff /ansys_inc/v190/ansys/bin/ansys190*
309c309
<                   command="${distcmd} ${extra_mpi_args} -np ${mpinp} ${dansys_script} ${ansysargs}"
---
>                   command="LD_BIND_NOW='' ${distcmd} -genv LD_BIND_NOW=\"$LD_BIND_NOW\" -genvall ${extra_mpi_args} -np ${mpinp} ${dansys_script} ${ansysargs}"
```

and it should work.
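
The same idea, sketched in Python for anyone wrapping the launch in a script instead of editing ansys190. This is a minimal sketch under assumptions: launcher_env is a hypothetical helper name, and it relies on the documented ld.so behavior that only a non-empty LD_BIND_NOW requests immediate binding.

```python
import os


def launcher_env(base_env=None):
    """Return (env_for_launcher, original_ld_bind_now).

    Assumption: the launcher (e.g. srun) must start without immediate
    binding, while the launched application may still want the original
    LD_BIND_NOW value forwarded to it.
    """
    env = dict(os.environ if base_env is None else base_env)
    original = env.get("LD_BIND_NOW", "")
    # An empty string disables immediate binding for the launcher itself.
    env["LD_BIND_NOW"] = ""
    return env, original


# Example with an explicit environment instead of the real one.
env, original = launcher_env({"LD_BIND_NOW": "1", "PATH": "/usr/bin"})
```

The returned env can be passed to subprocess.run(..., env=env); the original value can then be forwarded to the application by whatever mechanism the launcher offers (the diff above uses -genv).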
Comment 16 Jean-Charles 2019-11-14 09:34:36 MST
Hello,

I get the same issue on CentOS 8 with the slurm package slurm-19.05.3-2:

[root@lmaster slurm-19.05.3-2]# slurmd -D -vvvv
slurmd: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/select_cons_res.so): /usr/lib64/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
slurmd: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
slurmd: fatal: Can't find plugin for select/cons_res

 Jean-Charles
Comment 17 Felip Moll 2020-02-05 02:51:28 MST
*** Bug 8438 has been marked as a duplicate of this bug. ***