Bug 4627 - mpi/pmix plugin needs to load libpmix symbols dynamically
Summary: mpi/pmix plugin needs to load libpmix symbols dynamically
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Contributions (show other bugs)
Version: 17.11.2
Hardware: Linux Linux
: --- 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-01-13 19:43 MST by Philip Kovacs
Modified: 2019-09-24 12:21 MDT (History)
5 users (show)

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm_pmix_symbols.patch (11.21 KB, patch)
2018-01-13 19:43 MST, Philip Kovacs
Details | Diff
slurm_pmix_symbols.patch (11.23 KB, patch)
2018-01-13 19:53 MST, Philip Kovacs
Details | Diff
harden.tar.bz2 (1.17 KB, application/x-bzip)
2018-01-15 18:43 MST, Philip Kovacs
Details
PMIx plugin patch: fixes symbol issue (1.05 KB, patch)
2018-01-17 09:01 MST, Boris Karasev
Details | Diff
harden.tar.bz2 patch (585 bytes, patch)
2018-01-17 09:02 MST, Boris Karasev
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Philip Kovacs 2018-01-13 19:43:05 MST
Created attachment 5918 [details]
slurm_pmix_symbols.patch

ref: git show c539d34684 
ref: git show 05f239055b

The plugin mpi/pmix was changed with the above commits so that it uses dlopen on  libpmix.so instead of directly linking to it.

The problem is that the pmix symbols are still referenced in the plugin code as if the library was directly linked to mpi_pmix.so.   This renders the mpi_pmix.so plugin unloadable in hardened environments since the pmix symbols are only available after the dlopen occurs.  

The proper way to handle this is to dlopen libpmix and then dlsym each symbol needed.

Patch attached.
Comment 1 Philip Kovacs 2018-01-13 19:53:12 MST
Created attachment 5919 [details]
slurm_pmix_symbols.patch
Comment 2 Philip Kovacs 2018-01-15 18:43:27 MST
Created attachment 5922 [details]
harden.tar.bz2

Also attached is a small sample program I wrote to demonstrate the issue w.r.t. hardened linker flags.
Comment 3 Artem Polyakov 2018-01-16 14:32:58 MST
Philip,
We talked to Danny and we think we have a better solution for this. Need to check it and we will update shortly.
The main reason of dlopen’ing Pmix library is that we need to make sure that symbols in it have global visibility. This is important for Pmix components to work.
Danny says they were solving it by linking directly and then additionally dlopening with global flag.
If this will work as expected the fix would be much simpler: just change compile flags to link against Pmix.

One question: can you clarify term hardened environments?
Comment 4 Philip Kovacs 2018-01-16 15:12:23 MST
I attached a sample program yesterday (harden.tar.bz) that very clearly documents the hardened environment problem in simplest terms.  Please take a moment look at it.  

tar -xjf harden.tar.bz2
cd harden
make
./client

Essentially, LDFLAGS=-Wl,z,relro,-z,now will disallow any attempt to dlopen using RTDL_LAZY and instead use RTDL_NOW.

You should verify that plugins are built in this manner with 

readelf -d <plugin.so> | grep BIND_NOW

If you see BIND_NOW, the plugin cannot be loaded with unresolved symbols.

I hope that you will look at this problem generally and take steps toward hardening slurm.  

On a different bug I supplied an extensive patch to weaken certain symbols that exist in one context but not another.  

For example, both srun and slurmd load the select plugins, but only one of the two executables resolves all of the symbols in certain select plugin.
Comment 5 Artem Polyakov 2018-01-16 15:39:19 MST
Thank you Philip.
We will carefully check all of the info you provided and reply ASAP (I’m traveling now so haven’t checked tarball from my phone).
We will check with you before applying the changes.
Comment 6 Boris Karasev 2018-01-17 09:01:40 MST
Created attachment 5930 [details]
PMIx plugin patch: fixes symbol issue
Comment 7 Boris Karasev 2018-01-17 09:02:41 MST
Created attachment 5931 [details]
harden.tar.bz2 patch
Comment 8 Boris Karasev 2018-01-17 09:07:08 MST
Philip,
I looked the attached "harden" sample program, and applied to it the fix that mentioned above: direct plugin linking during compile. This fix is resolve the symbols problem for "harden".
Also I reproduced symbols issue for PMIx plugin by built the plugin with "LDFLAGS=-Wl,-z,relro,-z,now". Then applied the similar fix to the plugin, which allowed to avoid the symbol issue for PMIx plugin.
Please see the "harden" patch and PMIx pluging patch in attachment.
Can you confirm those patches are resolve the issue? Please don't forget rerun ./autogen after Slurm patching.
Comment 9 Philip Kovacs 2018-01-17 11:27:02 MST
Boris,

I applied your patch and the mpi/pmix plugin is able to load and run w/hardened flags now.  

$ srun --mpi=list hostname
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
srun: pmix
srun: pmix_v1

$ srun --mpi=pmix hostname
p28vm1

$ readelf -d mpi_pmix_v1.so | grep BIND_NOW
 0x0000000000000018 (BIND_NOW) 


It's a little unusual to have to adjust the visibility of the symbols in libpmix with dlopen and RTLD_GLOBAL.  I suppose libpmix on some systems was built with the local visibility settings and you're forced to change them... 

All good here now.  Thanks much for your help.
Comment 10 Philip Kovacs 2018-01-17 11:32:15 MST
Also,  this bug https://bugs.schedmd.com/show_bug.cgi?id=4361 is the exact same issue (relro, now) for a different plugin.   In that case, a link to libnuma is missing.
Comment 11 Tim Wickberg 2018-01-17 13:47:01 MST
Hey folks -

At the moment, SchedMD has no customers pushing for these hardened linking flags, and we internally have no interest in applying these right now.

Even if we do explore these changes, they are significant enough that they'd need to wait for an upcoming major release for consideration.

If we do tackle this, PMIx is not the only affected library, and work will need to be done to ensure all similarly affected libraries are handled (e.g., UCX, Cray PMI), rather than trying to handle this piecemeal.

I'm retagging this as an unassigned enhancement request, to possibly be explored at a later point in time.

- Tim