Ticket 15909 - Force nvml configuration even when nvml is not available at compile time
Summary: Force nvml configuration even when nvml is not available at compile time
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 23.02.x
Hardware: Linux Linux
Importance: --- C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-01-31 02:49 MST by Gennaro Oliva
Modified: 2023-02-02 15:25 MST

See Also:
Site: -Other-


Attachments
Force nvml configuration without autodetection (761 bytes, text/plain)
2023-01-31 02:49 MST, Gennaro Oliva
slurmd and slurmctld log files when the plugin is present or missing (9.71 KB, application/gzip)
2023-02-02 15:25 MST, Gennaro Oliva

Description Gennaro Oliva 2023-01-31 02:49:31 MST
Created attachment 28650 [details]
Force nvml configuration without autodetection

Hi there,
I have to build Slurm in two separate environments for licensing reasons; the only difference between the two is the availability of libnvml. The plan is to build the main Slurm in the "free" environment and the gpu_nvml.so plugin in the "non-free" environment, and then allow the free build to use the nvml plugin. To this end, I want to remove the HAVE_NVML clause in src/common/gpu.c, as in the attached patch.
Do you see any issues?
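
For illustration only, a minimal sketch of the kind of compile-time guard under discussion. This is not the code from src/common/gpu.c or from the attached patch: pick_gpu_plugin() is a hypothetical helper, and the only assumption carried over is that a HAVE_NVML macro is defined when libnvml is found at configure time.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the plugin selection logic (NOT Slurm's code).
 * With HAVE_NVML undefined, a request for nvml autodetection is rejected at
 * compile time, even if gpu_nvml.so is installed later from another build.
 */
static const char *pick_gpu_plugin(bool autodetect_nvml)
{
	if (autodetect_nvml) {
#ifdef HAVE_NVML
		return "gpu/nvml";
#else
		fprintf(stderr,
			"nvml autodetection requested, but nvml was not available at build time\n");
#endif
	}
	return "gpu/generic";
}
```

In a build without HAVE_NVML, pick_gpu_plugin(true) falls through to "gpu/generic". Removing the #ifdef would make the choice depend only on the configuration, which is what the split build needs.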
Comment 2 Tim Wickberg 2023-01-31 15:37:59 MST
Hey Gennaro -

The one issue I have with removing this is that, if AutoDetect=nvml was set, but the extra package you're splitting the gpu_nvml.so off into hasn't been installed, we'd end up blowing up in a slightly weird spot during the plugin init. That's what these conditionals are trying to protect against... under the assumption that everything came from a single build, not the split-builds you're looking at handling due to the licensing issue.

I'd suggest a stat() get put in to check against the gpu_nvml.so's existence if any change is going to be made here. We'd also want the equivalent changes for oneapi/rsmi for consistency, even though I understand packaging for those is likely not as frequent a request for you at this point.
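
A minimal sketch of what such a stat() check could look like, under stated assumptions: plugin_file_exists() is a hypothetical helper, not an existing Slurm function, and the plugin directory is passed in by the caller rather than read from the Slurm configuration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not Slurm API): return true if
 * <plugin_dir>/<file_name> exists and is a regular file.
 */
static bool plugin_file_exists(const char *plugin_dir, const char *file_name)
{
	char path[4096];
	struct stat st;
	int len;

	len = snprintf(path, sizeof(path), "%s/%s", plugin_dir, file_name);
	if (len < 0 || (size_t) len >= sizeof(path))
		return false;	/* path did not fit in the buffer */

	if (stat(path, &st) != 0)
		return false;	/* missing, or not accessible */

	return S_ISREG(st.st_mode);
}
```

A caller could run plugin_file_exists(dir, "gpu_nvml.so") before selecting the nvml plugin and log a clear message if it returns false, instead of failing later during plugin init. The same helper would cover the oneapi and rsmi plugin files.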

thanks,
- Tim
Comment 3 Gennaro Oliva 2023-02-02 15:24:58 MST
Hi Tim,
thank you very much for your comments.

(In reply to Tim Wickberg from comment #2)
> The one issue I have with removing this is that, if AutoDetect=nvml was set,
> but the extra package you're splitting the gpu_nvml.so off into hasn't been
> installed, we'd end up blowing up in a slightly weird spot during the plugin
> init. That's what these conditionals are trying to protect against... under
> the assumption that everything came from a single build, not the
> split-builds you're looking at handling due to the licensing issue.

As far as I understand from the code, AutoDetect=nvml only retrieves
information from the GPU to check it against what is provided in the
configuration file.

When the plugin is not available, Slurm just fails to create the plugin context.
I don't see the problem as long as the user gets notified, but I'm surely missing something.

> I'd suggest a stat() get put in to check against the gpu_nvml.so's existence
> if any change is going to be made here.

Where do you suggest putting the check for the presence of the plugin?
Changes can be made anywhere in the code.
We don't necessarily have to change only this file.

>  We'd also want the equivalent
> changes for oneapi/rsmi for consistency, even though I understand packaging
> for those is likely not as frequent a request for you at this point.

RSMI can be included in the main release: it is free software.

I'm attaching debug3 output for slurmctld and slurmd when the plugin is present or missing.

I really appreciate the time you are spending on this issue.
Thank you
Comment 4 Gennaro Oliva 2023-02-02 15:25:51 MST
Created attachment 28697 [details]
slurmd and slurmctld log files when the plugin is present or missing