Bug 14483

Summary: sbatch error with SPANK plugin: Plugin file not found
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Configuration
Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED FIXED
QA Contact: Ben Roberts <ben>
Severity: 4 - Minor Issue
Priority: ---
CC: marshall
Version: 21.08.8
Hardware: Linux
OS: Linux
Site: DTU Physics
Version Fixed: 23.02

Description Ole.H.Nielsen@fysik.dtu.dk 2022-07-06 05:51:50 MDT
The SPANK manual https://slurm.schedmd.com/spank.html implicitly indicates that the SPANK shared libraries need to be present only on the slurmd nodes.  It mentions no requirement for SPANK libraries on login or slurmctld nodes.

On our test cluster (running AlmaLinux 8.6) we have configured the SPANK plugin https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir.  
The auto_tmpdir RPM package installs the following shared library:

$ rpm -ql auto_tmpdir
/usr/lib/.build-id
/usr/lib/.build-id/3b
/usr/lib/.build-id/3b/39af89648c9715355685e31357c707e3ec571c
/usr/lib64/slurm/auto_tmpdir.so

We have created a new plugstack.conf file in /etc/slurm on all nodes in the test cluster:

$ cat /etc/slurm/plugstack.conf 
#
# SLURM plugin stack configuration
#
# req/opt   plugin                  arguments
# ~~~~~~~   ~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
required    auto_tmpdir.so          mount=/tmp mount=/var/tmp

This SPANK plugin works as expected in our tests.

However, when I remove the auto_tmpdir RPM on the login node, so that the library /usr/lib64/slurm/auto_tmpdir.so no longer exists, then I get an error when submitting a new job from the login node:

$ sbatch job_container.sh
sbatch: error: spank: auto_tmpdir.so: Plugin file not found
sbatch: error: spank: /etc/slurm/plugstack.conf:6: Failed to load plugin auto_tmpdir.so. Aborting.
sbatch: error: Failed to initialize plugin stack

If I reinstall the auto_tmpdir RPM the error goes away.

Questions: 

1. What is the reason for errors from sbatch when the SPANK library is absent?

2. The SPANK manual page does not document any requirement for the SPANK libraries to be installed on login and slurmctld nodes as well.  Can you please add this documentation if such a requirement exists?

Thanks a lot,
Ole
Comment 2 Oriol Vilarrubi 2022-07-06 09:50:55 MDT
Hello Ole,

The SPANK libraries are not needed on the slurmctld node. They are needed only on the compute nodes (for the slurmd and slurmstepd daemons) and on the machines that run the submission commands such as srun, sbatch, etc.

This information is found here: https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS. For the local and allocator contexts, it shows that the plugin is loaded by srun, sbatch, salloc, etc. (those would be the login nodes). For the remote, slurmd and job_script contexts, it states that slurmstepd and slurmd load the plugin, although for slurmd this is not stated explicitly (these would be the compute nodes).

To make things clearer, we will consider adding a note stating that the required SPANK plugins need to be present on the compute nodes as well as on the nodes where the user commands are executed.
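
As a rough illustration of this requirement (not part of Slurm), a small shell function along the following lines could be run on each login and compute node to report plugins that plugstack.conf names but that are missing locally. The function name check_spank_plugins is hypothetical. The default plugin directory /usr/lib64/slurm is an assumption taken from the RPM listing in the description; a PluginDir that lists several directories is not handled, and "include" lines are not followed.

```shell
# Sketch only: list SPANK plugins named in a plugstack.conf that are
# missing on this node. Returns 1 if any plugin file is absent, else 0.
# Usage: check_spank_plugins [plugstack.conf] [plugin_dir]
check_spank_plugins() {
    conf="${1:-/etc/slurm/plugstack.conf}"
    # Assumed plugin directory (from the RPM listing); adjust to the site's PluginDir.
    plugin_dir="${2:-/usr/lib64/slurm}"
    missing=0
    # Each plugin line has the form: required|optional  plugin  [arguments]
    while read -r kind plugin _args || [ -n "$kind" ]; do
        case "$kind" in
            required|optional) ;;
            *) continue ;;   # skip comments, blank lines and "include" lines
        esac
        case "$plugin" in
            /*) path="$plugin" ;;               # fully qualified path: use as given
            *)  path="$plugin_dir/$plugin" ;;   # otherwise look in the plugin directory
        esac
        if [ ! -e "$path" ]; then
            echo "missing $kind plugin: $path"
            missing=1
        fi
    done < "$conf"
    return "$missing"
}
```

With the plugstack.conf shown in the description, calling check_spank_plugins without arguments on the login node where the RPM was removed would be expected to print "missing required plugin: /usr/lib64/slurm/auto_tmpdir.so" and return a non-zero status.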

Also, this sentence in the CONFIGURATION section [https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION] may not make it clear that the user commands will also fail if a required SPANK plugin is not found:
> If a SPANK plugin is required, then failure of any of the plugin's functions will cause slurmd to terminate the job
We will also try to rephrase this sentence to make it clearer that the user commands will also be affected in case of a missing SPANK library.

Regards.
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2022-07-07 00:34:28 MDT
Hi Oriol,

Thanks for the info:

(In reply to Oriol Vilarrubi from comment #2)
> In the slurmctld node the SPANK libraries are not needed, only on the
> compute nodes (slurmd and slurmstepd daemons) and in the machines that will
> execute the various submission commands as srun, sbatch, etc...
> 
> This information is found here
> https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS. In the local and
> allocator context it can be seen that it is loaded by srun, sbatch, salloc
> etc... (those would be the login nodes) and in remote ,slurmd and job_script
> it states that slurmstepd and slurmd load this plugin, even though in slurmd
> it is not specifically said(this would be the compute nodes).
> 
> But we will consider adding a note stating that the required SPANK plugins
> need to be present on the compute nodes as well as in the nodes where the
> user commands will be executed, in order to make things clearer.

Thanks, precise and complete documentation will be much appreciated.

> Also maybe this sentence (in CONFIGURATION section
> [https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION]) was not clear
> that it will also make the user commands fail if a required SPANK plugin is
> not found:
> > If a SPANK plugin is required, then failure of any of the plugin's functions will cause slurmd to terminate the job
> We will also try to rephrase this sentence to make it clearer that the user
> commands will also be affected in case of a missing SPANK library.

Yes, this could also do with a bit of clarification so that sites don't make the same mistake that I did.

Will you update me when you have decided on improved documentation?

Best regards,
Ole
Comment 4 Oriol Vilarrubi 2022-07-07 02:05:03 MDT
Hi Ole

Yes, I will discuss this internally and I'll come back to you.

Regards.
Comment 7 Oriol Vilarrubi 2022-07-07 10:13:58 MDT
Hello Ole,

We've modified the documentation to include a note about where the SPANK plugins are needed. We've also rephrased the sentence we talked about in Comment 2 so that it also refers to the job allocation commands.

You can see it in commit bfad62d1 [1]

I'm closing this bug as fixed. Do not hesitate to reopen it if needed.

Regards.

[1] https://github.com/SchedMD/slurm/commit/bfad62d1