Bug 2319 - Cannot load auth_munge plugin after upgrade
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 15.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2016-01-08 00:21 MST by David Matthews
Modified: 2016-01-11 01:17 MST (History)
Site: Met Office
Version Fixed: 15.08.7 16.05.0-pre1


Description David Matthews 2016-01-08 00:21:41 MST
We are testing upgrading our test system from 15.08.5 to 15.08.6.
When we upgrade a node, any jobs which are running on the node do not complete successfully.

The slurmd.log file on the node contains lots of entries like this:
[2016-01-08T09:55:24.062+00:00] [23342] /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.062+00:00] [23342] Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.063+00:00] [23342] cannot create auth context for auth/munge
[2016-01-08T09:55:24.063+00:00] [23342] /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.063+00:00] [23342] Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.064+00:00] [23342] cannot create auth context for auth/munge
[2016-01-08T09:55:24.064+00:00] [23342] /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.064+00:00] [23342] Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.065+00:00] [23342] cannot create auth context for auth/munge
[2016-01-08T09:55:24.065+00:00] [23342] Unpacking authentication credential: authentication initialization failure

The job output file is also full of similar messages:
slurmstepd: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
slurmstepd: cannot create auth context for auth/munge

The slurmctld.log file contains lots of entries like this:
[2016-01-08T09:56:48.116+00:00] error: slurm_receive_msg: Zero Bytes were transmitted or received

Eventually we get this in the slurmd.log file:
[2016-01-08T10:57:05.310+00:00] [23342] Unable to send job complete message: Protocol authentication error
[2016-01-08T10:57:08.000+00:00] [23342] done with job

This appears to be a consequence of the change introduced in 15.08 to verify plugin version numbers:
https://github.com/SchedMD/slurm/blob/slurm-15.08/RELEASE_NOTES#L76-L83
This is the relevant code:
https://github.com/SchedMD/slurm/blob/slurm-15.08/src/common/plugin.c#L207-L216
If I understand correctly, slurmstepd is launched with 15.08.5 so then fails to load the 15.08.6 auth_munge plugin after the node is upgraded.
I'm not sure how this is meant to work but I assume this isn't the intended behaviour.
Certainly we didn't have this problem when we tested upgrading from 14.11 to 15.08.
As things stand, with 15.08 it doesn't appear to be possible to safely upgrade nodes whilst jobs are running.
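
The mismatch described above can be modelled with a short sketch. This is illustrative Python, not the C code in src/common/plugin.c. The packing layout (major<<16 | minor<<8 | micro) and the function names are assumptions made for the example; the relaxed major/minor comparison is hypothetical and is shown only for contrast with the exact-match check.

```python
# Illustrative model of the plugin version check; not Slurm's actual code.

def pack_version(major, minor, micro):
    """Pack a version into one integer (assumed layout for this sketch)."""
    return (major << 16) | (minor << 8) | micro


def unpack_version(number):
    """Split a packed version back into (major, minor, micro)."""
    return (number >> 16) & 0xFF, (number >> 8) & 0xFF, number & 0xFF


def strict_check(daemon, plugin):
    """Exact match: the behaviour this report attributes to 15.08."""
    return daemon == plugin


def major_minor_check(daemon, plugin):
    """Hypothetical relaxed check that ignores the micro version."""
    return unpack_version(daemon)[:2] == unpack_version(plugin)[:2]


running_stepd = pack_version(15, 8, 5)    # slurmstepd started before the upgrade
upgraded_plugin = pack_version(15, 8, 6)  # auth_munge.so installed by the upgrade

strict_ok = strict_check(running_stepd, upgraded_plugin)
relaxed_ok = major_minor_check(running_stepd, upgraded_plugin)
shown = "%d.%d.%d" % unpack_version(upgraded_plugin)
print(strict_ok, relaxed_ok, shown)
```

The formatted version has no zero padding, which is consistent with the "(15.8.6)" seen in the slurmd.log entries.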
Comment 1 Tim Wickberg 2016-01-08 02:28:30 MST
Embarrassingly, the comment you've highlighted notes what the fix is. I'll have a patch available shortly, and this'll be included in 15.08.7, which we expect to release towards the end of January.
Comment 2 Tim Wickberg 2016-01-08 03:25:29 MST
Actually, changing the check to compare major/minor only would not help: if I change that check, upgrading between versions would still fail. The real solution is to load the authentication plugin up-front, rather than lazy-loading it when the job completes. slurmstepd only communicates after the job has finished, since it receives the initial job information directly from slurmd at the start.
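
A minimal sketch of the difference between lazy and up-front loading, in Python rather than C. It is not the code from the commit referenced below; the `Stepd` class, the dictionary standing in for the plugin directory, and the version strings are all hypothetical and exist only to show the timing.

```python
# Toy model: the plugin directory is shared and changes during an upgrade.

class IncompatiblePlugin(Exception):
    pass


class Stepd:
    def __init__(self, version, plugin_dir, eager):
        self.version = version
        self.plugin_dir = plugin_dir  # stands in for the on-disk plugin directory
        self.auth = None
        if eager:
            # Up-front loading: resolve the plugin while versions still match.
            self.auth = self._load("auth_munge")

    def _load(self, name):
        plugin_version = self.plugin_dir[name]
        if plugin_version != self.version:
            raise IncompatiblePlugin(
                "%s: Incompatible plugin version (%s)" % (name, plugin_version))
        return (name, plugin_version)

    def send_job_complete(self):
        # Lazy loading: the first use of the plugin happens at job completion.
        if self.auth is None:
            self.auth = self._load("auth_munge")
        return "job complete sent using %s %s" % self.auth


plugin_dir = {"auth_munge": "15.08.5"}
lazy = Stepd("15.08.5", plugin_dir, eager=False)
eager = Stepd("15.08.5", plugin_dir, eager=True)

plugin_dir["auth_munge"] = "15.08.6"  # node upgraded while the job is running

eager_result = eager.send_job_complete()
try:
    lazy.send_job_complete()
    lazy_error = None
except IncompatiblePlugin as exc:
    lazy_error = str(exc)

print(eager_result)
print(lazy_error)
```

In this model the eagerly loaded step keeps using the plugin it loaded at start, while the lazily loaded one first touches the plugin directory after the upgrade and is rejected.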

Commit 870273ca1499 fixes this, and will be in 15.08.7 due out in a few weeks. If you want to apply this patch now you can download it here:
https://github.com/SchedMD/slurm/commit/870273ca1499.patch

While reproducing this I noticed that slurmstepd does not clean up properly - you may want to check for stray slurmstepd processes on your test nodes and kill them manually.

- Tim
Comment 3 David Matthews 2016-01-10 23:43:07 MST
Tim - thanks for the fix.

As far as I can see, slurmstepd exited after it timed out (~1 hour after job completion - see the slurmd.log messages I reported), so I didn't have to do any clean-up.

Do you plan on putting out something on the mailing list? I assume anyone who is already running 15.08 is going to hit this when they next upgrade unless they patch their existing release first.
Comment 4 Tim Wickberg 2016-01-11 01:17:06 MST
(In reply to David Matthews from comment #3)
> Tim - thanks for the fix.

Certainly, that's what we're here for. I hope that didn't cause too many issues.

> Do you plan on putting out something on the mailing list? I assume anyone
> who is already running 15.08 is going to hit this when they next upgrade
> unless they patch their existing release first.

We'll note the upgrade issue on the next point release, and may have to warn about it on the next major release as well.

I don't think upgrading the installation while the node is live, as you've done, is especially common though - a lot of sites tend to upgrade it in a node image and reimage the node during maintenance, or have it installed on a central NFS mount and maintain a "current" symlink to flip between releases (which would still have allowed slurmstepd to dlopen() the correct version from its own install directory).
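
The "current" symlink layout can be sketched as follows. This is an illustration under assumptions, not a description of any site's setup: the directory names are made up, the plugin files are plain text placeholders rather than shared objects, and the sketch assumes a running process refers to its release directory by resolved path rather than through the symlink.

```python
import os
import tempfile

# Two side-by-side release directories under a temporary root (hypothetical layout).
root = tempfile.mkdtemp()
for release in ("15.08.5", "15.08.6"):
    os.makedirs(os.path.join(root, release, "lib"))
    with open(os.path.join(root, release, "lib", "auth_munge.so"), "w") as f:
        f.write(release)  # placeholder content, not a real plugin

current = os.path.join(root, "current")
os.symlink("15.08.5", current)  # relative target inside the same root

# A long-running process resolves its own install directory once, at start.
stepd_home = os.path.realpath(current)

# Flip releases: create the new link under a temporary name, then rename it
# over the old link so "current" never points at a half-updated state.
tmp_link = os.path.join(root, "current.new")
os.symlink("15.08.6", tmp_link)
os.replace(tmp_link, current)


def plugin_release(install_dir):
    """Return the release recorded in the placeholder plugin file."""
    with open(os.path.join(install_dir, "lib", "auth_munge.so")) as f:
        return f.read()


old_process_sees = plugin_release(stepd_home)  # still the pre-flip release
new_process_sees = plugin_release(current)     # follows the flipped symlink
print(old_process_sees, new_process_sees)
```

Because both releases stay on disk, a process started before the flip can keep loading plugins that match its own version, while anything started afterwards picks up the new release.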