|Summary:||Cannot load auth_munge plugin after upgrade|
|Product:||Slurm||Reporter:||David Matthews <david.matthews>|
|Component:||slurmstepd||Assignee:||Tim Wickberg <tim>|
|Status:||RESOLVED FIXED||QA Contact:|
|Severity:||4 - Minor Issue|
|Site:||Met Office||Alineos Sites:||---|
|Bull/Atos Sites:||---||Confidential Site:||---|
|Cray Sites:||---||HPCnow Sites:||---|
|HPE Sites:||---||IBM Sites:||---|
|NOAA Site:||---||OCF Sites:||---|
|SFW Sites:||---||SNIC sites:||---|
|Linux Distro:||---||Machine Name:|
|CLE Version:||Version Fixed:||15.08.7 16.05.0-pre1|
Description David Matthews 2016-01-08 00:21:41 MST
We are testing an upgrade of our test system from 15.08.5 to 15.08.6. When we upgrade a node, any jobs running on that node fail to complete. The slurmd.log file on the node fills with entries like this:

[2016-01-08T09:55:24.062+00:00]  /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.062+00:00]  Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.063+00:00]  cannot create auth context for auth/munge
[2016-01-08T09:55:24.063+00:00]  /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.063+00:00]  Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.064+00:00]  cannot create auth context for auth/munge
[2016-01-08T09:55:24.064+00:00]  /usr/lib64/slurm//auth_munge.so: Incompatible Slurm plugin version (15.8.6)
[2016-01-08T09:55:24.064+00:00]  Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2016-01-08T09:55:24.065+00:00]  cannot create auth context for auth/munge
[2016-01-08T09:55:24.065+00:00]  Unpacking authentication credential: authentication initialization failure

The job output file is also full of similar messages:

slurmstepd: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
slurmstepd: cannot create auth context for auth/munge

The slurmctld.log file contains many entries like this:

[2016-01-08T09:56:48.116+00:00] error: slurm_receive_msg: Zero Bytes were transmitted or received

Eventually we get this in the slurmd.log file:

[2016-01-08T10:57:05.310+00:00]  Unable to send job complete message: Protocol authentication error
[2016-01-08T10:57:08.000+00:00]  done with job

This appears to be a consequence of the change introduced in 15.08 to verify plugin version numbers:
https://github.com/SchedMD/slurm/blob/slurm-15.08/RELEASE_NOTES#L76-L83

This is the relevant code:
https://github.com/SchedMD/slurm/blob/slurm-15.08/src/common/plugin.c#L207-L216

If I understand correctly, slurmstepd is launched under 15.08.5 and then fails to load the 15.08.6 auth_munge plugin after the node is upgraded. I'm not sure how this is meant to work, but I assume this isn't the intended behaviour. Certainly we didn't have this problem when we tested upgrading from 14.11 to 15.08. As things stand, with 15.08 it doesn't appear to be possible to safely upgrade nodes whilst jobs are running.
Comment 1 Tim Wickberg 2016-01-08 02:28:30 MST
Embarrassingly, the comment you've highlighted notes what the fix is. I'll have a patch available shortly, and this'll be included in 15.08.7, which we expect to release towards the end of January.
Comment 2 Tim Wickberg 2016-01-08 03:25:29 MST
Actually, changing the check to compare major/minor only would not help. The real solution is to load the authentication plugin up-front, rather than lazy-loading it when the job completes - slurmstepd only communicates after the job has finished, since it receives the initial job information directly from slurmd at the start. Even if I changed that check, upgrading between versions would still fail.

Commit 870273ca1499 fixes this, and will be in 15.08.7, due out in a few weeks. If you want to apply this patch now you can download it here:

https://github.com/SchedMD/slurm/commit/870273ca1499.patch

While reproducing this I noticed that slurmstepd will not clean up properly - you may want to check for stray slurmstepd's on your test nodes and kill them manually.

- Tim
Comment 3 David Matthews 2016-01-10 23:43:07 MST
Tim - thanks for the fix. As far as I can see, slurmstepd exited after it timed out (~1 hour after job completion - see the slurmd.log messages I reported), so I didn't have to do any clean-up.

Do you plan on putting out something on the mailing list? I assume anyone who is already running 15.08 is going to hit this when they next upgrade unless they patch their existing release first.
Comment 4 Tim Wickberg 2016-01-11 01:17:06 MST
(In reply to David Matthews from comment #3)
> Tim - thanks for the fix.

Certainly, that's what we're here for. I hope that didn't cause too many issues.

> Do you plan on putting out something on the mailing list? I assume anyone
> who is already running 15.08 is going to hit this when they next upgrade
> unless they patch their existing release first.

We'll note the upgrade issue on the next point release, and may have to warn about it on the next major as well. I don't think upgrading the installation while the node is live like you've done is especially common, though - a lot of sites tend to upgrade it in a node image and reimage the node during maintenance, or have it installed on a central NFS mount and maintain a "current" symlink to flip between releases (which would still have allowed slurmstepd to dlopen() the correct version from its own install directory).