Bug 9081 - SPANK job prolog/epilog not called
Summary: SPANK job prolog/epilog not called
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.02.2
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Duplicates: 9195
Reported: 2020-05-19 08:20 MDT by David Gloe
Modified: 2020-12-04 03:37 MST
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 20.02.7


Attachments
slurmd log file (29.15 KB, text/plain)
2020-05-19 08:23 MDT, David Gloe
slurm.conf file (1.88 KB, text/plain)
2020-05-19 08:23 MDT, David Gloe

Description David Gloe 2020-05-19 08:20:33 MDT
It appears that in Slurm 20.02.2, the SPANK job prolog and epilog are no longer called.

I have a SPANK plugin that uses the slurm_spank_job_prolog and slurm_spank_job_epilog functions. The plugin worked correctly in Slurm 20.02.1, but no longer works with 20.02.2.
Comment 1 David Gloe 2020-05-19 08:23:19 MDT
Created attachment 14299: slurmd log file
Comment 2 David Gloe 2020-05-19 08:23:36 MDT
Created attachment 14300: slurm.conf file
Comment 3 David Gloe 2020-05-19 08:27:03 MDT
Looking at the code, I would expect to see debug messages like "_run_spank_job_script: calling %s spank prolog/epilog" in the slurmd log, but there aren't any.
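For anyone wanting to repeat this check on their own slurmd log, here is a minimal sketch that looks for the function name quoted above. The helper name and the sample log lines are made up for illustration, and the real log line layout may differ, so it matches only on the `_run_spank_job_script` substring.

```python
def find_spank_script_calls(log_text, marker="_run_spank_job_script"):
    """Return the log lines that contain the SPANK job-script debug marker."""
    return [line for line in log_text.splitlines() if marker in line]


# Made-up sample lines for illustration only; real slurmd output will differ.
sample_log = "\n".join([
    "debug: some unrelated message",
    "debug: _run_spank_job_script: calling <mode> spank prolog",
])

hits = find_spank_script_calls(sample_log)
print(f"{len(hits)} matching line(s)")
```

Since these are debug messages, an empty result only means something if slurmd was logging at a debug level when the job ran.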
Comment 5 Dominik Bartkiewicz 2020-05-19 08:46:10 MDT
Hi

I think I found the source of this regression.
You need to set PlugStackConfig explicitly in slurm.conf instead of relying on its default value.
Let me know if this helps.

Dominik
Comment 6 David Gloe 2020-05-19 08:55:18 MDT
Yes, looks like that workaround is effective. We've been relying on the default value for PlugStackConfig.
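A quick way to tell whether a site is relying on the default is to check slurm.conf for an explicit PlugStackConfig entry. The sketch below is only an illustration: the function name is hypothetical, and it assumes plain `Key=Value` lines with `#` comments and case-insensitive keys, without handling every slurm.conf feature (such as included files).

```python
def explicit_plugstack_config(conf_text):
    """Return the PlugStackConfig value set in slurm.conf text, or None if absent."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "plugstackconfig":
            return value.strip()
    return None


# Illustrative input; the path is the one reported for a "configless" setup
# later in this ticket, other sites will use a different location.
example_conf = "ClusterName=example\nPlugStackConfig=/run/slurm/conf/plugstack.conf\n"
print(explicit_plugstack_config(example_conf))
```

A return value of None means the file relies on the default, which is the situation affected by this bug.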
Comment 10 Marshall Garey 2020-06-09 10:26:00 MDT
*** Bug 9160 has been marked as a duplicate of this bug. ***
Comment 11 Marshall Garey 2020-06-09 10:27:37 MDT
*** Bug 9195 has been marked as a duplicate of this bug. ***
Comment 13 peter.georg 2020-06-24 13:31:38 MDT
A remark for other users hit by this issue:
We hit the same issue at our site. Setting the PlugStackConfig path fixed it. However, in our case this is somewhat awkward, as we are running a "configless" Slurm. In that case one has to set PlugStackConfig=/run/slurm/conf/plugstack.conf to work around the issue.
Comment 14 Robin Humble 2020-09-01 03:45:55 MDT
This just bit us in an otherwise undramatic 19.05.6 to 20.02.4 upgrade.

Perhaps this gotcha could be put into the 20.x Release Notes until the bug is fixed?
Comment 15 Kaizaad 2020-09-15 16:47:58 MDT
This just bit us too, with an upgrade from 17.x to 20.02.4.
+1 for adding it to the 20.x Release Notes.

-k
Comment 16 Dominik Bartkiewicz 2020-09-16 07:04:51 MDT
Hi

Sorry that this took so long. The fix is not included in 20.02.4, and it hasn't been pushed to the git repo yet.
But the workaround is easy and requires only setting PlugStackConfig explicitly in slurm.conf.

Dominik
Comment 17 Robin Humble 2020-09-17 01:13:15 MDT
Hi Dominik,

no probs that the bug fix takes a while, and agree that the workaround is easy.

I guess the part that wasn't easy for us was to make the leap from "oh hell - everything in pro/epi is broken" to finding this ticket.

TBH we kinda assumed that such a showstopper (for us) bug wouldn't have lasted through multiple 20.02.x releases, so when we hit issues we didn't even think to look for a ticket. instead during the upgrade we did a deep dive through pro/epi and slurmd and then into the spank plugin which was half-running and half-failing, assuming it was a problem there. it was very confusing. a bug in our modified spank tmpdir plugin would also be our problem, so we didn't think to contact slurm support. only a random plea for help to a friend at NERSC who happened to be awake at the time pointed us to here, and saved us from either rolling back slurm versions or a bunch more hours of downtime. perhaps we would have searched for tickets on our own eventually, but regardless, it was not a good day.

anyway, sorry for the war story, but that's why we thought it'd be helpful to make this ticket a bit more visible to folks doing upgrades.

cheers,
robin
Comment 18 Dominik Bartkiewicz 2020-09-18 04:49:51 MDT
Hi

I totally agree with you.
I'll escalate this bug internally.

Dominik
Comment 19 Chris Samuel (NERSC) 2020-09-18 11:32:48 MDT
(In reply to Dominik Bartkiewicz from comment #18)

> I totally agree with you.
> I'll escalate this bug internally.

Thanks Dominik!
Comment 22 Bas van der Vlies 2020-09-24 02:24:12 MDT
We just installed 20.02.5 and also found out the hard way that the SPANK plugin `private-tmp` was not working. After setting `PlugStackConfig`, everything works as expected. I also thought it was a problem in the SPANK plugin, but it is a Slurm problem. It would be nice if this were mentioned on the NEWS/Changelog page.
Comment 26 Dominik Bartkiewicz 2020-12-04 03:37:01 MST
Hi

One more time, sorry that this took so long.
The patch finally landed in the repo, and it will be included in 20.02.7 and above.
https://github.com/SchedMD/slurm/commit/246ccd109eb6f470
I'll go ahead and close the ticket.

Dominik