It appears that in Slurm 20.02.2, the SPANK job prolog and epilog are no longer called. I have a SPANK plugin that uses the slurm_spank_job_prolog and slurm_spank_job_epilog functions. The plugin was working correctly in Slurm 20.02.1, but not any more with 20.02.2.
Created attachment 14299: slurmd log file
Created attachment 14300: slurm.conf file
Looking at the code, I would expect to see debug messages like "_run_spank_job_script: calling %s spank prolog/epilog" in the slurmd log, but there aren't any.
Hi, I think I found the source of this regression. As a workaround, you need to set PlugStackConfig explicitly in slurm.conf instead of relying on its default. Let me know if this helps. Dominik
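For readers looking for the concrete change, the workaround above amounts to one explicit line in slurm.conf. This is only a minimal sketch: the path /etc/slurm/plugstack.conf is an assumption for illustration (it is not given in this ticket), so substitute the directory where your plugstack.conf actually lives.

```
# slurm.conf (fragment)
# Name the SPANK plugin stack file explicitly rather than relying on the default.
# ASSUMPTION: the path below is illustrative; point it at your site's plugstack.conf.
PlugStackConfig=/etc/slurm/plugstack.conf
```

slurmd on the compute nodes will most likely need a restart (or at least an `scontrol reconfigure`) before it picks up the new setting.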
Yes, looks like that workaround is effective. We've been relying on the default value for PlugStackConfig.
*** Bug 9160 has been marked as a duplicate of this bug. ***
*** Bug 9195 has been marked as a duplicate of this bug. ***
Remark for other users hit by this issue: we ran into the same problem at our site, and setting the PlugStackConfig path fixed it. However, in our case this is somewhat awkward, as we are running a "configless" Slurm. In that setup one has to set PlugStackConfig=/run/slurm/conf/plugstack.conf to work around this issue.
This just bit us in an otherwise undramatic 19.05.6 to 20.02.4 upgrade. Perhaps this gotcha could be put into the 20.x Release Notes until the bug is fixed?
Just bit us too, with an upgrade from 17.x to 20.02.4. +1 for adding this to the 20.x Release Notes. -k
Hi, sorry that this took so long. The fix is not included in 20.02.4, and it has not been pushed to the git repo yet. But the workaround is easy and only requires setting PlugStackConfig explicitly in slurm.conf. Dominik
Hi Dominik, no probs that the bug fix takes a while, and agreed that the workaround is easy. I guess the part that wasn't easy for us was making the leap from "oh hell - everything in pro/epi is broken" to finding this ticket. TBH we kinda assumed that such a showstopper (for us) bug wouldn't have lasted through multiple 20.02.x releases, so when we hit issues we didn't even think to look for a ticket. Instead, during the upgrade we did a deep dive through pro/epi and slurmd and then into the spank plugin, which was half-running and half-failing, assuming the problem was there. It was very confusing. A bug in our modified spank tmpdir plugin would also be our problem, so we didn't think to contact Slurm support. Only a random plea for help to a friend at NERSC, who happened to be awake at the time, pointed us here and saved us from either rolling back Slurm versions or a bunch more hours of downtime. Perhaps we would have searched for tickets on our own eventually, but regardless, it was not a good day. Anyway, sorry for the war story, but that's why we thought it'd be helpful to make this ticket a bit more visible to folks doing upgrades. Cheers, robin
Hi, I totally agree with you. I have escalated this bug internally. Dominik
(In reply to Dominik Bartkiewicz from comment #18) > I totally agree with you. > I internally escalate this bug. Thanks Dominik!
We just installed 20.02.5 and also found out the hard way that the spank plugin `private-tmp` was not working. After setting `PlugStackConfig`, everything works as expected. I also thought it was a problem in the spank plugin, but it is a Slurm problem. It would be nice if this were mentioned on the NEWS/Changelog page.
Hi, once again, sorry that this took so long. The patch has finally landed in the repo, and it will be included in 20.02.7 and above. https://github.com/SchedMD/slurm/commit/246ccd109eb6f470 I'll go ahead and close the ticket. Dominik