Ticket 9195 - Recompiled plugins appear non-functional under Slurm 20.02
Summary: Recompiled plugins appear non-functional under Slurm 20.02
Status: RESOLVED DUPLICATE of ticket 9081
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.02.3
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-08 12:07 MDT by Chris Samuel (NERSC)
Modified: 2020-06-09 10:40 MDT
CC List: 4 users

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Chris Samuel (NERSC) 2020-06-08 12:07:38 MDT
Hi there,

I've started testing Slurm 20.02, and our local ReFrame test suite (which was all green against 19.05 with the set of tests I currently have checked out) is flagging problems with 20.02 in the tests for two of our local plugins.

This is a blocker for us upgrading to Slurm 20.02 on Cori.

I've already looked through the "RELEASE_NOTES" for any information about changes to how plugins work, but all it mentions (as usual) is that they have to be recompiled.


One plugin adds an "--sdn" option to the commands; when the job starts on a compute node, it sends a message from there to a web service on the slurmctld node, which does a bunch of setup and returns an IP address that the plugin uses to populate the user's environment.

Whilst the --sdn option is accepted, there is no sign of any activity from the compute node in the logs on the server, and the batch job unsurprisingly gets no $SDN_IP set in its environment. From the slurmd logs I see:

slurmstepd: debug2: spank: sdn_plugin.so: task_init = 0

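For context, the option and environment handling is a bog-standard SPANK plugin; the relevant hooks are roughly of this shape (a heavily simplified sketch rather than our actual source, with the web-service call stubbed out):

/* Illustrative sketch only -- not the real sdn_plugin source. */
#include <slurm/spank.h>

SPANK_PLUGIN(sdn_plugin, 1);

static int sdn_requested = 0;

/* Called when --sdn is given on the command line. */
static int sdn_opt_cb(int val, const char *optarg, int remote)
{
    sdn_requested = 1;
    return 0;
}

struct spank_option spank_options[] = {
    { "sdn", NULL, "Set up SDN networking for this job", 0, 0,
      (spank_opt_cb_f) sdn_opt_cb },
    SPANK_OPTIONS_TABLE_END
};

int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
    if (!sdn_requested || !spank_remote(sp))
        return ESPANK_SUCCESS;

    /* The real plugin picks up state written by slurm_spank_job_prolog(),
     * calls the web service, and exports the address it hands back;
     * the value here is just a placeholder. */
    return spank_setenv(sp, "SDN_IP", "10.128.0.1", 1);
}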

The other plugin adds a "--perf=$foo" option to the commands; it makes whatever kernel changes are necessary on the compute node to allow tools like "perf" or "VTune" to work (and then resets them again afterwards).

Again the --perf option is accepted, and as before it warns if you don't have the right environment module loaded for the tool specified, but it has no effect on the compute node itself.

csamuel@gert01:~> srun --perf=vtune -q interactive -C haswell cat /proc/sys/kernel/perf_event_paranoid
srun: error: nersc_perf: missing required environmental module: vtune/2020

csamuel@gert01:~> module load vtune/2020
csamuel@gert01:~> srun --perf=vtune -q interactive -C haswell cat /proc/sys/kernel/perf_event_paranoid
1

Once more in the logs I see:

slurmstepd: debug2: spank: perf.so: task_init = 0

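The kernel tweak itself is done from a SPANK hook on the compute node; the gist is something like the following sketch (illustrative and simplified rather than our actual source -- the real plugin also validates the --perf argument, picks the value per tool, and restores the old setting afterwards):

/* Illustrative sketch only -- not the real perf.so source. */
#include <slurm/spank.h>
#include <stdio.h>

SPANK_PLUGIN(perf, 1);

/* Shown in slurm_spank_job_prolog() for illustration; the work needs a hook
 * that runs with privileges on the compute node in order to write to /proc/sys. */
int slurm_spank_job_prolog(spank_t sp, int ac, char **av)
{
    FILE *fp = fopen("/proc/sys/kernel/perf_event_paranoid", "w");

    if (!fp) {
        slurm_error("perf: cannot open perf_event_paranoid for writing");
        return -1;
    }
    fputs("0\n", fp);   /* relax the setting so perf/VTune can sample (value illustrative) */
    fclose(fp);
    return ESPANK_SUCCESS;
}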

Looking at the slurmd logs, I can see evidence that our zonesort plugin does still seem to run:

[2020-06-08T11:02:23.619] [1936802.0] zonesort: set sort interval to 0 seconds (2 bytes written), Success
[2020-06-08T11:02:23.848] [1936802.0] zonesort: initiated memory compaction (2 bytes written), Success
[2020-06-08T11:02:23.848] [1936802.0] zonesort: found 2 numa domains
[2020-06-08T11:02:23.848] [1936802.0] zonesort: will perform zonesort for numa node 0
[2020-06-08T11:02:23.848] [1936802.0] zonesort: wrote 2 bytes to zone_sort_free_pages: Success
[2020-06-08T11:02:23.848] [1936802.0] zonesort: will perform zonesort for numa node 1
[2020-06-08T11:02:23.849] [1936802.0] zonesort: wrote 2 bytes to zone_sort_free_pages: Success


So I'm not clear on what would be causing issues here.


Were there any other changes to the plugin infrastructure, omitted from the RELEASE_NOTES, that could be causing this?


All the best,
Chris
Comment 1 Chris Samuel (NERSC) 2020-06-08 16:47:30 MDT
Tracing through the SDN code and running slurmd at debug3, it looks like it's failing because a file that should be created by slurm_spank_job_prolog() does not exist; Aditi and I suspect that hook is not being called.

Is that possible?
Comment 2 Chris Samuel (NERSC) 2020-06-08 16:52:33 MDT
Hi there,

Is it possible that the work done in https://bugs.schedmd.com/show_bug.cgi?id=7286 has caused slurm_spank_job_prolog() to suddenly no longer work?

All the best,
Chris
Comment 5 Jason Booth 2020-06-09 09:29:25 MDT
Hi Chris, this looks like it may be a duplicate of bug #9081 comment #5.

As Dominik noted there:

>I think I found the source of this regression.
>You need to directly set PlugStackConfig in slurm.conf.
>Let me know if this helps.

We are still looking into fixing this in bug #9081. Would you try the above workaround and let us know if it changes the behavior for you, so that we can confirm the issue?
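
Concretely, the workaround is just to point slurm.conf at your spank configuration explicitly instead of relying on the built-in default; for example (the paths below are only placeholders -- adjust them to wherever your files actually live):

# slurm.conf
PlugStackConfig=/etc/slurm/plugstack.conf

# /etc/slurm/plugstack.conf -- one "required"/"optional" line per SPANK plugin
optional /usr/lib64/slurm/spank/sdn_plugin.so
optional /usr/lib64/slurm/spank/perf.so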
Comment 6 Chris Samuel (NERSC) 2020-06-09 10:03:53 MDT
Hi Jason,

Thanks so much for that pointer, we'll try that out today.

I suspect this is also what my friend Laszlo back in Melbourne is seeing too:

https://bugs.schedmd.com/show_bug.cgi?id=9160

All the best,
Chris
Comment 7 Jason Booth 2020-06-09 10:12:26 MDT
>I suspect this is also what my friend Laszlo back in Melbourne is seeing too:

We have linked those two issues. There has been some internal discussion about this, and it does seem like a duplicate of bug #9160.
Comment 8 Marshall Garey 2020-06-09 10:27:37 MDT
As I did for 9160, I'm closing this as a duplicate of bug 9081. Please re-open it if setting PlugStackConfig in slurm.conf doesn't work.

*** This ticket has been marked as a duplicate of ticket 9081 ***
Comment 9 Chris Samuel (NERSC) 2020-06-09 10:40:25 MDT
Hey Jason, Marshall,

Thanks so much, that has indeed fixed all the plugin failures that ReFrame found; we're down to the two failures from bug #9186 now.

All the best!
Chris