Ticket 12309 - SPANK ABI forward-compatibility break in 21.08
Summary: SPANK ABI forward-compatibility break in 21.08
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd (show other tickets)
Version: 21.08.x
Hardware: Linux Linux
: --- 5 - Enhancement
Assignee: Nate Rini
QA Contact: Tim Wickberg
URL:
Depends on: 11361
Blocks:
  Show dependency treegraph
 
Reported: 2021-08-18 17:51 MDT by Felix Abecassis
Modified: 2021-09-13 11:07 MDT (History)
4 users (show)

See Also:
Site: NVIDIA (PSLA)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Note You need to log in before you can comment on or make changes to this ticket.
Description Felix Abecassis 2021-08-18 17:51:48 MDT
Commit https://github.com/SchedMD/slurm/commit/00283d01c1201858fba3e826fe90266fcd4f76b4 broke the SPANK ABI forward compatibility. A SPANK plugin (not using internal Slurm APIs) compiled against spank.h from 20.11 cannot be deployed with Slurm 21.08.


For example:
https://github.com/NVIDIA/pyxis/blob/68333337b8e198e9c720b626c890a27f62c60645/pyxis_slurmstepd.c#L989-L991
If compiled againt spank.h from 20.11, the generated code will test again ESPANK_NOSPACE=6, but spank_getenv from 21.08 will return ESPANK_NOSPACE=3005 instead.

We do not have access to bug 11361, mentioned in the commit. But it looks like a simple refactoring, was it worth breaking the ABI compatibility for this?
Comment 1 Nate Rini 2021-08-19 09:05:41 MDT
(In reply to Felix Abecassis from comment #0)
> But it looks
> like a simple refactoring, was it worth breaking the ABI compatibility for
> this?

I certainly hope it is worth it. Thanks for testing the release candidate!

The changes in bug#11361 are a precursor to fixing bug#7573. Before 21.08, the SPANK errors exist in a separate namespace than the Slurm error namespace which leads to those errors getting dropped (outright) or getting jammed into the generic SLURM_ERROR. This is all being done to actually fix the problem permanently instead of a bandaid fix that would likely suffer regressions down the road.

There is currently a patch pending review (in bug#11361) that will add this to the RELEASE_NOTES to explicitly warn about the ABI break.
+ -- All SPANK error codes now start at 3000. Where previously SPANK would give
+    a return code of 1, it will now return 3000.

I assume by your comment#0 that everything works just fine with a recompile?
Comment 3 Felix Abecassis 2021-08-19 09:24:22 MDT
> The changes in bug#11361 are a precursor to fixing bug#7573. Before 21.08, the SPANK errors exist in a separate namespace than the Slurm error namespace which leads to those errors getting dropped (outright) or getting jammed into the generic SLURM_ERROR.

I don't see how it's related to bug 7573. Bug 7573 is about the return code from SPANK plugin callbacks, which always returned an "int" and not spank_err_t:
https://github.com/SchedMD/slurm/blob/slurm-20.11/slurm/spank.h#L47
And it's still the case:
https://github.com/SchedMD/slurm/blob/slurm-21-08-0-0rc2/slurm/spank.h#L49

Are you saying that spank_f will return spank_err_t in the future? If yes, we don't really need to return detailed error codes from our callbacks, as long as the error is not ignored anymore.

> I assume by your comment#0 that everything works just fine with a recompile?

Yes, but my understanding is that the SPANK ABI has been forward compatible for years, it's surprising to suddenly break it for a minor change. You could also morph the SPANK error codes inside the Slurm error codes, but without changing their values, that would likely fix the ABI break.
Comment 4 Felix Abecassis 2021-08-19 09:38:23 MDT
If you are also planning to change the return type of spank_f from int to spank_err_t (which we don't need, our callbacks always returned -1 on error), then please do so for 21.08. To avoid breaking the SPANK ABI again for the next release.
Comment 5 Nate Rini 2021-08-19 09:47:33 MDT
(In reply to Felix Abecassis from comment #3)
> I don't see how it's related to bug 7573. Bug 7573 is about the return code
> from SPANK plugin callbacks, which always returned an "int" and not
> spank_err_t:

spank_err_t will always fit in an int so that is fine. The typedef is more to allow compiler time (lint) checks.

> Are you saying that spank_f will return spank_err_t in the future? If yes,
> we don't really need to return detailed error codes from our callbacks, as
> long as the error is not ignored anymore.
Maybe in a future major release, but either way using an 'int' in the plugins will still work as Slurm hard codes the return type as int almost everywhere.

> > I assume by your comment#0 that everything works just fine with a recompile?
> 
> Yes, but my understanding is that the SPANK ABI has been forward compatible
> for years, it's surprising to suddenly break it for a minor change.

We have tried to avoid any breaking changes but we have always reserved the right to require a SPANK plugin re-compile per [https://slurm.schedmd.com/spank.html]:
> Note: SPANK plugins using the Slurm APIs need to be recompiled when upgrading Slurm to a new major release.
 
> You could also morph the SPANK error codes inside the Slurm error codes, but
> without changing their values, that would likely fix the ABI break.
I would have preferred to do that but the Slurm errors also include all of the existing Linux/POSIX error codes which Slurm honors and uses that would conflict.

(In reply to Felix Abecassis from comment #4)
> If you are also planning to change the return type of spank_f from int to
> spank_err_t (which we don't need, our callbacks always returned -1 on
> error), then please do so for 21.08. To avoid breaking the SPANK ABI again
> for the next release.
We have already hit the feature freeze for 21.08 and any change like that will *not* happen in 21.08 without a major breaking bug getting found. The change of the error codes was already a large change and we try to avoid them.

Also AFAIK, changing int to spank_err_t will not cause an ABI break for any existing complied code on x86_64.
Comment 6 Felix Abecassis 2021-08-19 09:57:02 MDT
> spank_err_t will always fit in an int so that is fine. The typedef is more to allow compiler time (lint) checks.

Sure, but given spank.h, returning ESPANK_NOEXIST from a SPANK plugin callback is wrong, as all the other functions in this file are indeed defined to return spank_err_t. I don't think there was any confusion here, again we always returned -1 and the ask in Bug 7573 is not about being able to return all the possible Slurm error codes from a SPANK plugin.


> We have tried to avoid any breaking changes but we have always reserved the right to require a SPANK plugin re-compile per [https://slurm.schedmd.com/spank.html]:

No, this doesn't apply here, see https://bugs.schedmd.com/show_bug.cgi?id=11308
Comment 7 Nate Rini 2021-08-19 10:26:02 MDT
(In reply to Felix Abecassis from comment #6)
> > We have tried to avoid any breaking changes but we have always reserved the right to require a SPANK plugin re-compile per [https://slurm.schedmd.com/spank.html]:
> 
> No, this doesn't apply here, see
> https://bugs.schedmd.com/show_bug.cgi?id=11308

(In reply to Marshall Garey from bug#11308 comment#1)
> Although you man not need to recompile SPANK plugins that don't call
> "slurm_*" functions, I recommend recompiling them anyway on every major
> version upgrade.

Not sure I follow your logic, while there is a typo of "man" -> "may", at no point did Marshall say a recompile would never be required. He even recommends a recompile for every major version upgrade.

I will see about updating the wording in the doc to make it more explicit to avoid any confusion in the future.

Thanks,
--Nate
Comment 8 Felix Abecassis 2021-08-19 10:39:06 MDT
I was pointing out that the line you quoted from the release notes does not apply here. It's clear that a plugin depending on internal Slurm symbols will require a recompile. You can indeed break the ABI if you decide to, but it's a drastic change for an ABI that has been forward compatible for many releases now. And it's surprising that to fix bug 7573 which is about the Slurm -> SPANK plugin interface, you first break the SPANK plugin -> Slurm interface (where A -> B means "A calls into B").

If the SPANK ABI forward compatibility cannot be guaranteed, we need a new API (guaranteed to be ABI forward compatible forever) for SPANK plugins to query the Slurm version and possibly refuse to load on untested/unsupported versions.
Comment 9 Nate Rini 2021-08-19 10:50:22 MDT
(In reply to Felix Abecassis from comment #8)
> If the SPANK ABI forward compatibility cannot be guaranteed, we need a new
> API (guaranteed to be ABI forward compatible forever) for SPANK plugins to
> query the Slurm version and possibly refuse to load on untested/unsupported
> versions.

The Slurm versions are already exposed via spank_get_item():
>    S_SLURM_VERSION,         /* Current Slurm version (char **)              */
>    S_SLURM_VERSION_MAJOR,   /* Slurm version major release (char **)        */
>    S_SLURM_VERSION_MINOR,   /* Slurm version minor release (char **)        */
>    S_SLURM_VERSION_MICRO,   /* Slurm version micro release (char **)        */

These could be checked for known compatibility by a plugin at load time.
Comment 10 Felix Abecassis 2021-08-19 10:55:05 MDT
No, I mentioned "guaranteed to be ABI forward compatible forever", but these fields are queried through "spank_err_t spank_get_item()" which just had an ABI break in 21.08. So this can't work.
Comment 11 Felix Abecassis 2021-08-19 10:58:45 MDT
Unless you can guarantee that the ABI of spank_getitem will not break again. Otherwise we need something simpler, like a single function call taking no argument and returning an int encoding the major Slurm version (therefore no risk in the arguments or return type getting an ABI break).
Comment 12 Nate Rini 2021-08-19 11:24:14 MDT
(In reply to Felix Abecassis from comment #10)
> No, I mentioned "guaranteed to be ABI forward compatible forever", but these
> fields are queried through "spank_err_t spank_get_item()" which just had an
> ABI break in 21.08. So this can't work.

(In reply to Felix Abecassis from comment #11)
> Unless you can guarantee that the ABI of spank_getitem will not break again.
> Otherwise we need something simpler, like a single function call taking no
> argument and returning an int encoding the major Slurm version (therefore no
> risk in the arguments or return type getting an ABI break).

I have swapped this to an RFE as this functionality does not currently exist and any C Macros will not be useful without a recompile.
Comment 13 Felix Abecassis 2021-08-19 11:25:49 MDT
So the ABI break is considered as not being a bug then?
Comment 14 Nate Rini 2021-08-19 11:29:42 MDT
(In reply to Felix Abecassis from comment #13)
> So the ABI break is considered as not being a bug then?

Not between Major versions as this only requires a recompile and no code changes in any SPANK plugin. Tim will be getting CCed about the RFE status and may decide otherwise.
Comment 17 Nate Rini 2021-08-19 12:17:18 MDT
(In reply to Nate Rini from comment #14)
> (In reply to Felix Abecassis from comment #13)
> > So the ABI break is considered as not being a bug then?
> 
> Not between Major versions as this only requires a recompile and no code
> changes in any SPANK plugin. Tim will be getting CCed about the RFE status
> and may decide otherwise.

We are going to look into changing the SPANK plugin loader to ensure the plugins match the major/minor versions for Slurm and refuse to run them otherwise.
Comment 18 Felix Abecassis 2021-08-19 12:20:04 MDT
Please guarantee ABI compatibility across minor versions of one Slurm major version.
Comment 19 Felix Abecassis 2021-08-19 12:31:33 MDT
I was assuming that in "20.11.8"; "20.11" is the major version (as it's a year-month format not related to API changes), and "8" is the minor version. Perhaps that's not what you have in mind.
Comment 20 Nate Rini 2021-08-19 12:56:44 MDT
(In reply to Felix Abecassis from comment #19)
> "20.11" is the major version (as it's a year-month format not related to API changes), and "8" is the minor version.

Yes, that is correct.
Comment 30 Nate Rini 2021-09-13 11:07:15 MDT
Slurm 21.08.1+ will not enforce a re-compile of the SPANK plugins to the current major release.

Please reply if you have any further questions or issues.

Thanks,
--Nate
Comment 31 Nate Rini 2021-09-13 11:07:51 MDT
(In reply to Nate Rini from comment #30)
Correction:

> Slurm 21.08.1+ will *now* enforce a re-compile of the SPANK plugins to the
> current major release.