Ticket 14981 - Nodes stuck in completing after minor version upgrade
Summary: Nodes stuck in completing after minor version upgrade
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.3
Hardware: Linux Linux
Importance: --- 2 - High Impact
Assignee: Oscar Hernández
QA Contact:
URL:
Duplicates: 17893
Depends on:
Blocks:
 
Reported: 2022-09-15 16:53 MDT by Michael Robbert
Modified: 2023-10-12 13:26 MDT
CC List: 6 users

See Also:
Site: Colorado School of Mines
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05.5
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---



Description Michael Robbert 2022-09-15 16:53:30 MDT
We updated the Slurm RPMs from 22.05.1 to 22.05.3 on a bunch of compute nodes today while jobs were running, and I am now seeing several nodes stuck in the completing state. Shortly after the update we started seeing these errors in the slurmd.log files on the nodes:

[2022-09-15T11:00:44.647] Slurmd shutdown completing
[2022-09-15T11:00:44.886] slurmd version 22.05.3 started
[2022-09-15T11:00:44.887] slurmd started on Thu, 15 Sep 2022 11:00:44 -0600
[2022-09-15T11:00:48.027] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=191890 TmpDisk=9990 Uptime=1800456 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
[2022-09-15T11:06:15.848] [13648108.batch] plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.3)
[2022-09-15T11:06:15.849] [13648108.batch] error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
[2022-09-15T11:06:15.850] [13648108.batch] error: cannot create hash context for K12
[2022-09-15T11:06:15.851] [13648108.batch] error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
[2022-09-15T11:06:15.851] [13648108.batch] Retrying job complete RPC for StepId=13648108.batch
[2022-09-15T11:06:30.856] [13648108.batch] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2022-09-15T11:06:30.858] [13648108.batch] error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
[2022-09-15T11:06:30.858] [13648108.batch] Retrying job complete RPC for StepId=13648108.batch
[2022-09-15T11:06:45.860] [13648108.batch] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2022-09-15T11:06:45.862] [13648108.batch] error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
[2022-09-15T11:06:45.862] [13648108.batch] Retrying job complete RPC for StepId=13648108.batch
[2022-09-15T11:07:00.868] [13648108.batch] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2022-09-15T11:07:00.869] [13648108.batch] error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
[2022-09-15T11:07:00.869] [13648108.batch] Retrying job complete RPC for StepId=13648108.batch
[2022-09-15T11:07:01.786] [13648107.batch] plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.3)
[2022-09-15T11:07:01.787] [13648107.batch] error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
[2022-09-15T11:07:01.789] [13648107.batch] error: cannot create hash context for K12
[2022-09-15T11:07:01.790] [13648107.batch] error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error

Many of these nodes still have running jobs on them in addition to the completing jobs, so I can't just reboot them. Is there something I can do to easily and quickly kill off all the completing jobs without causing them to show up as failed?

Any thoughts on why this may have happened? I thought that minor version upgrades on running nodes were allowed.

Thanks,
Mike
Comment 4 Michael Robbert 2022-09-16 09:39:34 MDT
I'm raising the importance of this ticket because we are now getting complaints from users: no new jobs can start until some of these stuck jobs finish completing.
I think I will be able to kill the stuck slurmstepd processes by using pkill to match on the job IDs that are stuck, but I'd like to know what impact that will have on the job status.

Mike
Comment 5 Oscar Hernández 2022-09-16 10:02:35 MDT
Hi Mike,

It looks like you are hitting a library conflict here. For the new installation, did you install into a new directory, or did you replace the previous files?

slurmd is designed to support upgrades while jobs are running. However, even after a new slurmd is started, slurmstepd keeps running the older version until its job terminates. This is the expected behavior, and slurmstepd is prepared to support it.

The problem comes when libraries that the running slurmstepd was using are modified. I suspect this is your issue, as I was able to reproduce the same error by replicating your scenario and installing an updated version over the old one.

There is some mention of that in the upgrade section here[1] (I would recommend taking a look):

"A common approach when performing upgrades is to install the new version of Slurm to a unique directory and use a symbolic link...It also avoids potential problems with library conflicts"

That is the reason we always recommend installing new versions into separate directories. I would expect newly submitted jobs to work fine, though.
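
As a minimal sketch of that layout (the /opt/slurm paths here are just illustrative):

  # Install each release into its own directory
  ./configure --prefix=/opt/slurm/22.05.3 && make && make install

  # PATH and the slurmd/slurmctld service files reference only the stable
  # symlink, so retargeting it switches versions without touching the files
  # an already-running slurmstepd still has open
  ln -sfn /opt/slurm/22.05.3 /opt/slurm/current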

As for the running jobs: are they still running? I cannot think of any way of killing them gracefully at the moment, but killing (kill -9) the long-running slurmstepd processes on the compute nodes should terminate the jobs.
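
For example, a hedged sketch of targeting only the stuck steps (the slurmstepd process-title format and the job ID are illustrative; verify with ps before killing anything):

  # List the slurmstepd processes and the step each one belongs to
  ps -eo pid,cmd | grep '[s]lurmstepd'

  # Force-kill the stepd of one stuck job (job ID is an example)
  pkill -9 -f 'slurmstepd: \[13648108\.'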
 
Let me know if you think that could be your case, or you have any other information to share.

Kind regards,
Oscar

[1]https://slurm.schedmd.com/quickstart_admin.html#upgrade
Comment 6 Michael Robbert 2022-09-16 10:10:57 MDT

I have updated Slurm the same way that I always do, via “yum update”. The RPMs are built with a standard rpmbuild using the provided spec file. Obviously, this method replaces all the libraries that slurmstepd uses. I might expect some library conflicts during a major upgrade, but this was a minor version upgrade.

So, is an RPM update not supported with running jobs?

I will work on killing all of the slurmstepd’s that are stuck.

Comment 7 Oscar Hernández 2022-09-16 10:43:10 MDT
I see. In that case, that is certainly what happened.

>So, is an RPM update not supported with running jobs?

RPMs can be built with the "%_prefix" macro to deal with that, but with the default RPM installation it is not supported. As stated in the documentation, different library errors can appear.
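
A hedged sketch of that approach (the tarball name and prefix path are illustrative; check the spec file shipped with your release for the exact macros it honors):

  # Build the RPMs so everything lands under a per-version prefix
  rpmbuild -ta slurm-22.05.3.tar.bz2 --define '_prefix /opt/slurm/22.05.3'

  # After installing those RPMs, retarget the stable symlink used by the nodes
  ln -sfn /opt/slurm/22.05.3 /opt/slurm/current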

As for terminating the jobs: if they do not finish when you kill the processes, try to "scancel" them afterwards.

Kind regards,
Oscar
Comment 8 Michael Robbert 2022-09-16 12:19:41 MDT

I am still not getting a very clear answer to my question. Are in-place upgrades on a running system supported for minor version upgrades?

The documentation that you point to is worded very softly and, to me, indicates that it isn’t required and is just a suggestion. Is it true that only certain code changes would cause a problem like this? If so, what changed in the code for the hash_k12.so library to cause it to be unloadable? I don’t see anything in the NEWS file that indicates a change to that library.

If we did move to a per-version path for our installs, would that need to be done on a node-local filesystem with the symlink changed per node at a particular point in time, or could it be done on a shared filesystem that would affect all nodes at once? I’m having trouble understanding how that method would avoid this problem. It seems like when slurmstepd tries to load a library/plugin, it would use the symlink, which would be pointing at the new install after the update.

Comment 9 Oscar Hernández 2022-09-19 10:50:14 MDT
Hi Mike,

My apologies if I was not clear on that matter. Let me answer in-line.

>    I am still not getting a very clear answer to my question. Are
>    in-place upgrades on a running system supported for minor version
>    upgrades?

In your scenario (installing via yum update into the system default locations), they are not supported. As mentioned, replacing libraries while running binaries are still using them is not good practice; it might work most of the time, but it can fail in uncontrolled ways. If you want to upgrade that way, nodes should be drained (making sure there are no running jobs) before upgrading.
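
For reference, a hedged sketch of draining nodes before an in-place upgrade (node names and the reason string are illustrative):

  # Stop new work from landing on the nodes; running jobs finish normally
  scontrol update NodeName=node[001-010] State=DRAIN Reason="slurm upgrade"

  # After the upgrade, return the nodes to service
  scontrol update NodeName=node[001-010] State=RESUME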

Slurm is designed to support it when using independent installations (not overwriting the previous files), so switching to that strategy would allow you to do rolling upgrades.
     
>    The documentation that you point to is worded very softly and to me
>    indicates that it isn’t required and just a suggestion. Is it true
>    that only certain code changes would cause a problem like this? If so,
>    what changed in the code for the hash_k12.so library to cause it to be
>    unloadable. I don’t see anything in the NEWS file that indicates a
>    change to that library.

I did some testing on this. Slurm apparently handled it well in 21.08, but in 22.05, to improve RPC security, slurmstepd now loads hash_k12.so, and I suspect that is why the error appears now. We expect Slurm to find correctly versioned plugins, so the error is expected in your situation. Thanks for pointing this out.
 
>    If we did move to a per-version path for our installs, would that need
>    to be done on a node local filesystem with the symlink changed per
>    node at a particular point in time or could it be done on a shared
>    filesystem that would affect all nodes at once. 

I wouldn't recommend installing Slurm on a shared filesystem for all nodes; it could have performance implications. The ideal scenario is for each node to have its own local Slurm installation, and the symlink change can be done on many nodes at a time (using some tool that manages files across multiple nodes).
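
As an illustrative sketch with a parallel shell (pdsh is just one such tool; node ranges and paths are assumptions):

  # Retarget the per-node symlink on a batch of nodes in one step
  pdsh -w node[001-100] 'ln -sfn /opt/slurm/22.05.5 /opt/slurm/current'

  # Restart slurmd so the daemon itself picks up the new version
  pdsh -w node[001-100] 'systemctl restart slurmd'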

>    I’m having trouble
>    understanding how that method would avoid this problem. It seems like
>    the when the slurmstepd tries to load a library/plugin it would use
>    the symlink which would be pointing at the new install after the
>    update.

When compiled, Slurm uses "rpath", so for a running binary each library is searched for in the original real path, no matter where the symlink points now. I did test your scenario using both procedures; running jobs complete without issues when using the symlink strategy.
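
You can confirm this on an installed binary; a hedged sketch (the path is illustrative):

  # Show the library search path baked into slurmstepd at build time; with a
  # versioned prefix it points at the real, version-specific directory
  readelf -d /opt/slurm/22.05.3/sbin/slurmstepd | grep -E 'RPATH|RUNPATH'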

I hope this clears up your doubts, but let me know if you still have any other questions or if something was not clear enough. Did the running-jobs situation get resolved by killing the stepd's?

Kind regards,
Oscar
Comment 10 Michael Robbert 2022-09-20 09:20:31 MDT

Working backwards through your questions and comments.

Yes, killing the slurmstepd’s did alleviate the problem.

Understanding that Slurm is compiled with rpath explains to me how using symlinks doesn’t break the library/plugin loads after an upgrade.

I still don’t understand what changed in hash_k12.so between 22.05.1 and 22.05.3. And are the plugins versioned down to the minor version? I thought that the API wasn’t supposed to change between minor versions.

Comment 11 Oscar Hernández 2022-09-21 09:45:38 MDT
Hi Mike,

As the CG (completing) problem has been solved, I am downgrading this ticket's severity. Feel free to raise it again if needed.

>   I still don’t understand what changed in hash_k12.so between 22.05.1
>   and 22.05.3. And, are the plugins versioned down to the minor version?
>   I thought that the API wasn’t supposed to change between minor
>   versions.

Nothing changed in hash_k12.so. The line that fails is actually a version check: the versions mismatch, so it does not even try to load the plugin:

slurm/src/common/plugin.c:

	if ((*version & mask) != (SLURM_VERSION_NUMBER & mask)) {
		int plugin_major, plugin_minor, plugin_micro;
		plugin_major = SLURM_VERSION_MAJOR(*version);
		plugin_minor = SLURM_VERSION_MINOR(*version);
		plugin_micro = SLURM_VERSION_MICRO(*version);

		info("%s: Incompatible Slurm plugin %s version (%d.%02d.%d)",
		     caller, fq_path, plugin_major, plugin_minor, plugin_micro);
		return EPLUGIN_BAD_VERSION;
	}

Why this check? Even though the API does not change, for (non-SPANK) plugins we expect the exact same version as the components that call them; otherwise, plugin errors can be difficult to track down.

Why does it fail now and not in previous versions? Although this hash_k12.so library is used by most Slurm components to generate hashes, the other components do not hit the problem:

- Client commands load it every time they are invoked. When you replace the whole installation, the new client matches the new library version, so no error is expected.
- Daemons (except slurmstepd) load it on startup, so it does not matter if the file is replaced after the daemon has started; they keep working with the copy they already have in memory. When restarted, they load the new one, which again matches versions.

Before 22.05, slurmstepd did not use this library. Now it does, but only at the end of a job's execution. In your scenario, slurmstepd starts running as the older version -> the library is replaced -> when it tries to load it (at the end of the job), it fails with the version mismatch. Issues like this can happen in other situations too, since binaries expect their libraries to still be there; that is the reason rolling upgrades that replace files in place are discouraged.
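
As a hedged diagnostic sketch, you can compare what a long-running stepd has mapped with what is now on disk (the PID is a placeholder):

  # A stepd started before the upgrade typically shows no hash plugin mapped
  # yet, since it only loads it at job completion
  grep 'hash_k12' /proc/<stepd_pid>/maps

  # Meanwhile, the file on disk already belongs to the new package
  rpm -qf /usr/lib64/slurm/hash_k12.so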

That's the general overview of the error you reported. Does that make sense to you?

Oscar
Comment 13 Oscar Hernández 2022-10-03 09:52:19 MDT
Hi Mike,

Just to update on this issue. You may have already seen the notice:

https://groups.google.com/g/slurm-users/c/5vLhW-oZLJE

After discussing it with my colleagues, the plan is to include a fix in a future 22.05 release to solve the issue experienced when upgrading from RPMs. Meanwhile, the previously discussed symlink strategy is the safe way to do rolling upgrades.

Apologies for the inconvenience caused,

Oscar
Comment 17 Oscar Hernández 2022-10-11 10:58:15 MDT
Hi Mike,

A fix for the upgrade issue has been committed to 22.05. The next release, 22.05.5, which should be coming out around this week, will already include the changes.

One note on that: because this bug is present in the slurmstepd that is already running in the cluster (alongside the running job) during the upgrade procedure, the same error will inevitably appear one more time. The commit prevents it from happening when upgrading from 22.05.5 onward.

So, any upgrade from 22.05.1-4 to 22.05.5-onward will have the issue with stuck CG jobs.

commit 57057885db6b2bd30a9de6e439155d7d73a91669
Author:     Oscar Hernández <oscar.hernandez@schedmd.com>
AuthorDate: Fri Sep 30 09:31:35 2022 +0200

    Initialize hash library on slurmstepd startup
    
    Hash library needs to be loaded at the beggining of slurmstepd execution,
    otherwise it can result in library incompatibility issues when performing
    rolling upgrades.
    
    Bug 14981

Kind regards,
Oscar
Comment 18 Oscar Hernández 2022-10-12 05:09:57 MDT
> So, any upgrade from 22.05.1-4 to 22.05.5-onward will have the issue with
> stuck CG jobs.

When I referenced 22.05.1-4, it should have been 22.05.0-4 (including the first 22.05 release). Sorry for the confusion.

I am closing this ticket. Please reopen it if there is any other related issue or question.

Thanks a lot for your feedback.

Oscar
Comment 19 Colin 2022-11-04 06:44:26 MDT
(In reply to Oscar Hernández from comment #13 and comment #9)

Hi Oscar,

Could you tell me: are in-place upgrades on a running system supported across a major version when using the symlink approach (e.g., from 21.xx to 22.xx with jobs running)?
Comment 20 Jason Booth 2022-11-04 11:34:00 MDT
Colin, our system does not associate you with a supported site. Furthermore, please refrain from taking over a resolved issue that is from another site. Please open a new issue for your own questions. Note that without a support contract in place, those requests will likely go unanswered.
Comment 21 Jason Booth 2023-10-12 13:26:16 MDT
*** Ticket 17893 has been marked as a duplicate of this ticket. ***