Bug 15078 - Nodes stuck in CG after minor version update
Summary: Nodes stuck in CG after minor version update
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.4
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Assignee: Oscar Hernández
 
Reported: 2022-09-29 19:03 MDT by Kilian Cavalotti
Modified: 2022-10-12 04:58 MDT
CC List: 5 users

Site: Stanford
Version Fixed: 22.05.5


Attachments
various logs (8.18 MB, application/gzip)
2022-09-29 21:11 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2022-09-29 19:03:29 MDT
Hi!

We just got bit by what is described in bug 14981, and we'll need to find a way to avoid this for future updates.

The symptom is draining nodes, and this error message in the slurmd logs:

error: hash_g_compute: hash plugin with id:0 not exist or is not loaded

We're also upgrading Slurm using RPMs, as we have done since version 2.6 or so, and we very much intend to continue doing so. We never had any issue before, and hit this problem for the first time upgrading from 22.05.3 to 22.05.4.

Obviously generating library mismatches between minor versions is not a viable approach, and this check will have to be relaxed. This is a major regression and a very problematic change. 

We'll need a fix to avoid the hash check mismatch and ensure compatibility between minor versions.

Thanks!
--
Kilian
Comment 1 Kilian Cavalotti 2022-09-29 19:06:42 MDT
Quick correction in the symptom description: by "draining", I meant that jobs (and nodes) were stuck in Completing state.
Comment 2 Jason Booth 2022-09-29 19:39:09 MDT
Would you attach your slurmd.log from one or two nodes and your slurmctld.log? Please also include the output of "scontrol show node" from those two nodes and "scontrol show job" for the jobs in a completing state, if there are any.
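
For example, with <nodename> and <jobid> substituted for an affected node and a stuck job, something along these lines should capture the relevant state:

scontrol show node <nodename>
scontrol show job <jobid>
squeue --states=CG --nodelist=<nodename>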
Comment 3 Kilian Cavalotti 2022-09-29 21:11:01 MDT
Created attachment 27042 [details]
various logs

Hi Jason,

(In reply to Jason Booth from comment #2)
> Would you attach your slurmd.log from one or two nodes and your
> slurmctld.log? Please also include the output of "scontrol show node" from
> those two nodes and "scontrol show job" for the jobs in a completing state,
> if there are any.

Absolutely, all the requested info is attached:
- slurmctld logs
- slurmd logs for 3 completing nodes (sh03-02n70, sh03-11n18, sh03-15n07)
- "scontrol show node" for those 3 nodes
- "squeue -w $those_nodes"
- "scontrol show job" for all the CG jobs on those nodes

Please note that the slurmctld and slurmd logs will contain a 22.05.3 -> 22.05.4 upgrade, then a 22.05.4 -> 22.05.3 downgrade that I tried before realizing we were hitting bug 14981, and then a new upgrade to 22.05.4 once I realized downgrading didn't solve the issue.

Thanks!
--
Kilian
Comment 5 Oscar Hernández 2022-09-30 03:31:30 MDT
Hi Kilian,

Thanks for the provided logs. I have been able to confirm that your problem is the same as in bug 14981. The last comment in that bug has a good overall explanation of the scenario, but I can clarify any other questions you might have.

For now, in order to recover the nodes, you will need to manually kill the slurmstepd processes on them (only the ones from the stuck jobs).
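
Something along these lines could help find and clean them up on an affected node (run as root; the job ID is only an example, and the exact process title may vary, so double-check before killing anything):

# list slurmstepd processes still holding the old, now-replaced binary
ls -l /proc/[0-9]*/exe 2>/dev/null | grep 'slurmstepd.*deleted'

# kill the stepd of one specific stuck job, e.g. job 1234567
pkill -f 'slurmstepd: \[1234567\.'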

Then, to successfully complete the upgrade procedure using system RPMs, you will need to make sure there are no running jobs on a node when you upgrade its slurmd (drain the nodes). A good way of achieving this is by scheduling several maintenance reservations, one for each set of nodes you want to upgrade.
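
For example, a reservation along these lines (name, node list, and times are just placeholders to adapt to your layout) prevents new jobs from running on those nodes during the reservation window while the running jobs finish:

scontrol create reservation reservationname=slurmd_upgrade_rack3 \
    starttime=2022-10-15T08:00:00 duration=04:00:00 \
    users=root flags=maint,ignore_jobs nodes=sh03-02n[01-72]

Once those nodes are idle, their slurmd can be upgraded and the reservation deleted.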

There is no problem with having some slurmds upgraded while others are still running the older version, at least for the current version.

If you still want to perform the upgrade while jobs are running, as commented in the other bug, you will need to specify some prefix (whether building RPMs or compiling with configure) to make sure the new installation goes into a unique directory and does not overwrite the previous one. We are taking your concerns into account, though.
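
A minimal sketch of what that could look like when building from source (the paths here are only an example):

./configure --prefix=/opt/slurm/22.05.4 --sysconfdir=/etc/slurm
make -j
make install

With each version in its own directory, an already-running slurmstepd keeps loading the plugins it was started with instead of having them replaced underneath it.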

Let me know if you have any other doubts/concerns.

Kind regards,
Oscar
Comment 6 Kilian Cavalotti 2022-09-30 10:17:24 MDT
Hi Oscar,

(In reply to Oscar Hernández from comment #5)
> Thanks for the provided logs. I have been able to confirm your problem is
> the same as in bug 14981. 

Thanks for double checking, I had little doubt it was something else.

> Last comment in that bug has a good overall
> explanation of the scenario. But I can clarify any other question you might
> have. 

Thanks, I have read the explanation, and I understand the mechanism that now causes the issue. But the rationale is not clear: why introduce a strict check now that was never deemed necessary before, and what existing problem does it solve? Because it does *not* improve RPC security at all: if attackers want to inject rogue RPCs, they can still easily fake version numbers so the hash will match. That check doesn't help at all in that regard.

Beyond this, the explanation about how an existing binary should not be replaced, sorry to say, does not make any sense at all. Replacing existing binaries is the very way distribution packages work, and things like OpenSSH (I can't think of anything with a stronger security focus) don't have any problem when their binaries are upgraded in place. You can update OpenSSH packages on a host, and already-running sshd processes will continue to work, existing connections will not break, etc. You'll see this in ps: an existing process pointing to a binary that has been deleted:

/proc/31813/exe -> /usr/sbin/sshd;63370b7b\ (deleted)

but that's not a problem at all, SSH continues to work, and new connections will use the new binary.

Replacing existing binaries while they're in use is literally how distribution packages operate, and they have been doing so for decades. So did Slurm, until 22.05.

> For now, in order to recover the nodes, you will need to manually kill the
> slurmstepd processes (only the ones from stuck jobs) in them.

Yes, and this is a problem as well: we have job execution times ranging from a few minutes to several weeks, so this issue means that we'll have to track all the jobs that started with the previous version and manually kill their slurmstepd when they end. We had to put an emergency cron job in place to do that, but this is obviously not sustainable, it will go on for weeks, and it generates very confusing messages for our users, since slurmstepd fills their jobs' output files with endless error messages like this:

slurmstepd: error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
slurmstepd: error: cannot create hash context for K12
slurmstepd: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
slurmstepd: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
slurmstepd: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
slurmstepd: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
slurmstepd: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error

We already had to field a number of support tickets about this.


> Then, to successfully complete the upgrade procedure using system RPMs, you
> will need to make sure there are no running jobs on the node when you
> upgrade its slurmd (drain the nodes). A good way of achieving this is by
> scheduling various maintenance reservations, one for each set of jobs you
> want to upgrade. 

This is also not a viable solution for us. We don't expect to have to drain nodes and stop jobs to upgrade Slurm to a new minor version. We never had to, and I suspect many sites do the same: rolling upgrades of minor versions while jobs are running.

Having to drain nodes for this, with multi-week jobs, would have a significant impact on our compute capacity, which is not something our users are willing to accommodate. So we'll need a way to continue deploying minor upgrades without having to take a full downtime or even drain nodes.

Fortunately, you don't have to close your existing SSH connections to a machine, or wait for them to terminate, before upgrading the OpenSSH server on that host. So you shouldn't have to drain nodes and wait for jobs to terminate in order to update Slurm either.


> At least for the current version. If you still want to perform the upgrade
> while jobs are running, as commented in the other bug, you will need to
> specify some prefix (whether building RPMs or compiling with configure) to
> make sure new installation is in a unique directory, not overriding the
> previous one. 

I'm a little worried about this in the context of building RPMs, because it would break the fundamental rule of distribution packages: files should be installed in system directories, not in version-dependent locations. Nobody does that, and neither should Slurm: the `slurm-slurmd` RPM package installs slurmd and slurmstepd in /usr/sbin, and there's no reasonable way to make it install them anywhere else, let alone in version-dependent directories.

> We are taking your concerns into account though.

Thanks! 

We actually have multiple problems here:

1. jobs stuck in CG and nodes draining: all of our 15,000+ running jobs that were started before the upgrade will end up being stuck in CG when they end, and bring down nodes with them as those will become unavailable for other jobs. 
We currently have to manually kill slurmstepds as jobs terminate, but this is not really sustainable.

2. make sure the next version upgrade doesn't end up being the hot mess that this one is: 22.05 in its current form basically takes away the ability to do rolling upgrades while jobs are running. And this is a MAJOR regression. We need to keep the ability to do minor version upgrades, using distribution packages, while jobs are running.

What we would appreciate right now is a patch that we can deploy to relax the version check and let old slurmstepds continue to work with a newer slurmd, as was the case before.

That will address both issues listed above, resolve the stuck-in-completing problem, and also stop polluting users' logs with slurmstepd error messages.

> Let me know if you have any other doubts/concerns.

Introducing such a breaking change in the upgrade procedure without even a word of warning in the release notes (or anywhere, really) is not great. It's been generating a lot of questions from our users and extra work for our support team, and it wreaked havoc in our cluster for no good reason. So we need a quick way to recover from that.

Sorry if that message sounds harsh, but it's probably clear that we're not very happy right now. :\


Thanks!
--
Kilian
Comment 7 griznog 2022-09-30 10:50:35 MDT
I'd like to add a "me too" here: changing the upgrade procedure for slurmd from:

clush -w @all_nodes systemctl restart slurmd

to:

drain nodes and restart slurmd as old jobs finish or track down old slurmstepd processes and manually kill them ...

is a major regression on a cluster where OverTimeLimit=UNLIMITED.

It is fine and expected to require stopping jobs for a major version upgrade, but minor version upgrades should be, what's the word I'm looking for ..., oh right, minor ;) 

griznog
Comment 8 Jason Booth 2022-09-30 11:05:58 MDT
So, this behavior is expected in the sense that the check is there to prevent internal corruption, but the impact on rolling upgrades was an oversight on our end when we hardened Slurm recently.

https://github.com/SchedMD/slurm/commit/40099fb60c8ff343137c018d24d43d5e72fe293c

As a security update for the 22.05 release, we added hashes to RPCs, which means that each component now needs to load "hash_k12.so". The drawback is that slurmstepd also has to load this library. Previously it did not, but now we force a version match, so a slurmstepd that is still running the previous version during a rolling upgrade will fail to load the plugin.
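
For reference, on a typical RPM-based install the plugin in question is a shared object in the Slurm plugin directory (the exact path may differ per distribution):

ls -l /usr/lib64/slurm/hash_k12.so

An in-place upgrade replaces that file, and an old slurmstepd that later tries to load it fails the version check ("Incompatible plugin version").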

This causes the issue when sites perform a rolling upgrade that replaces the old Slurm paths.

Unfortunately, this means that an in-place upgrade with steps still running from a previous 22.05 maintenance release will fail.

We do plan to fix this in a future 22.05 release, but for now, care should be taken when using RPMs and upgrading between minor versions of 22.05.

We do apologize for the unfortunate oversight.
Comment 9 Kilian Cavalotti 2022-09-30 11:12:25 MDT
Hi Jason, 

(In reply to Jason Booth from comment #8)
> So, this behavior is expected in the sense that the check is there to
> prevent internal corruption, but the impact on rolling upgrades was an
> oversight on our end when we hardened Slurm recently.
>
> https://github.com/SchedMD/slurm/commit/40099fb60c8ff343137c018d24d43d5e72fe293c
> 
> As a security update for the 22.05 release, we added hashes to RPCs, which
> means that each component now needs to load "hash_k12.so". The drawback is
> that slurmstepd also has to load this library. Previously it did not, but
> now we force a version match, so a slurmstepd that is still running the
> previous version during a rolling upgrade will fail to load the plugin.
>
> This causes the issue when sites perform a rolling upgrade that replaces
> the old Slurm paths.

Thanks for providing the context for that change.

> Unfortunately, this means that an in-place upgrade with steps still running
> from a previous 22.05 maintenance release will fail.
>
> We do plan to fix this in a future 22.05 release, but for now, care should
> be taken when using RPMs and upgrading between minor versions of 22.05.

That's good to hear (and reassuring!), but what do we do right now?

Can we envision a local patch that would relax the version check in slurmd, so that it can accept connections from existing slurmstepd processes? Or does the error come from the already-running slurmstepd?

> We do apologize for the unfortunate oversight.

Appreciate this, thank you!

Cheers,
--
Kilian
Comment 10 griznog 2022-09-30 11:15:18 MDT
(In reply to Jason Booth from comment #8)
> We do plan to fix this in a future 22.05 release, but for now, care should
> be taken when using RPMs and upgrading between minor versions of 22.05.
> 

Thanks for the explanation, Jason. Am I correct that even if I wait for 22.05.5 (or later) I'm still going to need to stop jobs to update from 22.05.3? Or will a future 22.05.x relax this so I can wait for that release and do the normal rolling update?

Since we have effectively no time limits on jobs, we try to only do major updates with a month or more of advance warning and a planned outage, so I want to minimize the number of those to the extent possible.
Comment 11 Jason Booth 2022-09-30 13:37:49 MDT
Thank you both for your feedback. We will be making a general announcement shortly with more details.
Comment 12 Paul Edmon 2022-10-01 06:33:00 MDT
This also breaks our ability to upgrade here, as we have used RPMs to install Slurm and do upgrades for years. When we did the upgrade from 22.05.2 we also saw the COMPLETING storm and were forced to do a rolling reboot of all our nodes. Like Kilian, we have zero ability to fully drain our cluster in a timely manner, so this means that we won't be upgrading Slurm, unless there is a major bug, until this is fixed, which is hardly optimal from a security and stability point of view.

Until this is fixed we are sticking with 22.05.3 and not upgrading. When a patch is put together for this, I would appreciate it if the upgrade paths from all 22.05 releases to the fixed version were tested, to make sure that no COMPLETING jobs are generated by the upgrade, or, if they are, that this is the last time they will be.
Comment 13 Oscar Hernández 2022-10-11 11:11:57 MDT
Hi,

As mentioned in bug 14981, the upgrade issue has been fixed in commit 57057885db6b2bd30a9de6e439155d7d73a91669, which should come along with the 22.05.5 release this week.

The downside is that, since the bug is inherent to a slurmstepd that is already running during the upgrade, you should expect the CG problems one more time.

So, making it clear:

- Upgrades from 22.05.1-4 to 22.05.5-onward will have the issue with stuck CG jobs.
- Upgrades from 22.05.5 onward will be OK.
- Upgrades from older versions than 22.05.1 to 22.05.5-onward will be OK.

We apologize for the inconvenience this might have caused.

Kind regards,
Oscar
Comment 14 Oscar Hernández 2022-10-12 04:58:30 MDT
 
> - Upgrades from 22.05.1-4 to 22.05.5-onward will have the issue with stuck
> CG jobs.
> - Upgrades from 22.05.5 onward will be OK.
> - Upgrades from older versions than 22.05.1 to 22.05.5-onward will be OK.

Just to avoid possible confusion: all previous 22.05.1 references should actually read 22.05.0 (the first 22.05 release). My apologies for the error.

I am resolving this bug. Please reopen if there is anything else related.

Thanks a lot for the reports and feedback,
Oscar