Bug 5097 - Possible conflicting definitions of MAX_MSG_SIZE
Summary: Possible conflicting definitions of MAX_MSG_SIZE
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database (show other bugs)
Version: 17.11.5
Hardware: Linux Linux
: --- 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-24 23:58 MDT by Christopher Samuel
Modified: 2019-11-22 01:17 MST (History)
1 user (show)

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Christopher Samuel 2018-04-24 23:58:51 MDT
Hi there,

On the list, during a discussion of a case where "sacctmgr show runaway" would break because there were too many runaway jobs it appears the cause might be the fact that the message size (MAX_MSG_SIZE) is not large enough.

I had a look at the source code and was confused because the reported value was:

#define MAX_MSG_SIZE     (16*1024*1024)

but when I was searching the source code I was seeing:

#define MAX_MSG_SIZE     (1024*1024*1024)

Then I realised that's because it's defined differently in two different header files (src/common/slurm_persist_conn.c and src/common/slurm_protocol_socket_implementation.c).  The second one was last changed back in 2013 in 2.6.3 (commit f1a49c114fc9c5b10c016b0dd0fb4b1c30f3344a) whilst the first definition only appeared in 2016 in commit 1fea57350f0a486c34742248106f4b8edb772bcb.

Digging further revealed there are 4 different definitions of that symbol:

$ git grep 'define MAX_MSG_SIZE'
src/common/slurm_persist_conn.c:#define MAX_MSG_SIZE     (16*1024*1024)
src/common/slurm_protocol_socket_implementation.c:#define MAX_MSG_SIZE     (1024*1024*1024)
src/plugins/mpi/pmix/pmixp_debug.h:#define MAX_MSG_SIZE 1024
src/slurmdbd/rpc_mgr.c:#define MAX_MSG_SIZE     (16*1024*1024)

It's not clear whether having different values for the same symbol is intentional or accidental, but it does appear that the lower value for the slurmdbd comms can cause a problem that can only be resolved by manually delving into the database.

If it's accidental it's probably best to try and make the value consistent and only define it in one place.

If it is intentional it's probably helpful to change the names of the symbols to properly reflect their specific use cases (though I recognise the PMIX one might be outside of your control).

All the best,
Chris
Comment 1 Tim Wickberg 2018-04-25 00:40:54 MDT
> I had a look at the source code and was confused because the reported value
> was:
> 
> #define MAX_MSG_SIZE     (16*1024*1024)
> 
> but when I was searching the source code I was seeing:
> 
> #define MAX_MSG_SIZE     (1024*1024*1024)
> 
> Then I realised that's because it's defined differently in two different
> header files (src/common/slurm_persist_conn.c and
> src/common/slurm_protocol_socket_implementation.c).  The second one was last
> changed back in 2013 in 2.6.3 (commit
> f1a49c114fc9c5b10c016b0dd0fb4b1c30f3344a) whilst the first definition only
> appeared in 2016 in commit 1fea57350f0a486c34742248106f4b8edb772bcb.

a) Those aren't header files... headers end in .h :)

> Digging further revealed there are 4 different definitions of that symbol:
> 
> $ git grep 'define MAX_MSG_SIZE'
> src/common/slurm_persist_conn.c:#define MAX_MSG_SIZE     (16*1024*1024)
> src/common/slurm_protocol_socket_implementation.c:#define MAX_MSG_SIZE    
> (1024*1024*1024)
> src/plugins/mpi/pmix/pmixp_debug.h:#define MAX_MSG_SIZE 1024
> src/slurmdbd/rpc_mgr.c:#define MAX_MSG_SIZE     (16*1024*1024)
> 
> It's not clear whether having different values for the same symbol is
> intentional or accidental, but it does appear that the lower value for the
> slurmdbd comms can cause a problem that can only be resolved by manually
> delving into the database.

b) It's important to note that these are preprocessor macros, and not symbols. Symbols have scope, and appear in the compiled code. The macros, OTOH, only have scope within their respective source file (unless that source file is included elsewhere - these, as source files and not headers, are not included elsewhere).

> If it's accidental it's probably best to try and make the value consistent
> and only define it in one place.

Fair point. It does look like the imbalance between the slurm_persist_conn.c and 
slurm_protocol_socket_implementation.c definitions could cause problems.

> If it is intentional it's probably helpful to change the names of the
> symbols to properly reflect their specific use cases (though I recognize the
> PMIX one might be outside of your control).

I'll look into it further.

In the meantime, commit 093a40b3f00f99c5acf842d7d663ec2cc32526db removes the (unused) definition from src/slurmdbd/rpc_mgr.c.

But - just to be clear - this isn't an issue you're seeing at Swinburne currently, correct? Unfortunately, I cannot indirectly support the list through your Swinburne support contract, and will need to drop the severity on this to ~ Sev. 5.

- Tim
Comment 2 Christopher Samuel 2018-04-25 01:40:28 MDT
On Wednesday, 25 April 2018 4:40:54 PM AEST bugs@schedmd.com wrote:

> a) Those aren't header files... headers end in .h :)

Sigh, I knew I needed more coffee today! :-(

> Fair point. It does look like the imbalance between the slurm_persist_conn.c
> and slurm_protocol_socket_implementation.c definitions could cause problems.

Thanks.

> In the meantime, commit 093a40b3f00f99c5acf842d7d663ec2cc32526db removes the
> (unused) definition from src/slurmdbd/rpc_mgr.c.

Oh handy, well that's one down. :-)

> But - just to be clear - this isn't an issue you're seeing at Swinburne
> currently, correct? Unfortunately, I cannot indirectly support the list
> through your Swinburne support contract, and will need to drop the severity
> on this to ~ Sev. 5.

Correct (I wanted to be up front about why I was raising this).

Though should it happen to us I'd hate to be in a position the Chris on the 
list was in where they couldn't start any more jobs until they'd done some 
MySQL hacking first.   That's a scary situation to be in.

All the best,
Chris
Comment 3 Tim Wickberg 2018-04-25 10:08:44 MDT
> Though should it happen to us I'd hate to be in a position the Chris on the 
> list was in where they couldn't start any more jobs until they'd done some 
> MySQL hacking first.   That's a scary situation to be in.

Skimming through that thread, there are two separate issues - the main one is that he'd overwhelmed slurmdbd and lost accounting records, the second which you're reporting here (and would not lead to an inability to run further records) is solely in sacctmgr when trying to clean up the results of that data loss.

Deferring this to Sev5 for now.

- Tim