Summary: | Possible conflicting definitions of MAX_MSG_SIZE | ||
---|---|---|---|
Product: | Slurm | Reporter: | Christopher Samuel <chris> |
Component: | Database | Assignee: | Tim Wickberg <tim> |
Status: | OPEN --- | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | bart |
Version: | 17.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8033 | ||
Site: | Swinburne | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Christopher Samuel
2018-04-24 23:58:51 MDT
> I had a look at the source code and was confused because the reported value > was: > > #define MAX_MSG_SIZE (16*1024*1024) > > but when I was searching the source code I was seeing: > > #define MAX_MSG_SIZE (1024*1024*1024) > > Then I realised that's because it's defined differently in two different > header files (src/common/slurm_persist_conn.c and > src/common/slurm_protocol_socket_implementation.c). The second one was last > changed back in 2013 in 2.6.3 (commit > f1a49c114fc9c5b10c016b0dd0fb4b1c30f3344a) whilst the first definition only > appeared in 2016 in commit 1fea57350f0a486c34742248106f4b8edb772bcb. a) Those aren't header files... headers end in .h :) > Digging further revealed there are 4 different definitions of that symbol: > > $ git grep 'define MAX_MSG_SIZE' > src/common/slurm_persist_conn.c:#define MAX_MSG_SIZE (16*1024*1024) > src/common/slurm_protocol_socket_implementation.c:#define MAX_MSG_SIZE > (1024*1024*1024) > src/plugins/mpi/pmix/pmixp_debug.h:#define MAX_MSG_SIZE 1024 > src/slurmdbd/rpc_mgr.c:#define MAX_MSG_SIZE (16*1024*1024) > > It's not clear whether having different values for the same symbol is > intentional or accidental, but it does appear that the lower value for the > slurmdbd comms can cause a problem that can only be resolved by manually > delving into the database. b) It's important to note that these are preprocessor macros, and not symbols. Symbols have scope, and appear in the compiled code. The macros, OTOH, only have scope within their respective source file (unless that source file is included elsewhere - these, as source files and not headers, are not included elsewhere). > If it's accidental it's probably best to try and make the value consistent > and only define it in one place. Fair point. It does look like the imbalance between the slurm_persist_conn.c and slurm_protocol_socket_implementation.c definitions could cause problems. > If it is intentional it's probably helpful to change the names of the > symbols to properly reflect their specific use cases (though I recognize the > PMIX one might be outside of your control). I'll look into it further. In the meantime, commit 093a40b3f00f99c5acf842d7d663ec2cc32526db removes the (unused) definition from src/slurmdbd/rpc_mgr.c. But - just to be clear - this isn't an issue you're seeing at Swinburne currently, correct? Unfortunately, I cannot indirectly support the list through your Swinburne support contract, and will need to drop the severity on this to ~ Sev. 5. - Tim On Wednesday, 25 April 2018 4:40:54 PM AEST bugs@schedmd.com wrote: > a) Those aren't header files... headers end in .h :) Sigh, I knew I needed more coffee today! :-( > Fair point. It does look like the imbalance between the slurm_persist_conn.c > and slurm_protocol_socket_implementation.c definitions could cause problems. Thanks. > In the meantime, commit 093a40b3f00f99c5acf842d7d663ec2cc32526db removes the > (unused) definition from src/slurmdbd/rpc_mgr.c. Oh handy, well that's one down. :-) > But - just to be clear - this isn't an issue you're seeing at Swinburne > currently, correct? Unfortunately, I cannot indirectly support the list > through your Swinburne support contract, and will need to drop the severity > on this to ~ Sev. 5. Correct (I wanted to be up front about why I was raising this). Though should it happen to us I'd hate to be in a position the Chris on the list was in where they couldn't start any more jobs until they'd done some MySQL hacking first. That's a scary situation to be in. All the best, Chris > Though should it happen to us I'd hate to be in a position the Chris on the
> list was in where they couldn't start any more jobs until they'd done some
> MySQL hacking first. That's a scary situation to be in.
Skimming through that thread, there are two separate issues - the main one is that he'd overwhelmed slurmdbd and lost accounting records, the second which you're reporting here (and would not lead to an inability to run further records) is solely in sacctmgr when trying to clean up the results of that data loss.
Deferring this to Sev5 for now.
- Tim
|