Ticket 6932 - Fix pmix plugin: PMIx proc map generation scalability
Summary: Fix pmix plugin: PMIx proc map generation scalability
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx (show other tickets)
Version: 19.05.x
Hardware: Linux Linux
: --- C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-30 11:30 MDT by Artem Polyakov
Modified: 2020-02-03 13:37 MST (History)
0 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Patch to fix the scalability bug in PMIx (6.78 KB, application/mbox)
2019-04-30 11:30 MDT, Artem Polyakov
Details
Use xstrfmtcatat() to fix PMIx scalability issue (5.12 KB, patch)
2019-11-15 19:50 MST, Tim Wickberg
Details | Diff
Patch to fix the scalability bug in PMIx (v2) (3.20 KB, patch)
2020-01-24 08:58 MST, Artem Polyakov
Details | Diff
Fix coverity issues in the plugin (2.15 KB, patch)
2020-01-24 08:59 MST, Artem Polyakov
Details | Diff
Fix coverity issues in the plugin (2.15 KB, application/mbox)
2020-01-24 09:04 MST, Artem Polyakov
Details
Fix coverity issues in the plugin (2.15 KB, patch)
2020-01-24 09:04 MST, Artem Polyakov
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Artem Polyakov 2019-04-30 11:30:09 MDT
Created attachment 10068 [details]
Patch to fix the scalability bug in PMIx

Hello
Yesterday we discovered a performance bug with the scalability of one of the PMIx initialization steps.
It is causing a delay in processes launch of multiple minutes order.

I'd really like to get it into 19.x. I understand though that seems like a big change.

To extensively test this modification we created a standalone POC code:
https://github.com/artpol84/poc/tree/master/slurm/pmix_proc_map

This code generates 3 types of process-node mappings: by-core, by-node (round-robin) and random.
Then we generate the mapping and parse it back to verify the correctness.

The new algorithm takes orders of milliseconds to generate mapping for 4K nodes as opposed to minutes with the original one.
Comment 1 Tim Wickberg 2019-04-30 12:56:39 MDT
Hey Artem -

We'll look into this further later this week. 19.05rc1 is due out this afternoon, but this does seem like a good idea to fix in 19.05 if we can.

For the O(N^3) calculation - it looks like the successive strlen() calls hiding within the xstrfmtcat() calls are at least part of the issue?

I'm wondering if it'd be better to add/modify some of the xstring calls to provide something akin to what the new macro is doing instead so this could be used elsewhere - I'm guessing there are other loops constructing strings that would benefit from similar cleanup as well.

- Tim
Comment 2 Artem Polyakov 2019-04-30 13:33:01 MDT
Hi, Tim

In the original implementation I spot the following issues:
1. O(n^2) number of iterations in the algo itself (n - number of nodes)

2. mallocing overhead
In the original algorithm call xstrfmtcat for each integer and for each comma (twice per each process)
2.1 xstrfmtcat is calling _xstrdup_vprintf:
https://github.com/SchedMD/slurm/blob/master/src/common/xstring.c#L273
I haven't explicitly measured the overhead of this, but it might be pretty substantial.
2.2 makespace called by xstrcat re-allocates in 64-byte chunks.
https://github.com/SchedMD/slurm/blob/6a9696214095a4003e28532bb8e84d16e494cd5f/src/common/xstring.c#L64
In my case, it was a MB-level string so required ~15000 reallocs that I believe were the main performance killers.

3. The strcat was inefficient there because of strlen as you pointed out. As string is growing we spend more and more time looking for the end of the string.

So I solved those issues in 2 steps:
1. I used service structures to reduce the number of steps required to retrieve required data from O(n^2) down to O(n) with fixed amount of mallocs there.
2. I'm creating a buffer that is enough to hold the final string at the beginning of the algorithm so we have 1 malloc as opposed to 2xNumber-of-procs.
BTW in my standalone POC we evaluated the accuracy of this estimation (it prints it out) and it looks that for big systems there will be ~3% of space wasted which is pretty good:
]$ for i in 10 100 1000 10000 100000 1000000 10000000 100000000 1000000000; do echo -e $i: `./main $i 40 core`; done
10: used size=1489, allocated size=1611 (diff=7.57%)
100: used size=18889, allocated size=20101 (diff=6.03%)
1000: used size=228889, allocated size=241001 (diff=5.03%)
10000: used size=2688889, allocated size=2810001 (diff=4.31%)
100000: used size=30888889, allocated size=32100001 (diff=3.77%)
1000000: used size=348888889, allocated size=361000001 (diff=3.35%)
10000000: used size=3888888889, allocated size=4010000001 (diff=3.02%)


To be honest when I started re-writing I haven't tried to use xstr... as I assumed that they might not work.
I agree we can think about improving the xstr interface as a whole.

Though I'd prefer to do it incrementally and merge this patch that I have tested and later replacing sprintf with xstr alternative optimized for what we do here.

I also would like to avoid any micro-mallocs on the code path (like _xstrdup_vprintf in xstrfmtcat)
Comment 3 Artem Polyakov 2019-05-13 09:07:52 MDT
Hi, any progress on this?
Comment 4 Tim Wickberg 2019-05-27 19:20:21 MDT
(In reply to Artem Polyakov from comment #3)
> Hi, any progress on this?

Sorry for not responding sooner here. I'm waiting until after 19.05.0 at this point to get a version of this included.

I do still plan to introduce a new xstr* function that covers this use case - PMIx isn't the only place in our code that has this type of scalability issue.

- Tim
Comment 5 Artem Polyakov 2019-06-05 08:55:48 MDT
Thank you for the response.
I see, that 19.05.0 is out. What is the plan regarding this fix?
Comment 6 Artem Polyakov 2019-06-20 22:38:10 MDT
Ping
Comment 17 Tim Wickberg 2019-11-15 19:50:07 MST
Created attachment 12343 [details]
Use xstrfmtcatat() to fix PMIx scalability issue

Hi Artem -

Apologies for not posting this sooner, I'd apparently misplaced this on a stray branch and never posted it.

This uses a new xstring function - xstrfmtcatat() - that we added in response to this string-appending performance issue for excessively large strings. The attached patch set reworks your commit to use this instead of adding a new PMIx-specific string function.

If you can test this out and let me know if this addresses the issue I'd appreciate it; I'll land this in an uncoming 19.05 maintenance release if I hear that it's useful.

- Tim
Comment 18 Artem Polyakov 2019-11-18 21:10:26 MST
Tim,
Boris will run my benchmark to make sure that we are getting good performance and I'll let you know.
Comment 19 Artem Polyakov 2019-11-30 13:32:41 MST
Tim,

we performed some analysis of the performance of the implementation using xstrfmtcatat. You can find the summary here:
https://github.com/artpol84/poc/blob/master/slurm/pmix_proc_map/README.md

In short, it still adds ~50% overhead compared to my original version.
This is high enough from my perspective to stick with the original proposal.
Or if you can propose the version of xstrfmtcatat that will take advantage of the prior knowledge that the buffer size is large enough - it'll work as well.
Comment 20 Artem Polyakov 2019-12-07 20:47:38 MST
Hey, Tim

Any updates? Do you want to have a call to sync interactively?
Comment 21 Artem Polyakov 2020-01-17 16:57:37 MST
Tim,
we really need this fix in the upcoming Slurm release.
Do you have any updates?
Should we have a call to speedup the process?
Comment 22 Tim Wickberg 2020-01-23 12:45:19 MST
Comment on attachment 12343 [details]
Use xstrfmtcatat() to fix PMIx scalability issue

commit 9311754e53f94c3721fa1db711cfe1eac70f4242
Author:     Artem Y. Polyakov <artpol84@gmail.com>
AuthorDate: Tue Jul 16 17:01:34 2019 -0600

    PMIx - preallocate the map buffer space.
    
    Bug 6932.

commit 09f9650b0aeb6288bbaef674494b2d6d5be2d8b7
Author:     Artem Y. Polyakov <artpol84@gmail.com>
AuthorDate: Tue Jul 16 16:51:15 2019 -0600

    PMIx - add pmixp_count_digits_base10() utility function.
    
    Bug 6932.

commit 7b46dc799c1141a2e648087f29901cb5cb5f7e00
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Tue Jul 16 16:03:32 2019 -0600

    PMIx - use xstrfmtcatat() in _set_mapsinfo().
    
    Avoids repeated strlen() calls against an expanding string, which
    adds another O(log(n)) factor to the complexity of this loop.
    
    Bug 6932.
Comment 23 Tim Wickberg 2020-01-23 12:47:11 MST
I've pushed the version I'd proposed to master now.

There have been a few other performance improvements to xstrfmtcatat() on master since you last evaluated it which I believe should mostly close the gap.

As we've discussed offline, we're unwilling to build specialized string manipulation functions just for PMIx, and need that plugin interface to start converging with the rest of the Slurm style and share more of the common library code.

- Tim
Comment 24 Artem Polyakov 2020-01-23 13:48:16 MST
Tim,
I think that the main optimization was dropped from the ultimate patch.

If you'll look into the original patch:
https://github.com/artpol84/slurm/commit/dad809c6b630a25572b0838b5afa09953614d6ea

There is an important reorganization function:
https://github.com/artpol84/slurm/commit/dad809c6b630a25572b0838b5afa09953614d6ea#diff-f3218c2c92dfa384bb498c3e1098388aR369

It allows to build the map by using only one pass over the tasks array as opposed to O(n^2) before.

Can we bring this in?
Comment 25 Tim Wickberg 2020-01-23 14:26:31 MST
If you want to submit a clean standalone patch for that I'll consider it. The GitHub link you have mixes that change in with a lot of stuff that's either now landed separately, or that I won't consider.
Comment 26 Artem Polyakov 2020-01-23 14:37:35 MST
Yes, this is what I'm asking for.
I'll post the patch shortly.
Comment 27 Artem Polyakov 2020-01-24 08:58:36 MST
Created attachment 12829 [details]
Patch to fix the scalability bug in PMIx (v2)
Comment 28 Artem Polyakov 2020-01-24 08:59:06 MST
Created attachment 12830 [details]
Fix coverity issues in the plugin
Comment 29 Artem Polyakov 2020-01-24 09:04:01 MST
Created attachment 12831 [details]
Fix coverity issues in the plugin
Comment 30 Artem Polyakov 2020-01-24 09:04:23 MST
Created attachment 12832 [details]
Fix coverity issues in the plugin
Comment 31 Artem Polyakov 2020-01-24 09:05:45 MST
Tim,
As we agreed, I've attached a patch that fixes the scalability.
I also ran Coverity over the plugin and fixed a few minor issues.
Comment 32 Tim Wickberg 2020-02-03 09:25:55 MST
Comment on attachment 12832 [details]
Fix coverity issues in the plugin

Moved to bug 8392.
Comment 33 Tim Wickberg 2020-02-03 10:17:41 MST
Comment on attachment 12829 [details]
Patch to fix the scalability bug in PMIx (v2)

Thanks Artem. This in on master now (with some style fixes), and will be in 20.02.0pre1 when released this afternoon.

commit 9f472af5949970eeed6440f0d356ba7a1c1a61fa
Author:     Artem Polyakov <artpol84@gmail.com>
AuthorDate: Thu Jan 23 17:58:47 2020 -0800

    PMIx - improve process map parsing algorithm
    
    Reorder ranks according to the node placement to avoid O(n^2) traversal
    
    Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Comment 34 Tim Wickberg 2020-02-03 10:17:56 MST
Marking closed.
Comment 35 Artem Polyakov 2020-02-03 13:37:11 MST
Thank you, Tim

This is great, I appreciate it!