Created attachment 10068 [details] Patch to fix the scalability bug in PMIx Hello Yesterday we discovered a performance bug with the scalability of one of the PMIx initialization steps. It is causing a delay in processes launch of multiple minutes order. I'd really like to get it into 19.x. I understand though that seems like a big change. To extensively test this modification we created a standalone POC code: https://github.com/artpol84/poc/tree/master/slurm/pmix_proc_map This code generates 3 types of process-node mappings: by-core, by-node (round-robin) and random. Then we generate the mapping and parse it back to verify the correctness. The new algorithm takes orders of milliseconds to generate mapping for 4K nodes as opposed to minutes with the original one.
Hey Artem - We'll look into this further later this week. 19.05rc1 is due out this afternoon, but this does seem like a good idea to fix in 19.05 if we can. For the O(N^3) calculation - it looks like the successive strlen() calls hiding within the xstrfmtcat() calls are at least part of the issue? I'm wondering if it'd be better to add/modify some of the xstring calls to provide something akin to what the new macro is doing instead so this could be used elsewhere - I'm guessing there are other loops constructing strings that would benefit from similar cleanup as well. - Tim
Hi, Tim In the original implementation I spot the following issues: 1. O(n^2) number of iterations in the algo itself (n - number of nodes) 2. mallocing overhead In the original algorithm call xstrfmtcat for each integer and for each comma (twice per each process) 2.1 xstrfmtcat is calling _xstrdup_vprintf: https://github.com/SchedMD/slurm/blob/master/src/common/xstring.c#L273 I haven't explicitly measured the overhead of this, but it might be pretty substantial. 2.2 makespace called by xstrcat re-allocates in 64-byte chunks. https://github.com/SchedMD/slurm/blob/6a9696214095a4003e28532bb8e84d16e494cd5f/src/common/xstring.c#L64 In my case, it was a MB-level string so required ~15000 reallocs that I believe were the main performance killers. 3. The strcat was inefficient there because of strlen as you pointed out. As string is growing we spend more and more time looking for the end of the string. So I solved those issues in 2 steps: 1. I used service structures to reduce the number of steps required to retrieve required data from O(n^2) down to O(n) with fixed amount of mallocs there. 2. I'm creating a buffer that is enough to hold the final string at the beginning of the algorithm so we have 1 malloc as opposed to 2xNumber-of-procs. BTW in my standalone POC we evaluated the accuracy of this estimation (it prints it out) and it looks that for big systems there will be ~3% of space wasted which is pretty good: ]$ for i in 10 100 1000 10000 100000 1000000 10000000 100000000 1000000000; do echo -e $i: `./main $i 40 core`; done 10: used size=1489, allocated size=1611 (diff=7.57%) 100: used size=18889, allocated size=20101 (diff=6.03%) 1000: used size=228889, allocated size=241001 (diff=5.03%) 10000: used size=2688889, allocated size=2810001 (diff=4.31%) 100000: used size=30888889, allocated size=32100001 (diff=3.77%) 1000000: used size=348888889, allocated size=361000001 (diff=3.35%) 10000000: used size=3888888889, allocated size=4010000001 (diff=3.02%) To be honest when I started re-writing I haven't tried to use xstr... as I assumed that they might not work. I agree we can think about improving the xstr interface as a whole. Though I'd prefer to do it incrementally and merge this patch that I have tested and later replacing sprintf with xstr alternative optimized for what we do here. I also would like to avoid any micro-mallocs on the code path (like _xstrdup_vprintf in xstrfmtcat)
Hi, any progress on this?
(In reply to Artem Polyakov from comment #3) > Hi, any progress on this? Sorry for not responding sooner here. I'm waiting until after 19.05.0 at this point to get a version of this included. I do still plan to introduce a new xstr* function that covers this use case - PMIx isn't the only place in our code that has this type of scalability issue. - Tim
Thank you for the response. I see, that 19.05.0 is out. What is the plan regarding this fix?
Ping
Created attachment 12343 [details] Use xstrfmtcatat() to fix PMIx scalability issue Hi Artem - Apologies for not posting this sooner, I'd apparently misplaced this on a stray branch and never posted it. This uses a new xstring function - xstrfmtcatat() - that we added in response to this string-appending performance issue for excessively large strings. The attached patch set reworks your commit to use this instead of adding a new PMIx-specific string function. If you can test this out and let me know if this addresses the issue I'd appreciate it; I'll land this in an uncoming 19.05 maintenance release if I hear that it's useful. - Tim
Tim, Boris will run my benchmark to make sure that we are getting good performance and I'll let you know.
Tim, we performed some analysis of the performance of the implementation using xstrfmtcatat. You can find the summary here: https://github.com/artpol84/poc/blob/master/slurm/pmix_proc_map/README.md In short, it still adds ~50% overhead compared to my original version. This is high enough from my perspective to stick with the original proposal. Or if you can propose the version of xstrfmtcatat that will take advantage of the prior knowledge that the buffer size is large enough - it'll work as well.
Hey, Tim Any updates? Do you want to have a call to sync interactively?
Tim, we really need this fix in the upcoming Slurm release. Do you have any updates? Should we have a call to speedup the process?
Comment on attachment 12343 [details] Use xstrfmtcatat() to fix PMIx scalability issue commit 9311754e53f94c3721fa1db711cfe1eac70f4242 Author: Artem Y. Polyakov <artpol84@gmail.com> AuthorDate: Tue Jul 16 17:01:34 2019 -0600 PMIx - preallocate the map buffer space. Bug 6932. commit 09f9650b0aeb6288bbaef674494b2d6d5be2d8b7 Author: Artem Y. Polyakov <artpol84@gmail.com> AuthorDate: Tue Jul 16 16:51:15 2019 -0600 PMIx - add pmixp_count_digits_base10() utility function. Bug 6932. commit 7b46dc799c1141a2e648087f29901cb5cb5f7e00 Author: Tim Wickberg <tim@schedmd.com> AuthorDate: Tue Jul 16 16:03:32 2019 -0600 PMIx - use xstrfmtcatat() in _set_mapsinfo(). Avoids repeated strlen() calls against an expanding string, which adds another O(log(n)) factor to the complexity of this loop. Bug 6932.
I've pushed the version I'd proposed to master now. There have been a few other performance improvements to xstrfmtcatat() on master since you last evaluated it which I believe should mostly close the gap. As we've discussed offline, we're unwilling to build specialized string manipulation functions just for PMIx, and need that plugin interface to start converging with the rest of the Slurm style and share more of the common library code. - Tim
Tim, I think that the main optimization was dropped from the ultimate patch. If you'll look into the original patch: https://github.com/artpol84/slurm/commit/dad809c6b630a25572b0838b5afa09953614d6ea There is an important reorganization function: https://github.com/artpol84/slurm/commit/dad809c6b630a25572b0838b5afa09953614d6ea#diff-f3218c2c92dfa384bb498c3e1098388aR369 It allows to build the map by using only one pass over the tasks array as opposed to O(n^2) before. Can we bring this in?
If you want to submit a clean standalone patch for that I'll consider it. The GitHub link you have mixes that change in with a lot of stuff that's either now landed separately, or that I won't consider.
Yes, this is what I'm asking for. I'll post the patch shortly.
Created attachment 12829 [details] Patch to fix the scalability bug in PMIx (v2)
Created attachment 12830 [details] Fix coverity issues in the plugin
Created attachment 12831 [details] Fix coverity issues in the plugin
Created attachment 12832 [details] Fix coverity issues in the plugin
Tim, As we agreed, I've attached a patch that fixes the scalability. I also ran Coverity over the plugin and fixed a few minor issues.
Comment on attachment 12832 [details] Fix coverity issues in the plugin Moved to bug 8392.
Comment on attachment 12829 [details] Patch to fix the scalability bug in PMIx (v2) Thanks Artem. This in on master now (with some style fixes), and will be in 20.02.0pre1 when released this afternoon. commit 9f472af5949970eeed6440f0d356ba7a1c1a61fa Author: Artem Polyakov <artpol84@gmail.com> AuthorDate: Thu Jan 23 17:58:47 2020 -0800 PMIx - improve process map parsing algorithm Reorder ranks according to the node placement to avoid O(n^2) traversal Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Marking closed.
Thank you, Tim This is great, I appreciate it!