Ticket 6727 - MPI programs hang when called with srun
Summary: MPI programs hang when called with srun
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 18.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Duplicates: 6661
Depends on:
Blocks:
 
Reported: 2019-03-19 17:45 MDT by Michael Robbert
Modified: 2020-08-03 14:13 MDT
CC List: 1 user

See Also:
Site: Colorado School of Mines
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configuration (3.50 KB, text/plain)
2019-03-19 17:45 MDT, Michael Robbert
Source for test program that reproduces the problem. (1.01 KB, text/x-csrc)
2019-03-20 11:00 MDT, Michael Robbert
Submit script to reproduce problem (236 bytes, text/x-sh)
2019-03-20 11:01 MDT, Michael Robbert
Output of run using libpmi2.so (15.30 KB, text/plain)
2019-03-20 11:10 MDT, Michael Robbert
Output from run using libpmi.so (6.97 KB, text/plain)
2019-03-20 11:11 MDT, Michael Robbert

Description Michael Robbert 2019-03-19 17:45:03 MDT
Created attachment 9629 [details]
Slurm configuration

We have had recent reports of jobs hanging, and upon investigation it appears to be a problem with Intel MPI 2019, but the reason I'm checking with you guys is that the problem does not happen if the tasks are started with mpirun rather than srun. Can you tell us what might be set up differently between the two job launch programs?
Here are a few more details in case they help:
The problem doesn't happen all the time, but I have been able to reproduce it reliably if I run a small program that repeatedly does MPI broadcasts. The other contributing factor is that the job needs more than 32 tasks per node. With 32 tasks per node or fewer I can't get the problem to happen, but with 33 tasks per node it happens reliably given enough MPI broadcast calls.
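For reference, the actual do_bcast.c is only attached, not inlined in this ticket; a minimal sketch of that kind of reproducer (illustrative only, not the attachment itself; the buffer size and iteration count here are arbitrary) would look roughly like:

/* do_bcast.c (illustrative sketch, not the actual attachment):
 * repeatedly broadcast a buffer from rank 0 so that a hang in the
 * collective shows up after enough iterations.
 * Build:  mpiicc -o do_bcast do_bcast.c
 * Run:    srun --ntasks-per-node=33 ./do_bcast   (example; >32 tasks/node)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 100000; i++) {
        if (rank == 0)
            buf[0] = i;                          /* rank 0 fills the buffer */
        MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0 && i % 10000 == 0)
            printf("size=%d iteration=%d\n", size, i);
    }

    MPI_Finalize();
    return 0;
}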
Comment 1 Felip Moll 2019-03-20 10:53:41 MDT
(In reply to Michael Robbert from comment #0)
> Created attachment 9629 [details]
> Slurm configuration
> 
> We have had recent reports of jobs hanging and upon investigation it appears
> to be a problem Intel MPI 2019, but the reason I'm checking with you guys is
> that the problem does not happen if the tasks are started with mpirun rather
> than srun. Can you tell us what might be getting setup differently between
> the 2 job launch programs?
> Here are a few more details in case it helps:
> The problem doesn't happen all the time, but I have been able to reproduce
> it reliably if I run a small program that repeatedly does MPI broadcasts.
> The other mitigating factor is that the job needs to have more than 32 tasks
> per node. If I run 32 tasks per node or less I can't get the problem to
> happen, but 33 tasks per node and it happens reliably given enough MPI
> broadcast calls.

Hi Michael,

Can you show me the exact command lines you use when submitting the job?
Which PMI implementation are you using? pmi2?
Is I_MPI_PMI_LIBRARY set correctly, pointing to Slurm's pmi2 library?
Comment 2 Michael Robbert 2019-03-20 11:00:47 MDT
Created attachment 9641 [details]
Source for test program that reproduces the problem.
Comment 3 Michael Robbert 2019-03-20 11:01:19 MDT
Created attachment 9642 [details]
Submit script to reproduce problem
Comment 4 Felip Moll 2019-03-20 11:04:56 MDT
Michael,

Please try to set:

export I_MPI_PMI_LIBRARY=<path_to_slurm_libs>/libpmi2.so

and try again.
Comment 5 Michael Robbert 2019-03-20 11:09:44 MDT
I have just uploaded the source for my test program along with the submit script that I'm submitting with sbatch. The program is compiled with:
mpiicc -o do_bcast do_bcast.c
I have loaded our modules that configure my environment to use the Intel 2019 compilers and the Intel MPI libraries that ship with it.

Now to answer your question, most of my tests have been done without setting the I_MPI_PMI_LIBRARY variable. If I point it to libpmi2.so the program fails during mpi init. If I point to libpmi.so the program runs with the same problem, but includes a warning that reads:
MPI startup(): I_MPI_PMI_LIBRARY environment variable is not supported.

I have checked the documentation and I can't find any reference that says that variable is deprecated, but it is absent from the 2019 documentation. 
I'll upload the output from those 2 most recent tests shortly.
Comment 6 Michael Robbert 2019-03-20 11:10:50 MDT
Created attachment 9645 [details]
Output of run using libpmi2.so
Comment 7 Michael Robbert 2019-03-20 11:11:15 MDT
Created attachment 9646 [details]
Output from run using libpmi.so
Comment 8 Felip Moll 2019-03-21 04:21:56 MDT
Hi Michael,

I did a couple of tests, is the issue still happening if you set:

export I_MPI_PMI_LIBRARY=<path_to_slurm_libs>/libpmi2.so
export I_MPI_PMI2=yes

in your submission script?
Comment 9 Michael Robbert 2019-03-21 08:29:55 MDT
Adding I_MPI_PMI2=yes doesn't appear to change anything; MPI_Init still fails.
I do want to make sure that it is clear that if I don't add any PMI variables most programs run fine. 
The only time that we have seen a problem is when we are using all of the following:
1. Start tasks with srun rather than mpirun
2. Program uses Intel MPI 2019
3. Greater than 32 tasks per node
4. A significant number of calls to mpi_bcast

I think that I will go ahead and attempt to open a ticket with Intel and see what they have to say. 
I do have one other question for you. If their answer is "use mpirun", then what capabilities/features of Slurm might we be missing out on because we're not using srun (e.g. process tracking, memory tracking, job cleanup, any accounting data)? Or should they be 100% equivalent?
The documentation states that srun is preferred, but doesn't clearly state the technical reasons why.

Thanks,
Mike
Comment 10 Felip Moll 2019-03-21 09:19:54 MDT
(In reply to Michael Robbert from comment #9)
> Adding I_MPI_PMI2=yes doesn't appear to change anything, MPI_INIT still
> fails. 

I was suggesting this because in my tests, if I don't set I_MPI_PMI2 then I receive:

.....
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1cf7940
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2137)......: PMI_Init returned -1
.........

But I see this is not the same error than yours, which is:

............
MPID_Init(663).......: PMI_Init returned -1
Attempting to use an MPI routine before initializing MPICH
Attempting to use an MPI routine before initializing MPICH
Abort(570127) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(639):
........... 


But when I run it with the variable set I don't see the errors.


> I do want to make sure that it is clear that if I don't add any PMI
> variables most programs run fine. 

Got it.

> The only time that we have seen a problem is when we are using all of the
> following:
> 1. Start tasks with srun rather than mpirun
> 2. Program uses Intel MPI 2019
> 3. Greater than 32 tasks per node
> 4. A significant number of calls to mpi_bcast
 
I will need more time to set up a server with IMPI 2019 that can run more than 32 tasks per node. I will tell you the results once I have it.


> I think that I will go ahead and attempt to open a ticket with Intel and see
> what they have to say. 

I was going to recommend this.

> I do have one other question for you. If their answer is "use mpirun" then
> what capabilities/features of Slurm might we be missing out on because we're
> not using srun? i.e. process tracking, memory tracking, job cleanup, any
> accounting data

Well, with mpirun a task is not created individually for each process, which means that Slurm cannot control each task individually; accounting and memory tracking happen at the job level instead. This means that you will not be able to see accounting data for each task, nor limit memory per task. As for job cleanup, you also lose the task prolog/epilog. Moreover, mpirun has extra overhead since it has to start its own PMI process manager, whereas srun uses Slurm's PMI directly.

Whenever possible I highly recommend using srun.

Have you tried with Intel 2018 or older?

Please, if you make Intel's bug public, would you mind linking it here?
Comment 11 Michael Robbert 2019-03-21 09:40:00 MDT
We have tried Intel 2018. We have not been able to reproduce the bug in that version. Also, the I_MPI_PMI* variables appear to work as documented. I will note however that even if I don't set those variables srun works fine.
I do not see a way to make the Intel bug public, but if I find a way I will add a link to it.

Mike
Comment 12 Felip Moll 2019-03-21 10:49:47 MDT
(In reply to Michael Robbert from comment #11)
> We have tried Intel 2018. We have not been able to reproduce the bug in that
> version. Also, the I_MPI_PMI* variables appear to work as documented. I will
> note however that even if I don't set those variables srun works fine.
> I do not see a way to make the Intel bug public, but if I find a way I will
> add a link to it.
> 
> Mike

Good.

I don't know if you load the mpivars.sh or other modulefiles before running the job. These modules may set the variables.

I don't know what may have changed with Intel 2019, but it would be good to know in case we have to modify Slurm's pmi2 plugin.

Thanks.
Comment 13 Michael Robbert 2019-03-27 11:37:29 MDT
We updated from Intel's 2019.1 release to 2019.3 and I've been unable to reproduce the problem with that version of Intel MPI. When I told that to Intel this was their response:

Yes there were some SLURM related issues that we had in the initial IMPI 2019 releases that have been fixed with update 3.

I don't see this mentioned explicitly in their release notes so I guess it falls under "bug fixes". 

I will note that even with this version srun appears to work perfectly well without setting I_MPI_PMI_LIBRARY. It also works if I point that variable to libpmi.so, but it does not work if I point it to libpmi2.so, with or without also setting I_MPI_PMI2=yes.

Mike
Comment 14 Felip Moll 2019-03-28 04:16:26 MDT
(In reply to Michael Robbert from comment #13)
> We updated from Intel's 2019.1 release to 2019.3 and I've been unable to
> reproduce the problem with that version of Intel MPI. When I told that to
> Intel this was their response:
> 
> Yes there were some SLURM related issues that we had in the initial IMPI
> 2019 releases that have been fixed with update 3.
> 

That's good information, thank you.


> I don't see this mentioned explicitly in their release notes so I guess it
> falls under "bug fixes". 
> 
> I will note that even with this version srun appears to work perfectly well
> without setting I_MPI_PMI_LIBRARY. It will also work if I point that
> variable to libpmi.so, but it does not work if I point it to libpmi2.so with
> or without also setting I_MPI_PMI2=yes
> 
> Mike

That's strange, see release notes: https://scc.ustc.edu.cn/zlsc/tc4600/intel/2017.0.098/mpi/Release_Notes.txt

There they add "PMI-2 protocol support (I_MPI_PMI2).".

There are also Intel customer bugs related to pmi2:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/737623

Maybe you can also ask about that one. Are you getting the same error as before?
Can you try 'export PMI_DEBUG=1' in your batch script?
Comment 15 Michael Robbert 2019-03-28 09:56:16 MDT
I've done a little more testing today and there are clear differences between Intel MPI 2018 and 2019.
With 2018 the I_MPI_PMI2 variable works and when used in conjunction with the libpmi2.so library I can start jobs with srun.
All I do is change that script to load the modules for Intel 2019 and the same script fails. 
Removing I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so allows the same script to run fine again. The Slurm debugging output indicates that it is using pmi2.
I_PMI_DEBUG doesn't appear to do anything. I think that the user in the forum post you found just made that up. 

Do you think that it is not doing "the right thing" if I just leave those variables off and use srun? 

Mike
Comment 16 Felip Moll 2019-03-28 10:50:19 MDT
(In reply to Michael Robbert from comment #15)
> I've done a little more testing today and there are clear differences
> between Intel MPI from 2018 and 2019. 
> With 2018 the I_MPI_PMI2 variable works and when used in conjunction with
> the libpmi2.so library I can start jobs with srun.

That's what I tested some days/weeks ago and what I really expected.

> All I do is change that script to load the modules for Intel 2019 and the
> same script fails. 
> Removing the I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so allows the same script
> to run find again. The Slurm debugging output indicates that it is using
> pmi2.

If you remove I_MPI_PMI_LIBRARY, I guess Intel will use its internal PMI library based on MPICH2, which may be compatible with PMI-2, but it seems it is no longer compatible with our libpmi2.so.

> I_PMI_DEBUG doesn't appear to do anything. I think that the user in the
> forum post you found just made that up. 

Yep. I think there must be some Intel variable to debug the problem. In any case, are the errors the same as the ones you showed initially in this bug?
 
> Do you think that it is not doing "the right thing" if I just leave those
> variables off and use srun? 

You can do this, but without knowing what Intel is exactly doing I cannot give you a reliable response here. 

At this point, and since 2018 is working fine, it is worth asking Intel about any related changes and seeing whether we or (most probably) they have to make changes to fix things.

Any info that you may get from Intel is appreciated.
Comment 17 Michael Robbert 2019-03-28 11:38:34 MDT
Before I go and open another bug report with Intel, I'd like to understand what I'm losing if I use their PMI library vs. Slurm's.
Additionally, what do I gain if I use Slurm's PMI-2 library vs. Slurm's PMI-1 library?

Further, I'll point out that there is no documentation on the Slurm web site telling users how to use Slurm's PMI-2 with Intel MPI. It suggests pointing to libpmi.so, which does still work.
Comment 18 Felip Moll 2019-03-29 02:25:20 MDT
(In reply to Michael Robbert from comment #17)
> Before I go an open another bug report with Intel I'd like to understand
> what I'm losing if I use their PMI library vs. if I use Slurm's? 

I cannot answer that, since I do not know what their PMI library does internally. If internally it works as a PMI-1 library, then there are some important performance issues to take into account. If they implement PMI-2 and it is fully compatible with the standard, I guess you don't lose anything.

Intel's PMI is based on MPICH, but I don't have the details; that is something we would have to ask Intel in order to compare with our library.

> Additionally, what do gain if I use Slurm's PMI-2 library vs. Slurm's PMI-1
> library?

Honestly I am not experienced in user-level PMI usage, but what I can say is that PMI-1 is deprecated in favour of PMI-2, which introduced many new features, especially oriented towards performance.

There are several papers and documents you can read to see these changes, i.e. https://www.mcs.anl.gov/papers/P1760.pdf

For example:

- PMI-1 lacks query functionality: it provides only a simple key-value database that processes can put values into and get values from, and it does not allow sharing information between MPI processes. For example, on multicore and multiprocessor systems each MPI process must contact every other process to determine which processes reside on the same SMP node, which is extremely inefficient.
  PMI-2 improves on that with the new concept of job attributes, which allows passing system-specific information to MPI processes (see the sketch below).
- PMI-1 uses a flat key-value database, which means an MPI process cannot restrict the scope of a key; all information is global and retrieval cannot be optimized. PMI-2 addresses this issue.
- PMI-1 is not thread safe, so the MPI implementation must protect calls to PMI-1.
etc. etc.

I started reading through these papers, which are really interesting. Mainly it seems PMI-2 improves PMI-1 performance a lot.
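(For illustration only, and not something from the original discussion: below is a rough, untested sketch of the PMI-2 job-attribute query mentioned above, using the PMI-2 client API as exposed by MPICH/Slurm's pmi2.h. The attribute name PMI_process_mapping is the one Slurm's pmi2 plugin publishes; with PMI-1 there is no such query interface, only generic PMI_KVS_Put/PMI_KVS_Get calls.)

/* Untested sketch: ask the process manager for a job attribute via PMI-2.
 * Build (against Slurm's PMI-2 library; paths may vary):
 *   cc -o pmi2_attr pmi2_attr.c -lpmi2
 * Run under Slurm, e.g.:  srun --mpi=pmi2 ./pmi2_attr
 */
#include <stdio.h>
#include <pmi2.h>

int main(void)
{
    int spawned, size, rank, appnum, found;
    char mapping[1024];

    PMI2_Init(&spawned, &size, &rank, &appnum);

    /* Job attribute describing which ranks share a node, so processes
     * do not have to discover that by contacting each other (the
     * inefficiency PMI-1 suffers from). */
    if (PMI2_Info_GetJobAttr("PMI_process_mapping", mapping,
                             sizeof(mapping), &found) == PMI2_SUCCESS && found)
        printf("rank %d of %d: PMI_process_mapping = %s\n", rank, size, mapping);

    PMI2_Finalize();
    return 0;
}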


> Further I'll point out that there is no documentation on the Slum web site
> telling users how to use Slurm's PMI-2 with Intel MPI. It suggests pointing
> to libpmi.so which does still work.

Correct, we already have an internal bug open to address and update the documentation, but we are still trying to understand why Intel MPI does not work in some situations like the one in your bug. We were about to document it all, since the 2018 version, as you experienced, works perfectly, but after seeing that Intel 2019 does not work correctly we stopped until we know more about these issues :)
Comment 19 Felip Moll 2019-04-01 09:07:28 MDT
(In reply to Michael Robbert from comment #17)
> Before I go an open another bug report with Intel I'd like to understand
> what I'm losing if I use their PMI library vs. if I use Slurm's? 
> Additionally, what do gain if I use Slurm's PMI-2 library vs. Slurm's PMI-1
> library?
> 
> Further I'll point out that there is no documentation on the Slum web site
> telling users how to use Slurm's PMI-2 with Intel MPI. It suggests pointing
> to libpmi.so which does still work.

Hi Michael, what do you think? Do you have enough background now to open a new question with Intel?
Comment 20 Michael Robbert 2019-04-01 09:26:42 MDT
I have submitted a new ticket with Intel asking about the change in behavior and if using Slurm's PMI-2 library should still work. We'll see what they say.
Comment 21 Michael Robbert 2019-04-05 13:51:37 MDT
I finally got a response from Intel today, but it is a little disappointing:

=======================
According to engineering I_MPI_PMI_LIBRARY is still supported (see "impi_info -a") however the status of I_MPI_PMI2 is not clear yet (deprecated or may be show up again in version 2020).

 

Overall the same things are done in Intel MPI with PMI-1 and PMI-2. There should no be enhanced functionality by moving to PMI-2. The suggestion is to use ordinary libpmi.so. Or please provide more details why PMI-2 is necessary in your setup.
========================

In our setup I believe that they are probably correct. I don't think that we'll ever launch a job large enough that a user would notice a performance difference in startup time. If you have a larger customer that could demonstrate a real performance difference then they might have a case to present to Intel, but I don't think that we're going to notice a difference.

Any thoughts?
Comment 22 Felip Moll 2019-04-09 04:41:17 MDT
(In reply to Michael Robbert from comment #21)
> I finally got a response from Intel today, but it is a little disappointing:
> 
> =======================
> According to engineering I_MPI_PMI_LIBRARY is still supported (see
> "impi_info -a") however the status of I_MPI_PMI2 is not clear yet
> (deprecated or may be show up again in version 2020).

To me it is not a very good response. They clearly state that PMI2 has been supported since 2017 in the NEWS files. What I don't know is what "status of I_MPI_PMI2" means; is it the default?

Re-reading again https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/737623 it seems they have some issues with libpmi2.so.


> 
> Overall the same things are done in Intel MPI with PMI-1 and PMI-2. There
> should no be enhanced functionality by moving to PMI-2. 

Does that mean they are not really using PMI-2 features? That would be an important issue to me. I'd like to see the reasoning behind that.

> The suggestion is to
> use ordinary libpmi.so. Or please provide more details why PMI-2 is
> necessary in your setup.
> ========================
> 
> In our setup I believe that they are probably correct. I don't think that
> we'll ever launch a job large enough that a user would notice a performance
> difference in startup time. If you have a larger customer that could
> demonstrate a real performance difference then they might have a case to
> present to Intel, but I don't think that we're going to notice a difference.
> 
> Any thoughts?

If you're fine with your setup and don't think the PMI-1 vs PMI-2 improvements bring any benefit, then keep libpmi.so. But I still find Intel's support response not very clear.
Comment 23 Felip Moll 2019-04-10 06:05:00 MDT
(In reply to Felip Moll from comment #22)
> (In reply to Michael Robbert from comment #21)
> > I finally got a response from Intel today, but it is a little disappointing:
> > 
> > =======================
> > According to engineering I_MPI_PMI_LIBRARY is still supported (see
> > "impi_info -a") however the status of I_MPI_PMI2 is not clear yet
> > (deprecated or may be show up again in version 2020).
> 
> To me it is not a very good response. They clearly state that PMI2 is
> supported since 2017 in the NEWS files. Which I don't know is what does it
> min "status of I_MPI_PMI2", is it the default? 

See https://scc.ustc.edu.cn/zlsc/tc4600/intel/2017.0.098/mpi/Release_Notes.txt
Comment 24 Michael Robbert 2019-04-10 09:13:57 MDT
Why are you pointing me to read a copy of Intel's documentation that is hosted at a Chinese supercomputing site? I have read Intel's own documentation and believe you that Intel MPI has supported PMI-2 since version 2017. I also agree with you that it is not good that they appear to have a regression in 2019.  
I have opened a ticket at your request, and Intel has not closed that ticket, so I assume they are still looking into it. For now, I am satisfied using PMI-1, because our cluster does not run jobs large enough, or otherwise require any PMI-2 features, at this time.
Comment 25 Felip Moll 2019-04-11 06:36:58 MDT
(In reply to Michael Robbert from comment #24)
> Why are you pointing me to read a copy of Intel's documentation that is
> hosted at a Chinese supercomputing site?

I copy-pasted the wrong URL.

https://software.intel.com/sites/default/files/managed/ca/c0/intelmpi-2017-releasenotes-linux.pdf


> I have read Intel's own
> documentation and believe you that Intel MPI has supported PMI-2 since
> version 2017.

I just wanted to add documentation to the bug for further reading, but I had too many tabs open and copied the wrong one.

> I also agree with you that it is not good that they appear to
> have a regression in 2019.  
> I have opened a ticket upon your request and Intel has not closed that
> ticket, so I assume they are still looking into it.

Ok, let's see what they respond. I really appreciate that you opened an issue with Intel; if having to manage this is a problem for you and takes extra time on your side, I can try to contact Intel myself, but given that you were directly affected by this I thought it was appropriate.

> For now though, I am satisfied using PMI-1 for now because our cluster does not run large enough
> jobs or otherwise require any features for PMI-2 at this time.

Ok, that's fine.
Comment 26 Michael Robbert 2019-04-26 09:04:15 MDT
I just got this response from Intel:

Sorry, but actually PMI-2 is not supported in Intel MPI 2019.

It sounds like they deprecated that feature without telling anybody. Like I've said before, I don't think that I have a use case where PMI-1 is not sufficient, but I'd be interested in knowing what larger sites like TACC or NERSC think about this. Do you think it would be a good idea to bring this up on the mailing list?

Mike
Comment 27 Nate Rini 2019-04-26 16:03:58 MDT
(In reply to Michael Robbert from comment #26)
> I just got this response from Intel:
> 
> Sorry, but actually PMI-2 is not supported in Intel MPI 2019.

Mike,

Is this a public statement (with a URL) or just in your Intel support ticket?

Thanks,
--Nate
Comment 28 Michael Robbert 2019-04-26 16:09:16 MDT
This statement was just in my support ticket with Intel. No documentation or reference was reported.
Comment 31 Felip Moll 2019-04-27 06:03:31 MDT
(In reply to Michael Robbert from comment #26)
> I just got this response from Intel:
> 
> Sorry, but actually PMI-2 is not supported in Intel MPI 2019.
> 
> It sounds like they deprecated that feature without telling anybody. Like
> I've said before I don't think that I have a use case where PMI-1 is not
> sufficient, but I'd be interested in knowing what larger sites like TACC or
> NERSC think about this. Do you think it would be a good idea to bring this
> up on the mailing list?
> 
> Mike

Mike, I think it is a good idea to bring it to the mailing list; feel free to do so. If you have any way to communicate with the Intel community, such as the Intel forums, it would be good to post it there as well.

I was also wondering whether it would make sense to request that Intel include this information in their documentation. That would 'force' them to respond in a more official way. What do you think? Can you turn your bug into a request to change the Intel docs? Do you know if it is possible to open a public bug for this issue with Intel?

Thanks for your time on this.
Comment 32 Felip Moll 2019-06-05 04:01:02 MDT
Hi Mike,

This is just some info for you: we are trying to contact Intel about this issue to see if we can get an "official" response on the status of PMI-2.

I am wondering if you have any more news from your Intel bug.
Comment 33 Michael Robbert 2019-06-05 10:15:15 MDT
I asked Intel, in my ticket, to provide public documentation for this change, but have gotten no response from them since I asked for that over a month ago.
Hopefully you'll be able to make more headway.
Comment 34 Felip Moll 2019-06-06 03:45:42 MDT
(In reply to Michael Robbert from comment #33)
> I asked Intel, in my ticket, to provide public documentation for this
> change, but have gotten no response from them since I asked for that over a
> month ago.
> Hopefully you'll be able to make more headway.

Ok, I am waiting for a response too.
If you don't mind, I'll keep this ticket open until I get one.

Thanks
Comment 36 Felip Moll 2019-07-01 03:50:31 MDT
Hi Michael,

I talked with Intel and they confirmed to me that this feature (PMI-2 support) is still not included in Intel 2019. The plan is to include it, but they have given priority to other items first.

I am pushing them to include the feature ASAP. They told me they will add a release note about this limitation.

I will let you know when I have more info.
Comment 37 Felip Moll 2019-08-06 12:20:41 MDT
The latest information from Intel is that Intel MPI 2018 and older do include support for PMI-2, but Intel MPI 2019 completely refactored the library and support for PMI-2 is still under development.

At this point they have indicated to me that they will add a Release Note at https://software.intel.com/en-us/articles/intel-mpi-library-release-notes, but they didn't specify whether it will be added once support is included again, or before that to state that it is not ready yet.

I asked them to include the note ASAP, since I would have preferred an official public response, but I have had no reply on that so far.

I cannot do anything more on this issue, so I am closing the bug.

For the record and future searches:
Intel MPI 2019 Initial release, Update 1, Update 2, Update 3, Update 4 do not support PMI-2.

Thanks
Felip
Comment 38 Nate Rini 2019-08-07 14:28:15 MDT
*** Ticket 6661 has been marked as a duplicate of this ticket. ***
Comment 39 Felip Moll 2020-01-30 09:36:25 MST
It's been a while, but Intel has finally added the Release Note; unfortunately it does not fix the issue, but it does document the lack of PMI-2 support:

https://software.intel.com/en-us/articles/intel-mpi-library-release-notes-linux

Known Issues and Limitations
...
Intel® MPI Library 2019 Update 5
    ...
    The following features have not yet been implemented:
        ...
        PMI-2 support (please use PMI-1 until PMI-2 is not implemented)
Comment 40 Felip Moll 2020-08-03 14:13:45 MDT
For the record:

Intel recently added support for PMI2 again in their Intel MPI 2019 Update 7.

See release notes in Intel website.