Ticket 3799 - Slurm with Open MPI
Summary: Slurm with Open MPI
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling (show other tickets)
Version: 17.02.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-05-11 09:59 MDT by Hadrian
Modified: 2017-06-21 16:21 MDT

See Also:
Site: Case


Description Hadrian 2017-05-11 09:59:02 MDT
Hi,

We have a couple of questions about implementing OpenMPI with Slurm 17.02.2.
 
1) We use "rpmbuild -ta --with pmix /usr/local/src/slurm/slurm-17.02.2.tar.bz2" hoping to enable the pmix plugin, but it does not show up as a plugin in the installed Slurm. "srun --mpi=pmix" says "Couldn't find the specified plugin name for mpi/pmix"
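For reference, a quick way to check whether the plugin was actually built and installed is the following sketch. The --with-pmix path and the plugin directory below are assumptions; point them at your actual PMIx install and Slurm prefix:

```shell
# List the MPI plugin types this Slurm build installed;
# "pmix" should appear here if the plugin was built:
srun --mpi=list

# Rebuild, passing an explicit PMIx location through the spec file
# (the /usr/local/pmix path is an assumption):
rpmbuild -ta --with pmix \
    --define '_with_pmix --with-pmix=/usr/local/pmix' \
    /usr/local/src/slurm/slurm-17.02.2.tar.bz2

# Confirm the plugin file exists in the installed plugin directory
# (path varies with the install prefix):
ls /usr/lib64/slurm/mpi_pmix*.so
```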

2) We have an issue running Rmpi (an R package) under Slurm. This is a rather long story, so we describe it in detail below; hopefully you can provide a solution.
The R script (slurm_test.r) we are trying to run is quite simple: it has a loop with three iterations, and each iteration calls a function (foo) that applies a certain operation using threads by creating and then stopping a "cluster".

Here are the results obtained when testing Rmpi with Slurm and OpenMPI. Running

sbatch slurm_test_MPI.slurm

as is, i.e., 2 nodes, 24 tasks with 1 CPU per task, produces the output in slurm_test_MPI_01.log.

The error is the following:

"There are not enough slots available in the system to satisfy the 23 slots
that were requested by the application:
  /usr/local/gcc-6_3_0/openmpi-2_0_1/R/3.3.3/lib64/R/bin/Rscript

Either request fewer slots for your application, or make more slots available
for use."

It is clear that we are using fewer tasks than the number requested, but the job still would not run.
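For context, a spawn-based Rmpi batch script typically follows a pattern like this sketch. The actual slurm_test_MPI.slurm is not shown in the ticket, so the directives and the spawn arrangement below are assumptions, not the reporter's script:

```shell
#!/bin/bash
# Hypothetical sketch of a slurm_test_MPI.slurm-style script.
# Rmpi's mpi.spawn.Rslaves() asks Open MPI for one slot per slave,
# so the spawn count must stay within the Slurm allocation.
#SBATCH --nodes=2
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1

# Launch a single master rank and let Rmpi spawn the remaining
# slaves (ntasks - 1 = 23) inside the allocation. If Open MPI is
# not picking up the Slurm allocation, the spawn fails with a
# "not enough slots" error like the one quoted below.
mpirun -np 1 Rscript slurm_test.r
```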

The second case is when we request 1 node, 12 tasks, and 1 CPU per task.
The output is the one shown in slurm_test_MPI_02.log. (The file shows that the job was cancelled because I cancelled it.) To summarize the output: the script runs and goes through the first iteration of the loop, but then it hangs. We know that the script gets stuck at line 83 when calling

stopCluster(cl=clus.rep)

These problems do not happen when using SOCKET as an option.

We do not know what is wrong.


Thanks!
Comment 1 Tim Wickberg 2017-05-11 15:49:14 MDT
(In reply to Hadrian from comment #0)
> Hi,
> 
> We have a couple of questions about implementing OpenMPI with Slurm 17.02.2.
>  
> 1) We use "rpmbuild -ta --with pmix
> /usr/local/src/slurm/slurm-17.02.2.tar.bz2" hoping to enable pmix plugin,
> but it does not show up as a plugin on the installed slurm. "srun
> --mpi=pmix" would say "Couldn't find the specified plugin name for mpi/pmix"

Do you have the PMIx development headers available on the system, and in a location that would be detected automatically? I believe you can pass a location in as an argument there to add it to the search path.

There should be a warning in the rpmbuild logs indicating that it could not find PMIx, and is not building those modules.

> 2) We have issue running Rmpi (R package) on slurm, this is a rather long
> story so hopefully you can understand it better. Hopefully you can provide a
> solution.
> The R script (slurm_test.r) we are trying to run is quite simple. The script
> has a loop with three iterations. Each iteration calls a function (foo) that
> will apply a certain operation using threads by creating and stopping a
> "cluster".

R integration falls outside the scope of what we support; it's unclear to me where the problem is here but I do not believe it's within Slurm.

My best guess is that the MPI implementation it is using under the covers is not being made aware of the resources that have been assigned to the job. Exactly how to ensure that MPI stack knows about the additional CPUs and nodes available is something you'll need to investigate further.
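A quick sanity check along these lines (a sketch, not from the ticket) is to compare what Slurm exports into the job environment with what Open MPI believes is available:

```shell
# Inspect the allocation-describing variables Slurm exports into a
# job step; Open MPI's Slurm support reads these to size the job:
srun --nodes=1 --ntasks=12 env | \
    grep -E 'SLURM_(NNODES|NTASKS|CPUS_ON_NODE|JOB_NODELIST)'

# Show Open MPI's resource-allocation (ras) component settings, to
# confirm the Slurm component is present and being used:
ompi_info --param ras all
```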

- Tim
Comment 2 Tim Wickberg 2017-05-22 13:41:54 MDT
Hey Hadrian -

Were you able to get things up and running correctly? I was expecting an answer to my questions from comment 1; if you've sorted this out please let me know if I can close this out.

- Tim

Comment 3 Tim Wickberg 2017-06-21 16:21:22 MDT
Marking this closed as 'timedout'. Please reopen if you'd like to continue to discuss this, and please respond to the questions from comment 1.

- Tim
