Created attachment 23320 [details]
hello world output

Hi,

We just upgraded our Grace cluster to 21.08.5 and have encountered a new issue with "mpirun". When a job is submitted to multiple cores on multiple nodes, all the cores on the first node are allocated properly, but on every other node all of the tasks for that node are executed on only a single core. srun seems to be operating correctly (each task gets its own core).

We have clusters running 21.08.2 and 21.08.4, and on both of those clusters each task gets its own core regardless of "mpirun" vs "srun". I have also tested multiple versions of OpenMPI with the same results (2.1.2, 3.1.1 and 4.0.5).

I've attached the output of a simple hello world that demonstrates this.

Thanks,
Kaylea
Created attachment 23321 [details]
slurm conf
Looking through the release notes for 21.08.6, this sounds like our problem:

> Fix affinity of the batch step if batch host is different than the first
> node in the allocation.

Do you have a patch for this that we can apply?
I think it is more likely you are running into a different set of changes. There are also some changes starting in 20.11 with --exclusive and srun. Here is a summary of the changes:

For 20.11.3+, the default behavior for srun within an allocation is:
* Exclusive access to all resources it requests (srun --exclusive)
* All the resources of the job on the node (srun --whole)

These can be overridden by:
* srun --overlap (srun steps can overlap each other)
* srun --exact (only use exactly the CPU resources requested)

You can use --overlap if your jobs need steps to overlap CPUs.
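To make those flag combinations concrete, here is a minimal sketch of a batch script exercising them (the step binaries `step_a` and `step_b` are placeholders, not programs from this ticket):

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4

# 20.11.3+ default: each srun step gets exclusive access to the CPUs
# it requests, so two concurrent steps will not share cores.

# --overlap lets concurrent steps share the same CPUs:
srun --overlap -n 4 ./step_a &
srun --overlap -n 4 ./step_b &
wait

# --exact limits a step to exactly the CPUs its tasks request,
# instead of the job's whole share of each node:
srun --exact -n 2 ./step_a
```

This is a cluster-side job script, so the exact partition and node counts would need to match your site.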
Hi Jason,

I tried export SLURM_OVERLAP=1 but am still seeing that on the non-batch-host nodes, all of the tasks are being pinned to one CPU. Things look fine on the batch host (one task per CPU).

Kaylea
Here is my submission script:

#!/bin/bash
#SBATCH -J 2020b
#SBATCH -p day
#SBATCH -N 3
#SBATCH --ntasks-per-node=10

export SLURM_OVERLAP=1

module load OpenMPI/4.0.5-GCC-10.2.0
module list

echo 'format: Hello from <node>:<cpu_id>, <task_number>'

echo "### with srun ###"
srun ./c_eb_ompi_405_gcc_ucx

echo "### with mpirun ###"
mpirun ./c_eb_ompi_405_gcc_ucx
(In reply to Kaylea Nelson from comment #3)
> Looking through the release notes for 21.08.6, this sounds like our problem:
>
> Fix affinity of the batch step if batch host is different than the first
> node in the allocation.
>
> Do you have a patch for this that we can apply?

Let me get a patch for you to try that will apply cleanly to 21.08.5.

Can you give us some more information on a job that fails?

scontrol show job <job-id>

Let's see if the batch host appears to be different than the first node of the allocation for all these jobs.
Created attachment 23335 [details]
21.05.5 v1

Ok, v1 is the patch that corresponds with commit 538420ad4d5 in 21.08.6 (Fix affinity of the batch step if batch host is different than the first node in the allocation) - but without the NEWS entry.

Can you give that a try and see if it fixes things?

Thanks!
-Michael
JobId=50789647 JobName=2020b
   UserId=kln26(11135) GroupId=support(11133) MCS_label=N/A
   Priority=49718 Nice=0 Account=admins QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:07 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-02-07T17:11:17 EligibleTime=2022-02-07T17:11:17
   AccrueTime=2022-02-07T17:11:17
   StartTime=2022-02-07T17:13:48 EndTime=2022-02-07T18:13:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-07T17:13:48 Scheduler=Backfill
   Partition=admintest AllocNode:Sid=grace2:194801
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p08r01n44,p09r07n[11,24]
   BatchHost=p08r01n44
   NumNodes=3 NumCPUs=30 NumTasks=30 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=30,mem=150G,node=3
   Socks/Node=* NtasksPerN:B:S:C=10:0:*:1 CoreSpec=*
   JOB_GRES=(null)
     Nodes=p08r01n44 CPU_IDs=18-27 Mem=51200 GRES=
     Nodes=p09r07n11 CPU_IDs=0-5,12-15 Mem=51200 GRES=
     Nodes=p09r07n24 CPU_IDs=0,2,4-10,33 Mem=51200 GRES=
   MinCPUsNode=10 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/vast/palmer/home.grace/kln26/mpi_tests/test_for_schedmd.sub
   WorkDir=/vast/palmer/home.grace/kln26/mpi_tests
   StdErr=/vast/palmer/home.grace/kln26/mpi_tests/slurm-50789647.out
   StdIn=/dev/null
   StdOut=/vast/palmer/home.grace/kln26/mpi_tests/slurm-50789647.out
   Power=
(In reply to Kaylea Nelson from comment #0)
> We just upgraded our Grace cluster to 21.08.5 and have encountered a new
> issue with "mpirun"...
> We have clusters running 21.08.2 and 21.08.4 and on both of those clusters,
> each task gets its own core regardless of "mpirun" vs "srun".

Are you saying that this issue does not show up on 21.08.[2,4] but it does on 21.08.5?
And what was the Grace cluster running before you upgraded it to 21.08.5? 21.08.4? 20.11?
Can you attach relevant snippets of the slurmctld log and slurmd logs for each of the nodes for job 50789647?
If v1 does not work, could you try adding an explicit task count to the submission script (`-n30`) to see if that fixes the issue with mpirun?
We installed 21.08.5 on our test cluster and have been able to reproduce the issue there. We tried deploying the v1 patch, but the problem persists. I also tried -n 32 on that cluster (2 nodes, 16 cores each) and the problem persists. I'm going to upload the pertinent snippets for a representative job.

We were running 20.11.8 on Grace prior to the upgrade last week. I tried the same test on two other of our clusters with 21.08.2 and 21.08.4, and there were no affinity issues on the non-batch-host nodes (every task got its own CPU on every node).
Created attachment 23361 [details]
slurm_out_job18
Created attachment 23362 [details]
slurmd and slurmctld logs for job 18
Created attachment 23365 [details]
slurm conf for the test cluster
(In reply to Kaylea Nelson from comment #15)
> We installed 21.08.5 on our test cluster and have been able to reproduce the
> issue there.
> ...
> I tried the same test on two other of our clusters with 21.08.2 and 21.08.4
> and there were no affinity issues on the non-batch host nodes (every task
> got its own cpu on every node).

Could you downgrade your test cluster to 21.08.4 and confirm that the issue goes away? If so, that will help narrow down the issue. If not, then there must be some difference between this cluster and your other clusters that is to blame.
We rolled the test cluster back to 21.08.4 and the issue went away. I'll upload relevant logs in case they are of use.
Created attachment 23368 [details]
output for 21.08.4 successful job
Created attachment 23369 [details]
slurmd slurmctld 21.08.4 successful job
Upon closer inspection, lesser problems continue with 21.08.4 on the test cluster, but they are a bit different. Some CPUs are reused for multiple tasks, but a larger range of CPUs is used. Oddly, this new issue is no longer limited to the non-batch host; we are seeing reused CPUs on both nodes now.
More updates. Our production cluster running 21.08.4 is operating completely correctly regarding mpirun.

We have found it is actually very easy to see the issue using the --report-bindings flag on mpirun. When working properly, everything is bound. With 21.08.5 on Grace or the test cluster, the batch host has bindings but the non-batch-host node has none.

We are looking into why our test cluster on .4 is behaving differently from the production cluster running .4.
We downgraded 2 compute nodes to 21.08.4 on the Grace cluster (the one originally exhibiting the issue) and *the issue went away*. We left the server and database on 21.08.5 for this test (since they are still serving the rest of the nodes, which are also still on 21.08.5). So we feel pretty confident that this is an issue introduced in 21.08.5.

# Using mpirun clients/server 21.08.5/21.08.5
[p08r02n21.grace.hpc.yale.internal:40919] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][]
[p08r02n21.grace.hpc.yale.internal:40919] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n24.grace.hpc.yale.internal:170940] MCW rank 3 is not bound (or bound to all available processors)
[p08r02n24.grace.hpc.yale.internal:170939] MCW rank 2 is not bound (or bound to all available processors)
p08r02n21.grace.hpc.yale.internal core 2 (1/4)
p08r02n21.grace.hpc.yale.internal core 0 (0/4)
p08r02n24.grace.hpc.yale.internal core 0 (3/4)
p08r02n24.grace.hpc.yale.internal core 0 (2/4)

# Using mpirun clients/server 21.08.4/21.08.5
[aam233@grace1:~/test/mpi] cat slurm-50894200.out
[p08r02n21.grace.hpc.yale.internal:42623] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n21.grace.hpc.yale.internal:42623] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][]
[p08r02n24.grace.hpc.yale.internal:172625] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n24.grace.hpc.yale.internal:172625] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B][]
p08r02n21.grace.hpc.yale.internal core 2 (1/4)
p08r02n21.grace.hpc.yale.internal core 0 (0/4)
p08r02n24.grace.hpc.yale.internal core 0 (2/4)
p08r02n24.grace.hpc.yale.internal core 2 (3/4)
The most likely candidate for the difference between 21.08.4 and 21.08.5 is commit https://github.com/SchedMD/slurm/commit/6e13352fc2:

"Fix srun -c and --threads-per-core imply --exact

"In 21.08, srun --cpus-per-task and --threads-per-core were supposed to imply --exact. This worked when that new behavior was introduced, but was regressed by commit 5154ed2 before 21.08 was released. That commit did not handle setting --exact in step_req if -c was also set."

The issue fixed by this commit affected 21.08.0 - 21.08.4, but not 20.11.3+.

...Except your problem jobs in 21.08.5 don't seem to use any of these options (or do they?). But perhaps OpenMPI uses these options internally when calling srun.

Would you be willing to test 21.08.5 with commit 6e13352fc2 reverted, to see if that is the main difference?

In the meantime, I will work on creating a local reproducer with OpenMPI 4.0.5. Do you have other flavors of MPI that you can also test to see if they are affected?
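If it helps, producing a test build with that one commit reverted might look roughly like this (a sketch only; it assumes you build from a git checkout of the slurm-21-08-5-1 tag, and the configure options are placeholders for your site's usual ones):

```shell
# Check out the 21.08.5 release tag and revert only the suspect commit.
git clone --branch slurm-21-08-5-1 https://github.com/SchedMD/slurm.git
cd slurm
git revert --no-edit 6e13352fc2

# Rebuild with your site's normal options; --prefix here is illustrative.
./configure --prefix=/opt/slurm-test
make -j "$(nproc)" && make install
```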
What mechanism does c_eb_ompi_405_gcc_ucx use to get the CPU id? MPI_Get_processor_name()?
I use sched_getcpu(). I've attached the program.
Created attachment 23391 [details]
mpi hello world
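As a cross-check that doesn't depend on MPI at all, a tiny shell script can report similar affinity information per task (a sketch; `show_affinity.sh` is a hypothetical name, and SLURM_PROCID is only set inside a job step):

```shell
#!/bin/bash
# show_affinity.sh - print host, task rank, and the CPUs this process may run on.
# Cpus_allowed_list in /proc/self/status is the affinity mask that values
# returned by sched_getcpu() are drawn from.
cpus=$(grep Cpus_allowed_list /proc/self/status | awk '{print $2}')
echo "$(hostname) task ${SLURM_PROCID:-0}: cpus ${cpus}"
```

Launched as, e.g., `srun -N3 --ntasks-per-node=10 ./show_affinity.sh`, a healthy step shows each task with its own CPU list, while the bug in this ticket shows every task on a non-batch host reporting the same single CPU.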
OpenMPI 4.0.5 seems to be fixed when we revert commit 6e13352fc2. Thanks!

However, we are still having problems with our other OpenMPIs (2.1.2 and 3.1.1). It's binding each rank to every core on one of the sockets.

- If every core assigned on a particular node is on a single socket, --report-bindings reports:

MCW rank 4 is not bound (or bound to all available processors)

- If the cores assigned span two sockets, --report-bindings reports bindings, such as:

[c19n01.farnam.hpc.yale.internal:28584] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[c19n01.farnam.hpc.yale.internal:28584] MCW rank 13 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]

If I try to force "--bind-to core", I get the following error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        p08r01n27
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

This behavior also happens on our cluster running 21.08.2.

Any ideas?
(In reply to Kaylea Nelson from comment #31)
> OpenMPI 4.0.5 seems to be fixed when we revert commit 6e13352fc2. Thanks!

Great!

> However, we are still having problems with our other OpenMPIs (2.1.2 and
> 3.1.1). It's binding each rank to every core on one of the sockets.
> ...
> This behavior also happens on our cluster running 21.08.2.
>
> Any ideas?

Did this same issue happen on 20.11.3+?
Can you also give us more information on:
* what PMI[x] libraries Slurm is configured with
* whether you are using a job_submit plugin, and if so, whether it changes --threads-per-core, --cpus-per-task, or --hint=nomultithread
To answer your questions:
- We use PMI2.
- We have SLURM_HINT=nomultithread set on Grace. None of the others are set in job_submit or the default environment. However, we see all of the behavior below also on our clusters without that hint.

I did some experimenting and I think the discrepancy between OpenMPI 3.1.1 vs 4.0.5 has to do with mpirun's bind-to and map-by.

- If I run any version of OpenMPI outside of Slurm (just ssh to a node), mpirun uses bind-to numa, map-by numa. This makes sense because this is the default behavior for OpenMPI if you have more than two tasks.

- If I run inside sbatch with simple parameters of -N1 -n <anything greater than 2>, they start to deviate. What I don't know is which one is the expected behavior inside Slurm.
  -- 3.1.1 remains at numa/numa
  -- 4.0.5 becomes core/core

- If I try to force `--bind-to core` (in an attempt to make 3.1.1 behave more like 4.0.5):
  -- 3.1.1:
     a) on a job with cores all on one socket, this runs as core/numa
     b) on a job with cores that span two sockets, this fails with the error in my previous comment
  -- 4.0.5: remains core/core, since it was already binding this way

Ways I have gotten 3.1.1 with `--bind-to core` across two sockets to work (b above):
-- if I request every core on the node, 3.1.1b works (runs successfully as core/numa)
-- if I set "-c", 3.1.1b works (runs successfully as core/numa)

So my questions are:
- Why does OpenMPI 4.0.5 mpirun run as bind-to core, map-by core under Slurm but the older versions don't? Which is the expected behavior?
- Why does --bind-to core fail across multiple sockets with older versions of OpenMPI unless I set -c 1 or request all of the cores on the node?

Thanks!
(In reply to Kaylea Nelson from comment #35)
> So my questions are:
> - Why does OpenMPI 4.0.5 mpirun run as bind-to core, map-by core under Slurm
> but the older versions don't? Which way is the expected behavior?
>
> - Why does --bind-to core fail across multiple sockets with older versions
> of OpenMPI unless I set -c1 or request all of the cores on the node?

Scanning OpenMPI's NEWS file, perhaps it has something to do with these changes:

4.0.5 -- August, 2020
---------------------
...
- Disable binding of MPI processes to system resources by Open MPI
  if an application is launched using SLURM's srun command.
...
- Fix a problem with mpirun when the --map-by option is used.
  Thanks to Wenbin Lyu for reporting.

I think the first one is commit https://github.com/open-mpi/ompi/commit/c72f295dfa; I'm not sure which commit the second one is associated with, but it seems less likely to be relevant.

It's hard to tell if the first one applies in the case of an mpirun under sbatch in a Slurm allocation. I'll keep looking into it, but this may be something to report to the OpenMPI guys directly. I would also check to make sure this isn't something fixed in the latest versions of OpenMPI 4.x, but NEWS doesn't seem to show any changes along those lines.

-Michael
(In reply to Kaylea Nelson from comment #35)
> - we have SLURM_HINT=nomultithread set on Grace.

Did you set this for all jobs, and if so, how? This implies --threads-per-core=1, which implies --exact. This might be why reverting commit 6e13352fc2 fixed things for you. See https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/src/common/proc_args.c#L784-L789.

Regarding your other questions - for the life of me, I can't reproduce the single-CPU pinning on the non-batch node with mpirun, even with a true multi-node, multi-socket setup. I'll keep trying, but feel free to add more information or another example of the issue.

I want to first focus on the issue you are seeing with an unmodified 21.08.5; the other issues can be addressed after. Or, if they are distinctly different and need attention, feel free to open up a separate ticket to work on them in parallel. That will prevent muddying the waters here.

Thanks!
-Michael
Ok Kaylea, we are able to reproduce the issue. The key was setting `SLURM_HINT=nomultithread` for the sbatch process (`SLURM_HINT=nomultithread sbatch 13351.batch`). This implies --threads-per-core=1, which implies --exact. Another way to reproduce it is to set `export SLURM_EXACT=1` in the batch script.

I still need to look into why this only happens for mpirun, and not for srun. As for why the batch node is ok with mpirun but the other nodes are not: it appears that mpirun reuses the batch step on the batch node (which has access to the whole node) instead of creating its own new step. This might be a separate issue altogether.

We'll keep looking into it. In the meantime, we recommend using srun instead of mpirun where possible.

Thanks,
-Michael
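For anyone else hitting this, the two reproduction paths above boil down to the following (the batch script name is a placeholder):

```shell
# Path 1: set the hint in the environment of the sbatch invocation itself;
# the batch step inherits it, implying --threads-per-core=1 and thus --exact.
SLURM_HINT=nomultithread sbatch job.batch

# Path 2: request exact CPU matching from inside the batch script
# (place this before any srun/mpirun lines in job.batch):
export SLURM_EXACT=1
```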
*** Ticket 13474 has been marked as a duplicate of this ticket. ***
Thanks for your continued work on this. I'll take up the different behavior between OpenMPI 4 and <4 with OpenMPI and see what they say.

I wonder if the "bind-to core" failure with older OpenMPI versions is related to the initial issue, because it resolves with the use of -c 1. I'll leave it here and can test whether it persists once the primary affinity issue is fixed. If it does, I'll submit it as a separate ticket.
So, we've got a fix pending review that will prevent --threads-per-core from implying --exact. This should fix the OpenMPI issues in 21.08.5 where tasks are pinned to a single CPU.

Let me explain why it was doing that, by comparing vanilla srun vs. mpirun. Let's assume we launch a batch script with SLURM_EXACT=1:

SLURM_EXACT=1 sbatch script.batch

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=10
...
srun <prog>
mpirun <prog>

Since SLURM_EXACT=1 for sbatch, srun inherits that and implicitly does --exact. srun also inherits the node and task counts (-N2, --ntasks-per-node=10). So with --exact, srun gets exactly what it requests - two nodes and 20 tasks' worth of cores. But this just happens to be the same as getting the whole allocation, so there is little difference between --exact and no --exact in this case.

mpirun internally execve()'s something like `srun --ntasks=2 orted ...` (one orted on each node). In this case, there is a *big* difference between --exact and no --exact, because the number of tasks is overwritten by mpirun. So with --exact, srun now only gets 2 tasks' worth of the job allocation. Each orted then only has access to 1 core on which to spawn 10 tasks. This is why there is single-CPU pinning.

I believe this issue will likely affect other MPI libraries besides OpenMPI.
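In other words, under SLURM_EXACT=1 the two launch paths issue very different step requests (a sketch; the orted line is illustrative of what mpirun launches internally, not its exact arguments):

```shell
# User-visible srun: inherits -N2 and --ntasks-per-node=10 from sbatch,
# so --exact still resolves to 20 tasks' worth of cores = the whole allocation.
srun --exact -N2 --ntasks-per-node=10 <prog>

# mpirun's internal launcher: overrides the task count to one orted per node,
# so --exact shrinks the step to 2 cores total (1 per node). Each orted then
# forks 10 MPI ranks inside its single-core affinity mask.
srun --exact --ntasks=2 orted ...
```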
Kaylea,

This has been fixed in the upcoming 21.08.6 release with commits https://github.com/SchedMD/slurm/compare/0f6a61dcf9df...08ad49515fb6. Thanks for bringing this to our attention!

I'm going to go ahead and mark this ticket as resolved. Please open up new tickets for your other MPI issues if you still need us to look at them.

Thanks!
-Michael
Hi Michael,

Great, thanks! Just to clarify: with the fix in 21.08.6, is it still a problem to specify -c X when using mpirun? I ask because I know some of our users run hybrid MPI/OpenMP jobs and that has traditionally been their setup.

Thanks,
Kaylea
*** Ticket 13339 has been marked as a duplicate of this ticket. ***
(In reply to Kaylea Nelson from comment #71)
> Great thanks! Just to clarify, with the fix in 21.08.6 is it still a problem
> to specify -c X when using mpirun? I ask because I know some of our users
> run hybrid MPI/OpenMP jobs and that has been traditionally the setup.

Yes. If you have -c/--cpus-per-task set, this will still imply --exact, and you'll get the same issue whenever you use mpirun.

-Michael
Kaylea, sorry for the commotion, but we have some more updates.

As of 21.08.6 (commit https://github.com/SchedMD/slurm/commit/f6f6b4ff59), -c/--cpus-per-task will no longer imply --exact. --exact is what messes up mpirun and causes the single-CPU pinning. mpirun behavior should now be the same as it was in 21.08.4 and before, regardless of -c or --threads-per-core.

So disregard comment 73: -c can be set for sbatch/salloc without breaking mpirun.

In 22.05, -c will imply --exact, but a -c set by sbatch/salloc will no longer be inherited by srun. This should keep MPI happy and not break existing MPI programs (since mpirun doesn't use -c when calling srun internally).

Thanks!
-Michael
Thanks for the update, that's great news! We look forward to 22.05.
*** Ticket 14751 has been marked as a duplicate of this ticket. ***