Bug 1328 - Configuration Options
Summary: Configuration Options
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 14.11.0
Hardware: Linux
Importance: 4 - Minor Issue
Assignee: Brian Christiansen
QA Contact:
Depends on:
Reported: 2014-12-17 09:07 MST by Will French
Modified: 2015-01-07 04:16 MST
2 users

See Also:
Site: Vanderbilt
Alineos Sites: ---
Bull/Atos Sites: ---
Confidential Site: ---
Cray Sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---

slurm.conf (4.12 KB, application/octet-stream)
2014-12-17 09:07 MST, Will French
Updated slurm.conf (4.12 KB, audio/x-mod)
2014-12-18 08:20 MST, Brian Christiansen
Example cgroup.conf (123 bytes, application/octet-stream)
2014-12-18 08:20 MST, Brian Christiansen
whereami.c (2.79 KB, text/x-csrc)
2014-12-29 03:00 MST, Brian Christiansen
Memory eating test programs. (671 bytes, application/x-compressed-tar)
2015-01-06 05:18 MST, Brian Christiansen

Description Will French 2014-12-17 09:07:07 MST
Created attachment 1517 [details]
slurm.conf

We are in the process of transitioning from Torque+Moab to SLURM. We currently have ~60 of our ~700 compute nodes under SLURM management, but we have a few minor issues/questions we are seeking advice on. There seem to be multiple ways to accomplish some of these tasks, so we were hoping you could weigh in and point us to the "best" solutions. I hope I've selected the correct Severity level for this type of ticket. For future reference, please let me know if another level is appropriate.

1. Hardware threads (HTs). Our sense is that HTs improve performance in somewhat limited cases, and in other cases can degrade performance. We also feel that reporting logical rather than physical CPUs might confuse users, so our inclination is to not use HW threads. As our configuration stands, we have SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE and ThreadsPerCore=1 in each of our NodeName lines (slurm.conf attached). We have NOT disabled hyperthreading in the BIOS. Is this an acceptable/optimal way of handling HTs, or would you suggest an alternative? We notice SLURM complaining in the logs but jobs appear to be happy.

2. What is the best way to request certain groups of hardware? For instance, a GPU node, 8-core node, 12-core node, Intel node, AMD node, etc. Is there some sort of feature or attribute we should add to the NodeName line that can then be included in a job script? Or is it better to split these into partitions? 

3. On a related note, we have a small group (8) of 256 GB nodes. We would like these to be available for use with a relatively short queue time, but we don't want them to remain idle in their own separate partition as our users' jobs seldom require that much memory. Is there a way to limit the wall time on these nodes for jobs not requiring much memory, to ensure that the hardware is available in a relatively short amount of time? 

4. We have our primary and backup servers up and running. When we kill slurmctld on the primary, the secondary picks up where the primary left off. Works beautifully. We were wondering if there is a way (command) we could use to determine which server is active?
Comment 1 Brian Christiansen 2014-12-18 07:58:12 MST
Feel free to mark your configuration questions with severity levels 1-4. In theory you could have a configuration question that is more important than others. You can reserve 5 for feature requests. 

1. If you don't want users to be concerned with seeing the thread count and/or you don't plan on using the hyper-threads, the best thing would be to turn off hyper-threading in the BIOS.

If you still have a use case for the hyper-threads or would rather not change the BIOS, we recommend the following configuration.

First, we recommend configuring each node with the actual resources it has. This will help Slurm make the right choices when placing tasks, and will also stop Slurm from complaining about the nodes. In addition to the existing NodeName line, add CPUs=<core_count -- not thread_count>. This tells Slurm to schedule only the cores, and Slurm will then report the total core count instead of the thread count.

NodeName=vmp[101-103,105-110,112-120] RealMemory=60000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2

We also recommend setting up task affinity using the task/cgroup task plugin. This way the tasks will be bound to the appropriate cores. With your current setup, tasks aren't being bound at all and are free to move around on the socket.

Without a task plugin, a job is not bound to any part of the node and can, maliciously or not, run wherever it wants.

You can set SelectTypeParameters to just CR_Core_Memory; in this scenario, CR_ONE_TASK_PER_CORE doesn't matter.
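Putting these recommendations together, the relevant slurm.conf fragment might look like the following. This is a sketch based on the values discussed above, not the full attached file, and the SelectType line is assumed from the use of CR_* parameters:

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/cgroup
NodeName=vmp[101-103,105-110,112-120] RealMemory=60000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
```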

2. You can put features on a node and call them out with the --constraint option.
NodeName=node11 State=UNKNOWN Feature=bigmem

sbatch --constraint=bigmem

3. One possible solution is to do:

1. Assign a bigmem feature to the 8 nodes and put a smallmem feature on the rest.
2. Put a higher weight on the bigmem nodes so that they are considered last.
NodeName=node11 State=UNKNOWN Feature=bigmem Weight=100
3. Write a job submit plugin that routes jobs to and from the bigmem nodes based off of their requested memory and walltime. It could do -- in order:
    1. Put the bigmem feature on the job if the job is requesting lots of memory.
    2. Put the smallmem feature on jobs that have a long walltime and don't request a lot of memory.
    3. Jobs that don't match the two previous steps will be allowed to run anywhere, but will tend to avoid the bigmem nodes because of their higher weight.
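In Slurm, the routing in step 3 would be implemented as a job_submit plugin (C or Lua). As a rough sketch of just the decision logic, with made-up 64 GB and 12-hour cutoffs (not values from this ticket):

```shell
#!/bin/bash
# Hypothetical routing logic for a job submit plugin (sketch only).
# The 64 GB (65536 MB) and 12 h (720 min) cutoffs are placeholders.
route_job() {
  local mem_mb=$1 walltime_min=$2
  if (( mem_mb >= 65536 )); then
    echo "bigmem"      # step 1: big jobs get the bigmem feature
  elif (( walltime_min > 720 )); then
    echo "smallmem"    # step 2: long, small jobs stay off bigmem nodes
  else
    echo ""            # step 3: no feature; node weights do the rest
  fi
}

route_job 131072 60    # prints: bigmem
route_job 4096 1440    # prints: smallmem
```

A real plugin would read the job's requested memory and time limit from the job descriptor and set the feature string accordingly.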

4. Try "scontrol ping"

I see that you are using accounting and job completion logging. Unless needed, you could turn off the jobcomp logging as you can get that information from the accounting database.

We also recommend configuring the cgroup ProctrackType plugin. This fences in the job's processes so that they can't go anywhere they shouldn't, and it makes cleanup after a process easy.
Be sure to set up a cgroup.conf and create the release agent files.
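A minimal cgroup.conf along these lines might look like the following; the mount point and release-agent paths are illustrative, and the attached example may differ:

```
CgroupAutomount=yes
CgroupMountpoint=/cgroup
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
```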

I've attached a modified slurm.conf with the changes talked about. I also attached an example cgroup.conf.

Let us know if you have any other questions.

Comment 2 Brian Christiansen 2014-12-18 08:20:12 MST
Created attachment 1518 [details]
Updated slurm.conf
Comment 3 Brian Christiansen 2014-12-18 08:20:43 MST
Created attachment 1519 [details]
Example cgroup.conf
Comment 4 Will French 2014-12-19 07:48:34 MST
Thanks, Brian. This is very helpful. 

About cgroups - do CgroupMountpoint and CgroupReleaseAgentDir need to be local to each box, or can we put these on our parallel file system that gets mounted across all nodes?
Comment 5 Brian Christiansen 2014-12-19 10:29:38 MST
CgroupMountpoint must be local because it's a kernel filesystem like /proc. CgroupReleaseAgentDir can be on a shared file system because the release agents are just scripts.

Let us know if you have more questions.

Comment 6 Will French 2014-12-22 05:42:23 MST
I have updated our slurm.conf file to use proctrack/cgroup and task/cgroup. Can you suggest a good way to test that these plugins are functioning correctly?

I have tried testing by running an MPI application in which I request a small number of tasks (--ntasks=4) but then launch a greater number (e.g. 8) of processes within a node (this is a dual quad core node). If I log into the node and run top, it shows 8 processes running at full load, which I assume means that the cores are not being partitioned off correctly? If I ask for 16 processes in a node, top reports 16 processes running at ~50% load each.

As a side note, to date we have not gotten OpenMPI or MPICH2 compiled programs to launch properly using srun. What we see is a single process launched by MPI, even when we've requested a larger number of processes with the -n flag. Things work fine when we use OpenMPI's or MPICH2's native mpirun or mpiexec commands. This is how I ran the test described above. 

srun appears to work correctly (the appropriate number of processes are launched) when running an app compiled with Intel's MPI library. In this case, srun throws an error when we try launching more processes than allocated cores, so we cannot run the test above with srun.

Library version info:

OpenMPI 1.4.4
MPICH2 1.4.1p1
GCC 4.6.1
Intel Cluster Studio 2013 (sp1.2.144)
Comment 7 Brian Christiansen 2014-12-24 06:10:15 MST
For testing task affinity, first get the cpu mask of the job. You can get the mask in several ways.

1. Use the attached whereami.c program (slightly modified version of testsuite/expect/test1.91.prog.c) and run it with srun. It prints out the cpu mask of each task.

brian@compy:~/slurm/14.11/compy$ srun -n1 ~/tools/whereami
   0 compy1     - MASK:0x11

2. Or you can get the mask by passing the program's pid to the taskset command:

brian@compy:~/slurm/14.11/compy$ taskset -p 21793
pid 21793's current affinity mask: 11

3. Or look in the cgroup:
brian@compy:~/slurm/14.11/compy/cgroup/cpuset/slurm_compy1/uid_1003/job_41518$ cat cpuset.cpus
brian@compy:~/slurm/14.11/compy/cgroup/cpuset/slurm_compy1/uid_1003/job_41518/step_0$ cat cpuset.cpus 

You then correlate the mask to the cpus. The taskset man page explains this well:

man taskset:
The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical  CPU
and  the highest order bit corresponding to the last logical CPU.  Not all CPUs may exist on a given system but
a mask may specify more CPUs than are present.  A retrieved mask will reflect only the bits that correspond  to
CPUs physically on the system.  If an invalid mask is given (i.e., one that corresponds to no valid CPUs on the
current system) an error is returned.  The masks are typically given in hexadecimal.  For example,

       0x00000001 is processor #0

       0x00000003 is processors #0 and #1

       0xFFFFFFFF is all processors (#0 through #31).
end man

You can use lstopo, or hwloc-ls, to find the cpu numbering.

brian@compy:~$ hwloc-ls
Machine (7952MB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)

For example, if I request 2 tasks the job uses two cores because I've configured slurm to only consider the cores.
brian@compy:~/slurm/14.11/compy$ srun -n2 whereami | sort
   0 compy1     - MASK:0x11
   1 compy1     - MASK:0x22

If you break this out into binary I have:
0001 0001
0010 0010
So task 0 is using threads 0 and 4, which is core 0. And task 1 is using threads 1 and 5, which is core 1.
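Decoding these masks by hand gets tedious on wider nodes; a small helper script (not part of Slurm, just a convenience sketch) can expand a hex affinity mask into the logical CPU numbers it covers:

```shell
#!/bin/bash
# Expand a hex CPU affinity mask (e.g. 0x11) into the logical CPU
# numbers it covers; the lowest-order bit corresponds to CPU 0.
mask_to_cpus() {
  local mask=$((16#${1#0x}))
  local cpu=0
  local list=()
  while (( mask )); do
    (( mask & 1 )) && list+=("$cpu")
    (( mask >>= 1, cpu++ ))
  done
  echo "${list[*]}"
}

mask_to_cpus 0x11   # prints: 0 4   (core 0 in the layout above)
mask_to_cpus 0x22   # prints: 1 5   (core 1)
```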

Another example:
brian@compy:~/slurm/14.11/compy$ srun -n4 whereami | sort
   0 compy1     - MASK:0x11
   1 compy1     - MASK:0x22
   2 compy1     - MASK:0x44
   3 compy1     - MASK:0x88

0001 0001
0010 0010
0100 0100
1000 1000
So task 0 gets core 0, task 1 gets core 1, and so on...

If I run a job that just spins on the cpu, you'll see that cpus that task is bound to will be the only ones that are utilized.
brian@compy:~/slurm/14.11/compy$ cat ~/jobs/spin.sh 

while [ 1 ]; do echo hello > /dev/null; done;
brian@compy:~/slurm/14.11/compy$ srun -n1 ~/jobs/spin.sh 
   0 compy1     - MASK:0x11

The spin would only consume the first core, i.e. threads 0 and 4. You wouldn't see it spinning on the other cpus.

If I submit the spin job like this:
brian@compy:~/slurm/14.11/compy$ srun -n1 --cpus-per-task=2 ~/jobs/spin.sh 
   0 compy1     - MASK:0x33

The program can bounce around on the first two cores.

Another test for cgroups is to confirm that detached processes are cleaned up. With the proctrack/pgid plugin, a process that detaches from its parent will continue running after the job is done. With cgroups, all of the processes spawned in the job will be cleaned up.
brian@compy:~/ctests/detach$ cat detach2.c 
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main() {
	/* body reconstructed -- the original attachment was truncated */
	if (fork() > 0)
		exit(0);  /* parent exits so the srun step can complete */
	setsid();         /* child leaves the job's process group */
	sleep(60);        /* ...and keeps running after the job ends */
	return 0;
}

With proctrack/pgid:
brian@compy:~/slurm/14.11/compy$ srun ~/tools/detach 
brian@compy:~/slurm/14.11/compy$ ps -ef | grep detach
brian     9129 21862  0 10:31 ?        00:00:00 /home/brian/tools/detach

With proctrack/cgroup:
brian@compy:~/slurm/14.11/compy$ srun ~/tools/detach 
brian@compy:~/slurm/14.11/compy$ ps -ef | grep detach
brian     9267 30758  0 10:32 pts/15   00:00:00 grep --color=auto detach
Comment 8 Brian Christiansen 2014-12-24 06:13:01 MST
For the MPI issues, have you looked at the following documentation?

Comment 9 Brian Christiansen 2014-12-29 03:00:35 MST
Created attachment 1526 [details]
whereami.c
Comment 10 Will French 2015-01-02 05:11:28 MST
Hi Brian,

Thank you for the test scripts. I've run the tests you suggested and all the tasks appear to be getting pinned to the appropriate CPUs. Process cleanup also seems to be functioning correctly. These are very nice features and one of the main reasons we became interested in SLURM in the first place.

I'd like to get more feedback about the MPI test I've run. Here's the script:

---------------BEGIN SCRIPT----------------------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --constraint=intel
#SBATCH --mem=756M
#SBATCH --time=00:10:00    
#SBATCH --output=my.stdout
#SBATCH --job-name="single_node_test"

# get the task affinity info
echo "   "
echo "**********************"
echo "Task Affinity:  "
echo "   "
whereami | sort
echo "**********************"

mpiexec -n 8 lmp_openmpi -in npt.in
------------------END SCRIPT----------------------

As you can see, I've requested 4 tasks but then launch 8 processes using mpiexec. srun will not allow this, which is a nice feature. However, we've had trouble getting this older version of OpenMPI to work with srun, so we're still running mpiexec in this scenario. We have users on our cluster who are using this version of OpenMPI, so it's important that it behave properly. We do plan to update to a more recent version of OpenMPI in the coming months.

Next what I do is log into the node where the job allocation was granted, and run top. Here is some of that output:

-------------------BEGIN TOP OUTPUT----------------
top - 12:47:52 up 254 days, 23:19,  1 user,  load average: 2.70, 2.21, 1.29
Tasks: 350 total,   9 running, 341 sleeping,   0 stopped,   0 zombie
Cpu(s): 20.3%us,  0.8%sy,  0.2%ni, 73.8%id,  4.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24597700k total, 12580012k used, 12017688k free,   223040k buffers
Swap: 32768544k total,  2152172k used, 30616372k free,  9292804k cached

10867 frenchwr   5 -15  191m  13m 7152 R 99.4  0.1   0:22.17 lmp_openmpi
10868 frenchwr   5 -15  191m  13m 7320 R 99.4  0.1   0:22.43 lmp_openmpi
10869 frenchwr   5 -15  191m  13m 7364 R 99.4  0.1   0:22.43 lmp_openmpi
10870 frenchwr   5 -15  190m  12m 6936 R 99.4  0.1   0:22.44 lmp_openmpi
10871 frenchwr   5 -15  191m  13m 7140 R 99.4  0.1   0:22.42 lmp_openmpi
10872 frenchwr   5 -15  190m  12m 6880 R 99.4  0.1   0:22.42 lmp_openmpi
10874 frenchwr   5 -15  190m  12m 6552 R 99.4  0.1   0:22.44 lmp_openmpi
10873 frenchwr   5 -15  190m  12m 6896 R 97.4  0.1   0:22.43 lmp_openmpi 
-------------------END TOP OUTPUT------------------ 

As you can see, 8 processes are running at almost max CPU load. Is this an indication that the job is able to access all 8 CPU cores on this node? I expected the processes to be running at closer to half load since the allocation requested 4 tasks (CPUs), but perhaps that is misguided? Note that if I request 16 processes that these run at approximately half load.

Here's the task affinity output (I interpret this to mean 4 cores are allocated, with 2 threads per core. This seems correct.):

Task Affinity:

   0 vmp552 - MASK:0x3333

Here's how we have the node configured:

NodeName=vmp552 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.18 Features=intel
   NodeAddr=vmp552 NodeHostName=vmp552 Version=14.11
   OS=Linux RealMemory=20500 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2014-04-22T14:28:32 SlurmdStartTime=2014-12-23T15:41:25
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 11 Brian Christiansen 2015-01-02 08:44:00 MST
Hey Will,

Regarding your MPI test, it looks like the job is getting access to 4 cores and is using the whole core -- 2 threads per core. If you would like jobs to use only one thread per core, you would want to use the task/affinity TaskPlugin instead of task/cgroup, along with the --hint=nomultithread sbatch/srun option. Note that task/affinity had a bug where it didn't account for the CPUs option being set to the core count, as you are using; this is fixed in 14.11.3 (not released yet).


brian@compy:~/slurm/14.11/compy$ srun -n2 ~/tools/whereami
   1 compy1     - MASK:0x22
   0 compy1     - MASK:0x11
brian@compy:~/slurm/14.11/compy$ srun -n2 --hint=nomultithread ~/tools/whereami
   0 compy1     - MASK:0x1
   1 compy1     - MASK:0x2

--hint=nomultithread doesn't work with task/cgroup. If you want this to be the default, you can set up the SLURM_HINT environment variable.

brian@compy:~/slurm/14.11/compy$ export SLURM_HINT=nomultithread
brian@compy:~/slurm/14.11/compy$ srun -n2 ~/tools/whereami
   0 compy1     - MASK:0x1
   1 compy1     - MASK:0x2

Is this behavior more along the lines that you are trying to accomplish?

Comment 12 Will French 2015-01-03 02:27:49 MST
Hi Brian,

Thanks for the clarification. We are fine with allowing hyperthreading but I was a little unclear on whether the CPU load reported by top was expected to be at full or at half per process with hyperthreading. After running some benchmarks and running the same test on our AMD nodes (no hyperthreading), I'm convinced that CPUs are indeed being partitioned off correctly. Great!

The final aspect of cgroups I'd like to verify is functioning properly is RAM usage. I've run some tests in which I intentionally request too little memory and the job fails with an error message, which is what I expected. However, I still have a few specific questions:

1. When I run free -m from within a job, it appears that the job can still "see" all available memory on a node. This is a little different than how CPUs are handled, where the job can only see the cores that were allocated -- is that correct?

2. My understanding is that with cgroups a job cannot run a node out of memory (thus affecting other jobs on the node). Is this correct? For example, if two jobs are sharing a node, and one job suddenly tries to allocate an array that would require more RAM than is available on the node, what will happen?

3. I saw this disclaimer:

"There can be a serious performance problem with memory cgroups on conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due to contention between processors for a spinlock. This problem seems to have been completely fixed in the 2.6.38 kernel."

We're running CentOS 6.5 and uname -r shows 2.6.32-358.14.1.el6.x86_64. Just want to verify that we would not be affected.

4. Do you recommend any changes to the cgroup.conf file for better controlling memory usage? The defaults appear reasonable.
Comment 13 Brian Christiansen 2015-01-06 05:16:58 MST
1. Correct. free shows the whole system. The job would need to look at its cgroup hierarchy.

2. Yes, cgroups will help prevent jobs from running a node out of memory. Jobs are fenced into their requested memory.

For example, with ConstrainRAMSpace=yes, if I submit a job with --mem=50, Slurm will create a cgroup with a 50 MB limit. This limits the job to 50 MB of physical memory; if the job goes over that amount, the oom_killer will step in and kill the process. In my tests, the job spills into virtual (swap) space while staying under the physical limit, and is eventually killed when it does hit the 50 MB limit. You can limit swapping by setting ConstrainSwapSpace=yes. You can also control how much of the requested memory is put into the cgroup's limit by setting AllowedRAMSpace and AllowedSwapSpace in cgroup.conf.
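For reference, the memory-related cgroup.conf settings mentioned here might be combined like this (values illustrative; Allowed[RAM|Swap]Space are percentages of the requested memory):

```
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
```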

I've included my programs that I've been testing with. You can use those to get a feel how it's working.
srun --mem=50 ~/ctests/eat_mem/eat_while
srun --mem=50 ~/ctests/eat_mem/eat 200

3. As far as we know 2.6.32-358 should be fine. Bug 1054 is one example of it being confirmed to work.

4. I would start with these:

Then you can tune with Allowed[RAM|Swap]Space if needed.
Comment 14 Brian Christiansen 2015-01-06 05:18:54 MST
Created attachment 1534 [details]
Memory eating test programs.
Comment 15 Will French 2015-01-07 04:14:56 MST
Thanks, Brian. I think we are all set with Cgroups. You can go ahead and close this ticket. We'll be opening others in the future for separate issues. 

Thanks again,

Comment 16 Brian Christiansen 2015-01-07 04:16:47 MST
Cool. Let us know how we can help.