Bug 11183 - job_container/tmpfs empty at start
Summary: job_container/tmpfs empty at start
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.5
Hardware: Linux
OS: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-23 13:10 MDT by Paul Edmon
Modified: 2021-03-25 14:57 MDT
CC List: 0 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Paul Edmon 2021-03-23 13:10:26 MDT
I know this is a pretty new feature, so the usual caveat that all sorts of unknown bugs may exist applies. Still, we have a secure environment we would like to use this for, for sharding out local scratch space. I set this up on our test cluster to kick the tires with this setting:

[pedmon@holy2a24201 ~]$ cat /etc/slurm/job_container.conf
AutoBasePath=true
Basepath=/scratch

When I fire off an interactive session I see:

[pedmon@holy2a24201 ~]$ salloc -w holy2a24202
salloc: Pending job allocation 67113078
salloc: job 67113078 queued and waiting for resources
salloc: job 67113078 has been allocated resources
salloc: Granted job allocation 67113078
salloc: Waiting for resource configuration
salloc: Nodes holy2a24202 are ready for job
bash: ulimit: max user processes: cannot modify limit: Operation not permitted
[pedmon@holy2a24202 ~]$ cd /scratch
[pedmon@holy2a24202 scratch]$ ls -latr
total 8
drwxr-xr-x.  2 root root    6 Apr  2  2020 crash
drwxr-xr-x.  3 root root   19 Mar 22 13:06 cache
dr-xr-xr-x. 25 root root 4096 Mar 23 14:54 ..
drwxrwxrwt. 12 root root 4096 Mar 23 14:54 tmp
drwxrwxrwt.  6 root root   59 Mar 23 15:08 .
drwx------   3 root root   49 Mar 23 15:08 67113078

So it made a directory with the JOBID, which is great, but it isn't owned by my user so I can't use it, nor have /tmp or /dev/shm been mounted there. I didn't see any errors in the slurmd log even after running with -Dvvvvv, but I was wondering if you had any insight into what's going on or what configuration mistake I've made.

Thanks!
Comment 1 Marshall Garey 2021-03-24 17:56:35 MDT
From what I can see you don't have anything misconfigured. Here's how the job_container/tmpfs plugin works.

Here's my job_container.conf:

# job_container.conf
AutoBasePath=true
BasePath=/home/marshall/job_containers

First, a quick explanation for why my directories will look different than yours:

I'm using multiple-slurmd to simulate a real cluster, but it's all on a single node. With multiple-slurmd there is a directory inside BasePath for each node, and the job_id directories are created inside the directory of the node(s) where the job runs. I have nodes n1-[1-10] defined in my slurm.conf. Since you aren't using multiple-slurmd, you won't see the nodename directories inside BasePath, just the job_id directories (see the sketch below).
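
A hypothetical layout (job IDs 609 and 610 from the examples later in this comment) would look roughly like this; on your node the job_id directories sit directly under BasePath, which matches the 67113078 directory you saw under /scratch:

/home/marshall/job_containers/
    n1-1/
        609/
            .ns        <- bind-mounted handle for the job's mount namespace
            .609/      <- backing directory for job 609's private /tmp
    n1-2/
        610/
            .ns
            .610/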


Before a job is launched, slurmd creates a mount namespace for the job at <BasePath>/<job_id>/.ns. slurmd makes the root (/) directory a private mount for the job. Then, a private /tmp directory is mounted at <BasePath>/<job_id>/.<job_id>. Then the BasePath directory is unmounted so the job can't see the BasePath mount, but the job can see the private /tmp mount. From outside the job, you can't see the job's /tmp mount (since it's private), but you can see the BasePath mount and the mount namespace for the job. root can view what's actually in a job's /tmp by looking at <BasePath>/<job_id>/.<job_id>.

In addition, /dev/shm is *not* mounted at the basepath. /dev/shm is unmounted; then a new private tmpfs is created at /dev/shm.
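
To make that concrete, here is a rough conceptual sketch in shell of what the plugin does per job. This is not the plugin's actual code: the paths are placeholders taken from your config, it has to run as root, and it glosses over details such as how the namespace is persisted at <BasePath>/<job_id>/.ns and how BasePath itself gets bind mounted.

# Conceptual sketch only -- not the job_container/tmpfs source.
BASE=/scratch              # BasePath from your job_container.conf
JOBID=67113078             # the job's ID

mkdir -p $BASE/$JOBID/.$JOBID                 # created by slurmd, owned by root

unshare --mount bash -c "
    mount --make-rprivate /                   # / becomes private inside the job's namespace
    mount --bind $BASE/$JOBID/.$JOBID /tmp    # per-job private /tmp backed by BasePath
    umount $BASE                              # hide BasePath (assumes it is itself a mount)
    umount /dev/shm                           # drop the node-wide /dev/shm
    mount -t tmpfs tmpfs /dev/shm             # fresh private tmpfs for the job
    ls -la /tmp /dev/shm                      # both look empty from inside
"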

Since slurmd runs as root, the directories in basepath are owned by root.

To try to illustrate how this is working, here are excerpts from findmnt from inside and outside the job.

Viewing mounts from outside the job:

$ findmnt -l -o target,source,fstype,propagation
TARGET                                     SOURCE                                              FSTYPE      PROPAGATION
/                                          /dev/nvme0n1p5                                      ext4        shared
/dev/shm                                   tmpfs                                               tmpfs       shared
/home/marshall/job_containers/n1-1         /dev/nvme0n1p5[/home/marshall/job_containers/n1-1]  ext4        private
/home/marshall/job_containers/n1-1/609/.ns nsfs[mnt:[4026533234]]                              nsfs        private


Viewing mounts from within the job:

$ findmnt -l -o target,source,fstype,propagation
TARGET                              SOURCE                                                      FSTYPE      PROPAGATION
/                                   /dev/nvme0n1p5                                              ext4        private
/tmp                                /dev/nvme0n1p5[/home/marshall/job_containers/n1-1/609/.609] ext4        private
/dev/shm                            tmpfs                                                       tmpfs       private


Here's my current /tmp directory from outside the job:

$ ls /tmp
cscope.49356/
sddm-:0-DxMYRL=
sddm-auth9350208e-6093-4d49-b324-ee6e2eb46be7=
ssh-I9zVn59cDxhk/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-apache2.service-6jY2vg/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-colord.service-rnwl5g/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-haveged.service-r7ENHf/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-ModemManager.service-tB7D9g/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-systemd-logind.service-ZlsM6g/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-systemd-resolved.service-HZKwJf/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-systemd-timesyncd.service-FqkVDi/
systemd-private-4864c0321a9141f18c5b1ca5545cd58b-upower.service-GvkFCi/
Temp-3dca24b8-59a7-44da-bd35-bf28d885c0da/
Temp-8a7d60ac-d3fa-4f72-8c03-644a5fc24bd0/
xauth-1000-_0


And my /tmp directory from inside the job:

$ ls /tmp
$

<It's empty>.

For a user to use these private /tmp or /dev/shm mounts in their job, they simply need to use "/tmp" or "/dev/shm". They can't use /basepath/jobid. The specifics of how these are mounted are hidden from the user and they're cleaned up at the end of the job.
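
For example, a minimal (hypothetical) batch script that exercises the private mounts just by writing to the normal paths:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Both files land in the job's private mounts, not the node-wide ones,
# and they disappear when the job ends and the namespace is torn down.
dd if=/dev/zero of=/tmp/testfile bs=1M count=10
cp /tmp/testfile /dev/shm/testfile
ls -la /tmp /dev/shm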

So from within the job, if I create a file in /tmp, then I can see it (as root) in /basepath/jobid/.jobid:

In the job:
$ touch qwerty

As root:
# ls /home/marshall/job_containers/n1-1/609/.609/
qwerty

If I allocate another job on another node and do the same thing:

In the job:
$ touch asdf

As root:
# ls /home/marshall/job_containers/n1-2/610/.610/
asdf

Does this make sense? Do you have any more questions about how it works?
Comment 2 Paul Edmon 2021-03-25 07:52:46 MDT
Ah okay, now I see it.  So it is working properly.  It would be good to
add this description to the docs, as it's not clear that it's making
hidden directories or what exactly it is doing, which makes it hard to
confirm it is actually doing what is intended.

I don't know if you want this to be a separate ticket but I have two 
feature requests for this:

1. Set the mount path to something other than /tmp: We actually have 
users using /scratch instead of /tmp for local temporary space.  So it 
would be better if we could direct it to create one of these for 
/scratch instead.

2. Multiple paths: In principle I could see wanting to do this for more 
than one storage path.  So have /tmp, /scratch and maybe /globalscratch 
all sharded like this.  Thus it would be a nice extension to be able to 
do this for multiple storage paths.

Thanks for the info!

-Paul Edmon-

Comment 3 Marshall Garey 2021-03-25 08:32:51 MDT
Hey Paul,

No worries! It was confusing to me and others at first, too. We already have an internal bug open to improve the documentation, and I just updated that bug.

As to your requests, there is already an existing enhancement ticket for this: bug 11135. They sound the same as others on there, but can you re-post them on bug 11135 so we have a record over there? It's good for us to know when lots of people want the same thing.

What you can do right now for your requests:

#1 - it should be really simple for you to change the plugin. Simply change "tmp" to "scratch" in the plugin where it mounts and unmounts the directory; there will probably be a few places to change it. Another way of doing this might be to symlink /scratch to /tmp in a job prolog, since the container plugin does its work immediately before the job prolog. I haven't tested making a symlink to /tmp in the prolog, but I think it would work (a rough sketch is at the end of this comment).

#2 - no workarounds that I can think of.
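
For #1, a minimal Prolog sketch of the symlink idea -- untested, as I said, and it assumes the path you symlink does not already exist as a real directory on the compute nodes. A symlink is resolved in the mount namespace of the process that follows it, so inside the job /scratch would point at the job's private /tmp; cleanup (e.g. in an Epilog) is left out of this sketch.

#!/bin/bash
# Hypothetical Prolog sketch for idea #1 above -- untested.
# Assumes /scratch does not already exist as a real directory on this node.
if [ ! -e /scratch ]; then
    ln -s /tmp /scratch
fi
exit 0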
Comment 4 Paul Edmon 2021-03-25 10:31:02 MDT
Cool I will do that.  Thanks.

This is a great new feature by the way, we are eager to use it.

-Paul Edmon-

Comment 5 Marshall Garey 2021-03-25 14:57:41 MDT
Sounds good - closing this as infogiven.