Per the discussion started in bug #11123, it would be very useful if the job_container/tmpfs plugin could be used to make a private /dev/shm, as it currently does, but not a private /tmp. OSC currently only needs a private /dev/shm, and we mount our compute node local disks directly to /tmp. I imagine that enabling the job_container/tmpfs plugin in slurm.conf without defining a BasePath in job_container.conf would be one way to configure such a thing, if it were made possible.
*** Bug 11109 has been marked as a duplicate of this bug. ***
We, in Princeton, need private directories besides /tmp. I.e. we need /tmp to be configurable, say
so that each of these directories would then be private under BasePath.
I would also like to have a configurable list of directories. At a minimum /tmp, /dev/shm and /var/tmp would be needed.
We would also like to have a list of directories, ideally with the possibility to change mount type and mount options.
* Mount type: on cluster A, /tmp might need to be a tmpfs. On cluster B, /tmp might need to be a bind-mount from a local filesystem.
* Mount options: in addition to the memory cgroup, we want to limit the size of the tmpfs or back the tmpfs with huge pages, which requires additional mount options.
On one of our smaller clusters, we have our own homemade SPANK plugin to handle this, like others are doing right now. This plugin can be configured by a file in the fstab format, to satisfy the constraints above:
$ cat /etc/slurm/fstab
tmpfs /dev/shm tmpfs rw,nodev,nosuid,size=16G,huge=always 0 0
/raid/scratch /tmp none defaults,bind 0 0
But just having a configuration option for bind-mounts and a configuration option for tmpfs would be a good start.
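As an illustration of the fstab-style configuration above, here is a minimal sketch in Python of how such a file could be read and split into tmpfs mounts and bind-mounts. It is not the SPANK plugin mentioned in this comment; the names `MountEntry` and `parse_fstab` are hypothetical, and the only input assumed is the two example lines shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MountEntry:
    source: str         # first fstab field, e.g. "tmpfs" or "/raid/scratch"
    target: str         # mount point inside the job, e.g. "/tmp"
    fstype: str         # third fstab field, e.g. "tmpfs" or "none"
    options: List[str]  # comma-separated mount options, split into a list
    is_bind: bool       # True when "bind" appears among the options


def parse_fstab(text: str) -> List[MountEntry]:
    """Parse fstab-formatted text, skipping blank lines and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"malformed fstab line: {line!r}")
        source, target, fstype, opts = fields[:4]
        options = opts.split(",")
        entries.append(
            MountEntry(source, target, fstype, options, "bind" in options)
        )
    return entries


# The two example lines from /etc/slurm/fstab quoted above.
EXAMPLE = """\
tmpfs /dev/shm tmpfs rw,nodev,nosuid,size=16G,huge=always 0 0
/raid/scratch /tmp none defaults,bind 0 0
"""

for entry in parse_fstab(EXAMPLE):
    kind = "bind-mount" if entry.is_bind else entry.fstype
    print(f"{entry.target}: {kind} from {entry.source}")
```

Keeping the per-directory mount type and options in one table like this is what lets the same code serve a cluster where /tmp is a tmpfs and another where it is a bind-mount.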
Thanks for all the suggestions. We'll certainly be looking into some aspects of this, seeing how much interest it has attracted so quickly.
But - at the moment I cannot commit to any specific extensions. Additional directory configuration, and options to modify the mount options, both do strike me as useful, but will need further development.
If a site is interested in sponsoring some of this, and/or wishes to propose a patch, I'll certainly be willing to consider that.
Just to throw in our two cents from Harvard, two additional features we would like to see are:
1. Use a different directory than /tmp
2. Multiple directories able to be specified
This tmpfs plugin is really handy, thanks for putting it together.
Created attachment 18779 [details]
Adds an option to specify multiple dirs to handle for private tmp
Adds Dirs=/tmp,/var/tmp kind of option so one can have multiple job container tmpfs directories. They all use the same BasePath and if one does not specify Dirs it defaults to /tmp.
This patch also removes namespace unmount in fini + adds a file to /run to indicate that the bind mount base_path was done (in container_p_restore). With these changes restart of slurmd is reliable and does not break running jobs. Not sure if /run/ is a good path to use for this on all systems.
A few things could be simplified - e.g. temp dirs end up looking like /scratch/slurmtemp/3755/.3755/_var_tmp and their perm is 1777 (easier than changing ownership of each of /scratch/slurmtemp/3755/.3755/* to the user). A few more snprintf length checks could be added (but if your private dirs are close to PATH_MAX length you have other problems).
Anyway, it works in my tests.
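To make the directory naming described above concrete, here is a minimal sketch in Python of the mapping from a configured directory to its private location. The layout is inferred only from the single example path in this comment (/scratch/slurmtemp/3755/.3755/_var_tmp); the actual patch is written in C and may differ in detail, and the function name `private_dir_path` is hypothetical.

```python
import posixpath


def private_dir_path(base_path: str, job_id: int, target: str) -> str:
    """Build the per-job backing directory for one private mount point.

    Assumption: the directory name is the target path with every "/"
    replaced by "_", placed under <base_path>/<job_id>/.<job_id>/.
    """
    flattened = target.replace("/", "_")
    return posixpath.join(base_path, str(job_id), f".{job_id}", flattened)


# Reproduces the example path given in the comment above.
print(private_dir_path("/scratch/slurmtemp", 3755, "/var/tmp"))
# /scratch/slurmtemp/3755/.3755/_var_tmp
```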
This requirement has come up on our end as well: at a minimum to add `/var/tmp`, and ideally to have a configurable number of directories. I'll give the above patch a try when I have the time.
> This patch also removes namespace unmount in fini + adds a file to /run to
> indicate that the bind mount base_path was done (in container_p_restore). With
> these changes restart of slurmd is reliable
That's interesting. We have been using it in production for a while as well and have not seen restarts of slurmd being unreliable or disruptive to running jobs. A description of that behavior would certainly help!
Aditi: I described the slurmd restart issue here: https://bugs.schedmd.com/show_bug.cgi?id=11093
Thanks Felix. Interesting that you are seeing this behavior. On my end I just tried to reproduce it by setting BasePath to /var/run, submitting a job, killing a job, and then restarting slurm. In my case slurmd recovered and the running job terminated fine. Obviously there could be subtle differences here. Just for reference, this is the kernel I am on: 4.15.0-140-generic. The setns and mount calls can have some subtle differences on an older kernel.
This is what I used in namespace.conf:
NodeName=linux_vb BasePath=/var/run/storage AutoBasePath=true InitScript=/usr/local/etc/test.py
And I killed slurmd using pkill. Maybe you killed more aggressively?
Another thing that is helpful for debugging is that in my case slurmstepd was running when I killed slurmd:
root@linux_vb:/usr/local/etc# ps aux | grep slurm
root 3109 0.0 0.6 279968 6504 ? Sl 14:57 0:00 slurmstepd: [4.extern]
slurm 6742 0.0 0.8 690184 8860 ? Sl 15:17 0:00 /usr/local/sbin/slurmctld -i
root 8786 0.0 0.6 213404 6524 ? Sl 15:28 0:00 slurmstepd: [4.extern]
root 8811 0.0 0.6 346528 6596 ? Sl 15:28 0:00 slurmstepd: [4.interactive]
root 8963 0.0 0.1 14428 1048 pts/0 S+ 15:29 0:00 grep --color=auto slurm
With tmpfs, it's the slurmstepd that actually keeps the namespace active even if the upper directory gets unmounted, in this case `/var/run/storage`. As long as slurmstepd is safe during a job, the namespace should remain active even if the upper directory is unmounted. But again, if your kernel is older, the problems could be due to underlying factors, and if it's newer then we are all going to hit it soon anyway :)
I am personally unsure what the right approach is here, but it seems the patch above works for you all.
Aditi, I suggest we take this discussion to the other bug, I'll answer there.
Created attachment 19261 [details]
Adds an option to specify multiple dirs to handle for private tmp
Updated for 20.11.6
We use this plugin: https://github.com/hpc2n/spank-private-tmp. It can handle multiple directories, so it would be nice if this feature could also be enabled for the job_container/tmpfs plugin. Will this patch be applied?
I saw some activity on the "master" branch related to this bug, e.g.:
(In reply to Felix Abecassis from comment #26)
> I saw some activity on the "master" branch related to this bug, e.g.:
Looks good ;-) Thanks Felix for the info
Thank you for all the suggestions, we've landed patches that will allow it to work with an arbitrary set of Dirs based on some of the contributed code along with a few additional small changes. These changes would at the earliest appear in 23.02.
The other suggestions like having expanded mount options/types sound useful, but if there is interest in them I think we should talk about them in a new ticket.
Thank you again for the contributions and discussion on this plugin!