When using JobContainerType=job_container/tmpfs, it appears that the job-specific private /tmp directory remains owned by root in certain cases:

(1) User allocates 1+ nodes with salloc. Unless the user runs srun, the private /tmp directories remain owned by root, e.g.:

/lscratch/job_tmp/509:
drwx------ 2 root root 6 May 12 13:54 .509

A user might allocate 1+ nodes and then SSH to the nodes to do their work, but in this case the job's /tmp is inaccessible to them. If the user executes srun in the allocation, the job's /tmp becomes owned by them on all nodes in the allocation.

(2) Multi-node job allocated using sbatch. In this case the job's /tmp on the first (batch) node is owned by the user, but it remains owned by root on the other nodes in the allocation. As in #1, if/once an srun command is executed from the batch script, the job's /tmp directory becomes owned by the user and is accessible.
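For reference, a small script like the following can report the ownership of the per-job tmp directories (a sketch only; /lscratch/job_tmp is our site's BasePath from job_container.conf, and check_owner is just a helper name used here, not anything from Slurm):

```shell
#!/bin/sh
# Sketch: report whether a directory created by job_container/tmpfs under
# BasePath is owned by the current user (i.e. accessible over SSH).
# check_owner DIR  ->  "DIR: owned by USER (OK|NOT accessible)"
check_owner() {
    owner="$(stat -c '%U' "$1")"
    if [ "$owner" = "$(id -un)" ]; then
        printf '%s: owned by %s (OK)\n' "$1" "$owner"
    else
        printf '%s: owned by %s (NOT accessible)\n' "$1" "$owner"
    fi
}

# Demo on a scratch directory; on a compute node you would point this at
# the job directory instead, e.g.:  check_owner /lscratch/job_tmp/509
demo="$(mktemp -d)"
check_owner "$demo"
rmdir "$demo"
```

Running this inside an salloc + ssh session (against the real job directory) shows "root" as the owner until an srun has been executed in the allocation.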
Thanks for the report, I've been able to reproduce what you're describing! I'm taking a look to see what the cause is and I'll let you know what I find! Thanks, --Tim
As I mentioned in 11673, we did push a patch series to master that fixes this problem (and actually cleans the plugin up a bit at the same time), but it had a bug when restarting the slurmd with jobs running. The proper fix for it required changing the job_container API slightly, so it could only land in master. There are some possibilities for a workaround in 20.11 though. It is tied to *something* attempting to join the container, so if you use the interactive step feature instead of salloc + ssh it should handle the issue you noted initially. The second observation is trickier: is it actually causing an issue, or is it just a further observation stemming from the first problem? Thanks! --Tim
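For the salloc + ssh case, the interactive step workaround mentioned above would look roughly like this (a sketch, assuming 20.11's LaunchParameters=use_interactive_step option; exact behavior should be checked against the slurm.conf man page for your version):

```shell
# slurm.conf: have salloc launch an interactive step on the allocated node
# rather than a local shell, so the step join happens at allocation time
# and the container /tmp ownership gets fixed up.
#
#   LaunchParameters=use_interactive_step

# With that set, the user's workflow stays the same:
#   salloc -N 2
# and the resulting shell runs inside an interactive step in the
# allocation instead of on the login node.
```

This avoids the SSH-before-srun window in which the job's /tmp is still root-owned.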
Thanks, Tim. The 2nd scenario is actually problematic for batch jobs that use SSH to get MPI communication started. We were able to just forgo use of job_container/tmpfs for now, and that might remain our chosen path until the fix is usable. But it probably depends on the timeline. Does the fix going into master mean that it won't be available until the next major release of Slurm (~21.X), or might it be included in 20.11.8?
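Based on the behavior described above (an srun from the batch script fixes the ownership on all nodes), one possible interim workaround for the SSH-launched MPI case is a no-op srun at the top of the job script. This is only a sketch: mpiexec.hydra and my_mpi_app are illustrative placeholders, not anything from this ticket.

```shell
#!/bin/sh
#SBATCH -N 2
# Workaround sketch for scenario (2): touch every node with a no-op srun
# so job_container/tmpfs chowns the per-job /tmp on all nodes before the
# SSH-based MPI launcher starts its remote processes.
srun -N "$SLURM_NNODES" --ntasks-per-node=1 /bin/true

# Hypothetical SSH-bootstrapped MPI launch; substitute your real launcher.
mpiexec.hydra -bootstrap ssh ./my_mpi_app
```

The no-op srun costs essentially nothing and forces the container join on every node in the allocation.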
(In reply to Jake Rundall from comment #6)
> Thanks, Tim. The 2nd scenario is actually problematic for batch jobs that
> use SSH to get MPI communication started.

Thanks for the extra context there!

> We were able to just forgo use of job_container/tmpfs for now, and that
> might remain our chosen path until the fix is usable. But it probably
> depends on the timeline.
>
> Does the fix going into master mean that it won't be available until the
> next major release of Slurm (~21.X), or might it be included in 20.11.8?

The fix will be in 21.08, but won't be in 20.11.8 (due to the need to change the API). In my tests, if anything tries to join the container first (e.g. there is a prolog script), it will change the ownership before we try to launch the job. Thanks! --Tim
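Concretely, the prolog workaround can be as minimal as the following (a sketch; the path /etc/slurm/prolog.sh is illustrative, and the effect relies on the behavior described above, where any container join before job launch fixes the ownership):

```shell
#!/bin/sh
# /etc/slurm/prolog.sh -- minimal no-op prolog (must be executable).
# Referencing it from slurm.conf:
#
#   Prolog=/etc/slurm/prolog.sh
#
# is enough: slurmd joins the job container to run the prolog before
# launching the job, which changes the per-job /tmp ownership to the user.
:
```

No actual work needs to happen in the prolog; the container join itself is what matters.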
I just wanted to check in and see if this answered your question! Thanks, --Tim
Yes, thank you!
(In reply to Jake Rundall from comment #9)
> Yes, thank you!

Sounds good! I'll resolve this for now, but let us know if you have any other questions! Thanks! --Tim