Bug 11123 - New job_container/tmpfs plugin does not work, produces numerous errors
Summary: New job_container/tmpfs plugin does not work, produces numerous errors
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.5
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-17 12:26 MDT by Trey Dockendorf
Modified: 2021-03-18 14:16 MDT
CC: 2 users

See Also:
Site: Ohio State OSC


Attachments
slurm.conf (50.60 KB, text/plain)
2021-03-17 12:26 MDT, Trey Dockendorf

Description Trey Dockendorf 2021-03-17 12:26:23 MDT
Created attachment 18512
slurm.conf

I am testing job_container/tmpfs and it simply does not work as advertised.

My job_container.conf is this:

$ cat /etc/slurm/job_container.conf 
AutoBasePath=false
BasePath=/dev/shm

I want /dev/shm to be private inside a job, so that anything a user writes to /dev/shm is private to their job and cleaned up when the job ends.

Errors:

$ salloc -w slurmd01 -A PZS0708 srun --interactive --pty /bin/bash
salloc: Pending job allocation 2001790
salloc: job 2001790 queued and waiting for resources
salloc: job 2001790 has been allocated resources
salloc: Granted job allocation 2001790
salloc: Waiting for resource configuration
salloc: Nodes slurmd01 are ready for job
slurmstepd: error: container_p_join: open failed /dev/shm/2001790/.active: No such file or directory
slurmstepd: error: container_g_join failed: 2001790
slurmstepd: error: write to unblock task 0 failed: Broken pipe
slurmstepd: error: container_p_join: open failed /dev/shm/2001790/.active: No such file or directory
slurmstepd: error: container_g_join(2001790): No such file or directory
srun: error: slurmd01: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 2001790

$ sbatch -w slurmd01 -A PZS0708 --wrap 'scontrol show job=$SLURM_JOB_ID'                                             
Submitted batch job 2001791

$ cat slurm-2001791.out 
slurmstepd: error: container_p_join: open failed /dev/shm/2001791/.active: No such file or directory
slurmstepd: error: container_g_join failed: 2001791
slurmstepd: error: write to unblock task 0 failed: Broken pipe
slurmstepd: error: container_p_join: open failed /dev/shm/2001791/.active: No such file or directory
slurmstepd: error: container_g_join(2001791): No such file or directory
Comment 1 Trey Dockendorf 2021-03-17 12:27:06 MDT
To clarify, the node I am attempting to run on is configless. I have verified it has the latest configs, and for good measure I restarted slurmd before the errors were produced.
Comment 2 Trey Dockendorf 2021-03-17 12:28:52 MDT
Also, this issue is somewhat time sensitive. We have a center-wide downtime on March 31, and that is when we would deploy 20.11.5 to make use of this new feature to replace the SPANK plugin we currently use for private /dev/shm. Making this change live seems rather risky since we would need to change how private /dev/shm is handled, so it would be much easier to do during our March 31 downtime.
Comment 7 Marshall Garey 2021-03-17 16:10:45 MDT
Hi Trey,

Thanks for this bug report. I reproduced what you were seeing, but I found that the job_container/tmpfs plugin is working. However, we do need to clarify a couple of things in the documentation, and we already have an internal bug open for that (bug 11107, though you can't see it since it's private). No worries about the confusion; there was some confusion among some of us as well.

In bug 11109, you said:

"If we moved to job_container/tmpfs it looks like we'd be limited to /dev/shm only and not both /dev/shm and /tmp.  Is it possible with job_container/tmpfs to setup multiple private locations like /dev/shm and /tmp ? My read of the config docs and my initial read of code is that only one BasePath per either config or node group is allowed."


Actually, both /dev/shm and /tmp are created as private directories for the job. You are correct that only one BasePath is allowed per config or node group, but BasePath isn't doing what you think it is doing.

For each job, the job_container/tmpfs plugin creates a <job_id> directory and then creates private /tmp and /dev/shm directories inside that <job_id> directory. The user can then use /tmp and /dev/shm however they want in the job, and it will use these private directories. These directories will be torn down at the end of the job.

BasePath is where these directories are actually mounted. It needs to be a location where directories can be mounted and unmounted freely. BasePath cannot be /tmp or /dev/shm.

The errors you are seeing are because there are issues with mounting/unmounting directly in /tmp and in /dev/shm. You need to change BasePath to something that is not /tmp or /dev/shm. For example, if I set BasePath=/mnt, then everything is fine.
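
For illustration only, a minimal job_container.conf along those lines might look something like the following. The /var/spool/slurm/containers path is just an assumed example of a node-local location, not something taken from this bug:

# Hypothetical job_container.conf example; BasePath must NOT be /tmp or /dev/shm.
# AutoBasePath=true asks Slurm to create BasePath if it does not already exist.
AutoBasePath=true
# Node-local location where the plugin creates its per-job <job_id> directories:
BasePath=/var/spool/slurm/containers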

Hopefully I cleared this up. Can you let me know if this makes sense, or if I can clarify something?

- Marshall
Comment 8 Trey Dockendorf 2021-03-18 07:07:24 MDT
Thanks, I think I misunderstood what this plugin does. I was hoping it would let me make selective locations private, so that a private /dev/shm would be mounted at something like /dev/shm/slurm.$SLURM_JOB_ID but seen as /dev/shm inside the job. Having both /tmp and /dev/shm mounted to the same place, like a scratch directory, is unfortunately not what we need at this time.

At the very least I think some documentation updates are needed if this plugin will continue to behave as it currently does.

I think this case can be closed.  We will continue to use https://github.com/treydock/spank-private-tmp/tree/osc for making /dev/shm private within a job.
Comment 9 Marshall Garey 2021-03-18 10:24:29 MDT
Okay, so I looked at this again and I have good news and bad news. The bad news is that I apparently didn't know what I was talking about, and we really need to update the documentation. The good news is that I was wrong, and the plugin hopefully does what you want (or something close to it). We are also working on improving the documentation.


Okay, so let me start over and explain things as I now understand them. The job_container/tmpfs plugin's job is to create a private /tmp and a private /dev/shm for the job.

* The private /tmp is mounted inside BasePath in a <job_id> subdirectory:
$BasePath/<job_id>

static int _mount_private_tmp(char *path)
{
	if (!path) {
		error("%s: cannot mount /tmp", __func__);
		return -1;
	}
#if !defined(__APPLE__) && !defined(__FreeBSD__)
	if (mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL)) {
		error("%s: making root private: failed: %s",
		      __func__, strerror(errno));
		return -1;
	}
	if (mount(path, "/tmp", NULL, MS_BIND|MS_REC, NULL)) {
		error("%s: /tmp mount failed, %s",
		      __func__, strerror(errno));
		return -1;
	}
#endif
	return 0;
}


* The private /dev/shm: /dev/shm is unmounted, then a private tmpfs is mounted at /dev/shm.

static int _mount_private_shm(void)
{
	int rc = 0;

	rc = umount("/dev/shm");
	if (rc && errno != EINVAL) {
		error("%s: umount /dev/shm failed: %s\n",
		      __func__, strerror(errno));
		return rc;
	}
#if !defined(__APPLE__) && !defined(__FreeBSD__)
	rc = mount("tmpfs", "/dev/shm", "tmpfs", 0, NULL);
	if (rc) {
		error("%s: mounting private /dev/shm failed: %s\n",
		      __func__, strerror(errno));
		return -1;
	}
#endif
	return rc;
}

It was surprising to me (and others) how BasePath actually works.

So BasePath specifies where the private /tmp will actually be mounted, and it can't be /tmp or /dev/shm, which makes sense now that I understand what is actually happening.
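
To illustrate with hypothetical paths, say BasePath=/mnt and job ID 12345 (this is a sketch pieced together from the code above and the errors in your original report, not output captured from a real node):

On the host:
    /mnt/12345/           per-job directory created by the plugin
    /mnt/12345/.active    marker file that slurmstepd opens in container_p_join()

Inside the job's mount namespace:
    /tmp                  bind mount backed by the per-job directory under /mnt/12345 (_mount_private_tmp)
    /dev/shm              a freshly mounted private tmpfs (_mount_private_shm)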

Does this make sense?
Comment 10 Trey Dockendorf 2021-03-18 10:42:03 MDT
So the private /dev/shm is good. Is that cleaned up somehow, or is it just unmounted and goes away when the job ends?

For /tmp, we mount our compute nodes' local disk at /tmp, and we need to keep users using things like $TMPDIR (/tmp/slurm.$SLURM_JOB_ID) on that local disk. Is there a way to get the benefits of a private /dev/shm without making /tmp private? Would I be able to, say, enable job_container/tmpfs in slurm.conf but not define a BasePath, and still get the benefits of a private /dev/shm without getting a private /tmp?

Thanks,
- Trey
Comment 12 Marshall Garey 2021-03-18 13:00:48 MDT
(In reply to Trey Dockendorf from comment #10)
> So the private /dev/shm is good, is that cleaned up somehow or just
> unmounted and goes away when job ends?

Since it's a private tmpfs, my understanding is that the mount is purged when the last process dies, and Slurm kills all job processes when the job ends. There's nothing explicit in the job_container/tmpfs plugin that unmounts or cleans up /dev/shm, though.

For the private /tmp, the job_container/tmpfs plugin unmounts it from the topmost directory (BasePath); then the directories are traversed and the files and directories are removed. See the functions container_p_delete() and _rm_data() for how it actually works.
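
If it helps, here is a generic recursive-removal sketch in plain C (using POSIX nftw) that shows the traverse-and-remove pattern I described. It is only an illustration, not Slurm's actual _rm_data() implementation:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>

/* Called once per entry; FTW_DEPTH guarantees children are visited first,
 * so remove() always sees an empty directory by the time it runs. */
static int _rm_cb(const char *path, const struct stat *sb,
		  int type, struct FTW *ftwbuf)
{
	if (remove(path)) {
		perror(path);
		return -1;	/* non-zero return stops the walk */
	}
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return EXIT_FAILURE;
	}
	/* FTW_PHYS: do not follow symlinks while walking the tree */
	return nftw(argv[1], _rm_cb, 16, FTW_DEPTH | FTW_PHYS) ?
		EXIT_FAILURE : EXIT_SUCCESS;
}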


> For /tmp, we mount our compute node's local disk to /tmp and we need to keep
> users using things like $TMPDIR (/tmp/slurm.$SLURM_JOB_ID) on that local
> disk.  Is there a way to get the benefits of private /dev/shm but not make
> /tmp private?  Would I be able to maybe enable job_container/tmpfs in
> slurm.conf but not define a BasePath and still get benefits of private
> /dev/shm but not get private /tmp?
> 
> Thanks,
> - Trey

Not right now. It's certainly possible to add.
Comment 13 Trey Dockendorf 2021-03-18 13:59:40 MDT
Should this become an RFE, then, to get the ability to use job_container/tmpfs for a private /dev/shm without also using it for a private /tmp? Or should I open something new? It most likely won't be until our 2022 cluster that we could redo our partition schema and put local disk somewhere other than /tmp, so that it could then be mounted privately onto /tmp.
Comment 14 Marshall Garey 2021-03-18 14:07:29 MDT
Can you open a new bug for the RFE? That way it can be tracked much more easily, and somebody won't have to scroll through all the discussion on this bug.

Thanks for your patience with me on this one as I've figured out how this new plugin works.
Comment 15 Trey Dockendorf 2021-03-18 14:14:02 MDT
I opened RFE 11135. I believe this case can be closed, since I now know how this plugin works and what's needed for us to actually be able to use it, which is what the RFE should cover.
Comment 16 Marshall Garey 2021-03-18 14:16:00 MDT
Thanks! Closing as infogiven.