Bug 11093

Summary: job_container/tmpfs doesn't cleanup after slurmd restart
Product: Slurm
Reporter: Felix Abecassis <fabecassis>
Component: slurmd
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: agaur, alex, lyeager, marshall, plazonic, rundall, tdockendorf, ward.poelmans
Version: 21.08.x
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=18482
Site: NVIDIA (PSLA)
Version Fixed: 20.11.6
Bug Blocks: 11336    
Attachments: Fix restart of slurmd when tmpfs in use

Description Felix Abecassis 2021-03-15 16:08:50 MDT
It's unclear what guarantees Slurm offers regarding a live restart of slurmd on a compute node. It seems to work well even when there are running jobs, but if this isn't a supported use case, let me know and we can close this bug.


I'm testing the job_container/tmpfs plugin that was recently merged, on a single-node development setup, with the following configuration:
$ cat /etc/slurm/namespace.conf
BasePath=/var/run/slurm
AutoBasePath=true

I can verify that the per-job folder is indeed cleaned up normally when the job terminates:
$ srun sh -c 'echo $SLURM_JOB_ID; echo slurm > /tmp/foo ; sleep 30s'
31

$ sudo ls -la /var/run/slurm/ioctl/31
ls: cannot access '/var/run/slurm/ioctl/31': No such file or directory

But, if slurmd is restarted manually when the job is running, the storage will not be cleaned up during the job epilog, leaking storage or memory:
$ srun sh -c 'echo $SLURM_JOB_ID; echo slurm > /tmp/foo ; sleep 30s'
32

$ sudo ls -l /var/run/slurm/ioctl/32/.32
total 4
-rw-rw-r-- 1 fabecassis fabecassis 6 Mar 15 14:44 foo

The fini() function of this plugin unmounts the whole basepath, so a slurmd restart can't work:
https://github.com/SchedMD/slurm/blob/323e17897f9fe234d9b4920cae1c6bf536a1b990/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L166


As a result, when the restarted slurmd executes the epilog, it can't join the namespace since .ns is not a namespace file:
slurmd: error: container_p_join: setns failed for /var/run/slurm/ioctl/35/.ns: Invalid argument
slurmd: error: container_g_join(35): Invalid argument


Similarly, container_p_delete also fails for the same reason:
slurmd: error: container_p_delete: umount2 /var/run/slurm/ioctl/35/.ns failed: Invalid argument
slurmd: error: container_g_delete(35): Invalid argument

The folder cleanup is then not performed:
https://github.com/SchedMD/slurm/blob/323e17897f9fe234d9b4920cae1c6bf536a1b990/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L700-L704


I suppose the folder could always be cleaned up in p_delete, but container_p_join should probably be fixed too, otherwise I imagine it might break spank plugins that expect to run an epilog inside the per-job /tmp.
Comment 1 Trey Dockendorf 2021-03-16 16:53:32 MDT
Ohio Supercomputer Center will be evaluating and likely deploying job_container/tmpfs in the near future so please add site "Ohio State OSC" to those interested in seeing this bug fixed in the 20.11.x release series.
Comment 2 Carlos Tripiana Montes 2021-03-17 08:45:43 MDT
Dear Felix,

It seems like in slurmd.c:main:L359:

	if (container_g_restore(conf->spooldir, !conf->cleanstart))
		error("Unable to restore job_container state.");

which is:

extern int container_g_restore(char * dir_name, bool recover)

and then it calls to job_container_tmpfs.c:L183:

extern int container_p_restore(char *dir_name, bool recover)

does not distinguish between running with or without this flag:

-c     Clear system locks as needed. This may be required if slurmd terminated abnormally.

So it will always execute the restore to mount the namespace before trying to work with it. This means that container_p_join and container_p_delete should not fail if container_p_restore executes well.

There are some debug messages that are useful for tracking down this issue. If you can run slurmd in debug mode, reproduce the issue, and post a copy of the log, it would be much appreciated.

Just a final note: test on the latest master commit, if you can.

Thanks.
Comment 3 Felix Abecassis 2021-03-17 13:06:11 MDT
> So it will always execute the restore to mount the namespace before trying to work with it.

Despite the name, the container_p_restore function of job_container/tmpfs doesn't restore the previous state, it just re-initializes the BasePath directory. Slurmd was properly terminated in my case (otherwise fini() wouldn't have been called), and the restore function succeeded too:
slurmd: debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm-llnl/job_container.conf
slurmd: debug3: job_container/tmpfs: _parse_jc_conf_internal: empty init script detected
slurmd: debug:  job_container/tmpfs: container_p_restore: namepsace.conf read successfully
slurmd: debug3: job_container/tmpfs: container_p_restore: tmpfs: Base namespace created

Before the restart of slurmd, when a job is still running:
$ findmnt -R /var/run/slurm/ioctl
TARGET                    SOURCE                 FSTYPE OPTIONS
/run/slurm/ioctl          tmpfs[/slurm/ioctl]    tmpfs  rw,nosuid,nodev,noexec,relatime,size=3279188k,mode=755
└─/run/slurm/ioctl/52/.ns nsfs[mnt:[4026532562]] nsfs   rw

After the restart of slurmd:
$ findmnt -R /var/run/slurm/ioctl
TARGET           SOURCE              FSTYPE OPTIONS
/run/slurm/ioctl tmpfs[/slurm/ioctl] tmpfs  rw,nosuid,nodev,noexec,relatime,size=3279188k,mode=755

You can see that only the bind-mount of the BasePath (on top of itself) was re-done; the nsfs mount was not. Restoring the namespace file would require re-applying the logic of bind-mounting /proc/<job PID>/ns/mnt to /run/slurm/ioctl/52/.ns


> Just a final note, test in in latest master commit, if you can.

I should have mentioned that, apologies: I initially tested on master 2 days ago, at commit 323e17897f. Current master, db2446ac54, still has the issue (only minor changes were made to this plugin between those 2 commits).
Comment 4 Carlos Tripiana Montes 2021-03-18 03:55:56 MDT
Ah, yes, now I see the point.

Let me discuss this first with the people involved in the development of the plugin. I need to see if this is a limitation, a feature not yet implemented (but planned), or just a bug.

Thanks for such well-explained answers.
Comment 5 Carlos Tripiana Montes 2021-03-23 02:13:20 MDT
Hi Felix,

I'm going to have a closer look at this and see if we can implement a fix soon.

It would be great for us to have a copy of your slurm.conf and namespace.conf.

Finally, bear in mind that the plugin is under development. As an example, namespace.conf has been renamed to job_container.conf. The doc at https://slurm.schedmd.com/job_container.conf.html is constantly updated. After a repo update, if something fails, please check this doc first.

Thank you for your patience.
Comment 6 Carlos Tripiana Montes 2021-03-23 09:21:01 MDT
When I said "under development" I meant that this is the first stable release, and we are still actively receiving user input like yours; we will use this input to fix everything that gets discovered. If that means some changes in the doc, they will be available there.

Sorry for the misunderstanding.
Comment 7 Felix Abecassis 2021-03-23 09:49:24 MDT
My job_container.conf is fairly simple:
$ cat /etc/slurm/job_container.conf 
BasePath=/var/run/slurm
AutoBasePath=true

I don't think the slurm.conf is relevant here.
Comment 11 Josko Plazonic 2021-03-28 18:02:25 MDT
It's not only cleanup and epilog that has the problem - also any further job steps (for running jobs) will fail to join the namespace. Due to how /dev/shm is handled I am not sure that .ns mount can be recovered so in previous version of this plugin we just made fini function a no-op, i.e. return success with nothing done.
Comment 12 Felix Abecassis 2021-03-28 18:43:25 MDT
> It's not only cleanup and epilog that has the problem - also any further job steps (for running jobs) will fail to join the namespace.

True, that's a good point.

> Due to how /dev/shm is handled I am not sure that .ns mount can be recovered

If you have a PID that is guaranteed to be still in the original mount namespace created by the plugin (i.e., *not* a PID from the user's app, since the user might have created a new mount namespace), you can recover with something like this:
$ sudo mount --bind /proc/${PID}/ns/mnt /basepath/${SLURM_JOBID}/.ns
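For illustration, the same recovery expressed as a minimal C sketch (the helper name, PID and paths are hypothetical placeholders, not the plugin's actual code):

        /* Hypothetical sketch: re-bind a still-live mount namespace to the
         * per-job .ns file so that setns() works again. Assumes "pid" belongs
         * to a process that is still inside the job's original mount namespace. */
        #include <limits.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mount.h>
        #include <sys/types.h>

        static int rebind_job_ns(pid_t pid, const char *basepath, uint32_t job_id)
        {
                char src[PATH_MAX], dst[PATH_MAX];

                snprintf(src, sizeof(src), "/proc/%d/ns/mnt", (int) pid);
                snprintf(dst, sizeof(dst), "%s/%u/.ns", basepath, job_id);

                /* Equivalent of: mount --bind /proc/<pid>/ns/mnt <basepath>/<jobid>/.ns */
                return mount(src, dst, NULL, MS_BIND, NULL);
        }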

> we just made fini function a no-op, i.e. return success with nothing done.

I agree that it seems like the right approach. In addition container_p_restore must also be careful not to mount over the basepath again, if the bind-mount is already present.
Comment 16 Josko Plazonic 2021-03-31 15:37:57 MDT
There is one more problem - doing the following in container_p_restore:

        if (mount(jc_conf->basepath, jc_conf->basepath, "xfs", MS_BIND, NULL)) {
                error("%s: Initial base mount failed, %s",
                      __func__, strerror(errno));
                return SLURM_ERROR;
        }
        if (mount(jc_conf->basepath, jc_conf->basepath, "xfs",
                  MS_PRIVATE | MS_REC, NULL)) {
                error("%s: Initial base mount failed, %s",
                      __func__, strerror(errno));
                return SLURM_ERROR;
        }

This is done unconditionally, on every slurmd restart and it breaks tmpfs as well. With a check on this + removing unmount of .ns I get a reliable restart of slurmd and cleanup at the end of the job. Will add as patch to #11135.
Comment 17 Marshall Garey 2021-03-31 15:47:45 MDT
(In reply to Josko Plazonic from comment #16)
> There is one more problem - doing the following in container_p_restore:
> 
>         if (mount(jc_conf->basepath, jc_conf->basepath, "xfs", MS_BIND,
> NULL)) {
>                 error("%s: Initial base mount failed, %s",
>                       __func__, strerror(errno));
>                 return SLURM_ERROR;
>         }
>         if (mount(jc_conf->basepath, jc_conf->basepath, "xfs",
>                   MS_PRIVATE | MS_REC, NULL)) {
>                 error("%s: Initial base mount failed, %s",
>                       __func__, strerror(errno));
>                 return SLURM_ERROR;
>         }
> 
> This is done unconditionally, on every slurmd restart and it breaks tmpfs as
> well. With a check on this + removing unmount of .ns I get a reliable
> restart of slurmd and cleanup at the end of the job. Will add as patch to
> #11135.

Carlos is on vacation right now, so I'll just make a quick response.

Can you upload your patch here, not on 11135? It doesn't seem very relevant to the enhancement request in 11135 and just seems relevant for this bug.

I know Carlos was working on a patch for this issue before he left on vacation and I think he was close. If you upload your patch here he can compare it with what he's been working on.
Comment 18 Josko Plazonic 2021-03-31 16:07:45 MDT
Created attachment 18780 [details]
Fix restart of slurmd when tmpfs in use

Edited out non relevant stuff from 11135.

The patch removes code from the fini function and creates a file under /run to try to keep track of the bind mounts having been done for a specific basepath (in case the basepath changes). /run may not be appropriate for all systems, but on most it should be a tmpfs like /dev/shm, so it is cleared on boot.

One could also try to detect this in other ways, e.g. by parsing /proc/self/mountinfo and looking for a basepath entry that does *not* have shared: in the line; a sketch of such a check follows below. I am not sure there is any other way of detecting this specific bind mount, and it is obviously a lot messier to do.

Anyway, not a perfect solution but this should work for us for now.
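
To make the /proc/self/mountinfo idea concrete, a minimal sketch of such a check (field handling is simplified, mount points containing spaces are not unescaped, and this is not the attached patch):

        /* Hypothetical sketch: return 1 if basepath already appears in
         * /proc/self/mountinfo as a mount point whose optional fields do NOT
         * contain "shared:" (i.e. the private bind mount is already in place),
         * 0 if no such entry is found, -1 on error. */
        #include <stdio.h>
        #include <string.h>

        static int basepath_private_mount_present(const char *basepath)
        {
                char line[4096], root[4096], mntpoint[4096];
                FILE *fp = fopen("/proc/self/mountinfo", "r");
                int found = 0;

                if (!fp)
                        return -1;
                while (fgets(line, sizeof(line), fp)) {
                        /* Fields: id parent major:minor root mountpoint options [optional fields...] - fstype source */
                        if (sscanf(line, "%*d %*d %*d:%*d %4095s %4095s", root, mntpoint) != 2)
                                continue;
                        if (strcmp(mntpoint, basepath))
                                continue;
                        if (!strstr(line, "shared:")) {
                                found = 1;
                                break;
                        }
                }
                fclose(fp);
                return found;
        }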
Comment 19 Felix Abecassis 2021-03-31 17:53:07 MDT
Commenting on https://bugs.schedmd.com/show_bug.cgi?id=11135#c10

> But in my case slurmd recovered and the running job terminated fine

To clarify, the job terminates fine. But the slurmd log shows error and storage/memory will be leaked since this job is not cleaned up.

> And I killed slurmd using pkill. Maybe you killed more aggressively?

Nope, just ctrl-C (SIGINT). I don't think running slurmd under a systemd service would change the behavior though, Slurmd gets the opportunity to terminate properly. Actually, if you SIGKILL the process, the bug won't happen (since fini() will not be called).

> In tmpfs its the slurmstepd that actually keeps the namespace active even if upper directory gets unmounted- in this `/var/run/storage`.

The tmpfs is not the issue, it will indeed be kept alive because there are still references to it. The issue is the bind-mount of .ns, which is forever gone. So slurmd can't setns() anymore.
Comment 20 Aditi Gaur 2021-03-31 18:10:28 MDT
Okay, that clarifies it. I had assumed job termination was not clean. But when I recheck the directories, I can see that the previous content has not been cleaned up.

I will try and apply the patch mentioned in this bug to see if that does the cleanup correctly. 

One method I use for making sure slurmd doesn't mount a job's dir twice is to touch .active inside it. Maybe during fini we could check if there are any active namespaces and, if there are, not unmount the base dir. And then in restore check again if .active is present, and then not try to mount again. That's something that could work, but I would have to test it out.
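
A rough sketch of that check (the path layout is assumed here, not taken from the plugin):

        /* Hypothetical sketch: return 1 if the ".active" marker exists for a
         * job, meaning its namespace was already set up and should not be
         * mounted over again. */
        #include <limits.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        static int job_ns_active(const char *basepath, uint32_t job_id)
        {
                char marker[PATH_MAX];

                snprintf(marker, sizeof(marker), "%s/%u/.active", basepath, job_id);
                return access(marker, F_OK) == 0;
        }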
Comment 21 Aditi Gaur 2021-03-31 19:07:09 MDT
I tested the patch above and it seems to clean things up nicely.

I guess we could also do a check in the fini function as well and look for any active mounts by checking for the .active file inside a job mount dir. That file is already there and gets created for all active jobs (sbcast uses it to avoid re-creating the job container), so checking for it should be easy.

We don't care all that much about fini either way. If slurmd does get to fini where an unmount is necessary, we would mostly be rebooting at that point; but in other environments where the state after slurmd finishes is important, the unmount would have to be manual.

Either way- thanks for the patch Josko!
Comment 22 Felix Abecassis 2021-03-31 19:12:56 MDT
>  If slurmd does get to fini where an unmount is necessary we would mostly be rebooting at that point- but maybe in other environments where state after slurmd finish is important- the unmount would have to be manual. 

I think I would be fine if we dropped the unmount from fini(), as long as we get rid of the leak! Even if we don't always reboot in this scenario, we usually promptly restart slurmd.
Comment 25 Carlos Tripiana Montes 2021-04-12 10:09:22 MDT
Hi guys,

Let me go through all your comments, line by line. Thanks.

> But, if slurmd is restarted manually when the job is running, the storage will not be cleaned up during the job epilog, leaking storage or memory:
> $ srun sh -c 'echo $SLURM_JOB_ID; echo slurm > /tmp/foo ; sleep 30s'
> 32
> 
> $ sudo ls -l /var/run/slurm/ioctl/32/.32
> total 4
> -rw-rw-r-- 1 fabecassis fabecassis 6 Mar 15 14:44 foo
>
> The fini() function of this plugin unmounts the whole basepath, so a slurmd restart can't work:
> https://github.com/SchedMD/slurm/blob/323e17897f9fe234d9b4920cae1c6bf536a1b990/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L166

I have fully tested this extent to be false when the slurmd gets a signal that we control and slurmd shutdowns properly. But if it get SIGKILL, fini is not called and the basepath isn't umounted.

The key is that this only umount can umount all the dependant child mountpoints. It will leak the basepath mount resource if any child leaks too, like if its signaled with SIGKILL.

We can check this mounts and resources easily with a command like:

# For slurmd:
findmnt -M /home/tripi/slurm/master_11093/inst/run/slurm/n1 --output-all
# For a job:
findmnt -M /home/tripi/slurm/master_11093/inst/run/slurm/n1/63/.ns --output-all


> As a result, when the restarted slurmd executes the epilog, it can't join the namespace since .ns is not a namespace file:
> slurmd: error: container_p_join: setns failed for /var/run/slurm/ioctl/35/.ns: Invalid argument
> slurmd: error: container_g_join(35): Invalid argument
> 
> Similarly, container_p_delete also fails for the same reason:
> slurmd: error: container_p_delete: umount2 /var/run/slurm/ioctl/35/.ns failed: Invalid argument
> slurmd: error: container_g_delete(35): Invalid argument
> 
> The folder cleanup is then not performed:
> https://github.com/SchedMD/slurm/blob/323e17897f9fe234d9b4920cae1c6bf536a1b990/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L700-L704

That's true, on a restart it fails, because the plugin does not restore the mount point for running jobs. After a restart, with leaks or not, if a step wants to join, setns will fail; and at the end of the job the umount will fail (it's not mounted), so the file deletion is skipped.


> Despite the name, the container_p_restore function of job_container/tmpfs doesn't restore the previous state, it just re-initializes the BasePath directory. Slurmd was properly terminated in my case (otherwise fini() wouldn't have been called), and the restore function succeeded too:
> slurmd: debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm-llnl/job_container.conf
> slurmd: debug3: job_container/tmpfs: _parse_jc_conf_internal: empty init script detected
> slurmd: debug:  job_container/tmpfs: container_p_restore: namepsace.conf read successfully
> slurmd: debug3: job_container/tmpfs: container_p_restore: tmpfs: Base namespace created
> 
> Before the restart of slurmd, when a job is still running:
> $ findmnt -R /var/run/slurm/ioctl
> TARGET                    SOURCE                 FSTYPE OPTIONS
> /run/slurm/ioctl          tmpfs[/slurm/ioctl]    tmpfs  rw,nosuid,nodev,noexec,relatime,size=3279188k,mode=755
> └─/run/slurm/ioctl/52/.ns nsfs[mnt:[4026532562]] nsfs   rw
> 
> After the restart of slurmd:
> $ findmnt -R /var/run/slurm/ioctl
> TARGET           SOURCE              FSTYPE OPTIONS
> /run/slurm/ioctl tmpfs[/slurm/ioctl] tmpfs  rw,nosuid,nodev,noexec,relatime,size=3279188k,mode=755
> 
> You can see that only the bind-mount of the BasePath (on top of itself) was re-done; the nsfs mount was not. Restoring the namespace file would require re-applying the logic of bind-mounting /proc/<job PID>/ns/mnt to /run/slurm/ioctl/52/.ns

This further testing went in the right direction. fini gets called unless slurmd dies without the chance to shut down (SIGKILL). But the previous per-job mounts don't get restored.


> It's not only cleanup and epilog that has the problem - also any further job steps (for running jobs) will fail to join the namespace. Due to how /dev/shm is handled I am not sure that .ns mount can be recovered so in previous version of this plugin we just made fini function a no-op, i.e. return success with nothing done.

Yes, subsequent steps after a restart can't join the namespace, because the mount point is not bound. There's no "recover state on restart" feature implemented in this plugin yet. But we can't just remove the umount of the basepath from fini. It will leak the mount forever and ever until HW reboot, and it doesn't fix the issue of recovering mounts either.


>> Due to how /dev/shm is handled I am not sure that .ns mount can be recovered
> 
> If you have a PID that is guaranteed to be still in the original mount namespace created by the plugin (i.e., *not* a PID from the user's app, since the user might have created a new mount namespace), you can recover with something like this:
> $ sudo mount --bind /proc/${PID}/ns/mnt /basepath/${SLURM_JOBID}/.ns

This can be achieved using the function that checks the spool dir for job steps: "List stepd_available(const char *directory, const char *nodename)". Then check which jobs are alive with "int stepd_connect(const char *directory, const char *nodename, slurm_step_id_t *step_id, uint16_t *protocol_version)".

>> we just made fini function a no-op, i.e. return success with nothing done.
> 
> I agree that it seems like the right approach. In addition container_p_restore must also be careful not to mount over the basepath again, if the bind-mount is already present.

That's not correct. The bind-mount can have been leaked by the previous dead slurmd, but the next one can't use this mount since it was private to the dead pid. You need to mount the basepath again, and then recover the per-job mounts too for the same reason.


> There is one more problem - doing the following in container_p_restore:
> 
>         if (mount(jc_conf->basepath, jc_conf->basepath, "xfs", MS_BIND, NULL)) {
>                 error("%s: Initial base mount failed, %s",
>                       __func__, strerror(errno));
>                 return SLURM_ERROR;
>         }
>         if (mount(jc_conf->basepath, jc_conf->basepath, "xfs",
>                   MS_PRIVATE | MS_REC, NULL)) {
>                 error("%s: Initial base mount failed, %s",
>                       __func__, strerror(errno));
>                 return SLURM_ERROR;
>         }
> 
> This is done unconditionally, on every slurmd restart and it breaks tmpfs as well. With a check on this + removing unmount of .ns I get a reliable restart of slurmd and cleanup at the end of the job. Will add as patch to #11135.

It doesn't break anything, it just mounts it again. If fini was called, there are no leaks. If there were problems, then it mounts again and the leak is still there until the next HW reboot. It won't hurt. Remember you can join the namespace of the dead slurmd. But yes, we need a way to see if a job is still alive after a restart and remount its .ns too, or otherwise try to remove the files (but .ns will not be deleted if there's a leak for it, until slurmd starts after a HW reboot).

We have a patch that covers all the problems detected by you and some others detected by us; it is able to clean everything up and recover on startup. It's also able to handle leaks caused by a SIGKILL or similar: it tries to recover those as well and tries to umount and delete too. It can happen that an umount or a file removal can't be done; it skips them and will try again on the next reboot. It will do it right for sure if slurmd starts after a HW reboot.

Once the patch has made its way through review and been committed to master, we will notify you.

Regards,
Comment 27 Felix Abecassis 2021-04-12 11:11:04 MDT
> I have fully tested this extent to be false when the slurmd gets a signal that we control and slurmd shutdowns properly.

Can you clarify what is "false" here? Thanks

> It will leak the basepath mount resource if any child leaks too, like if its signaled with SIGKILL.

If the job gets a SIGKILL? I haven't seen a problem in this situation. Can you describe what you are seeing?

> It will leak the mount forever and ever until HW reboot, and it doesn't fix the issue of recovering mounts either.

I would argue that leaking a single mount consumes way less resources than leaking the storage of a job, which can consume an unbounded amount of memory or disk.

And what do you mean that it doesn't fix the issue of recovering mounts? It should work fine *as long as you don't over-mount the basepath*.


> That's not correct. The bind-mount can have been leaked by the previous dead slurmd, but the next one can't use this mount since it was private to the dead pid.

What do you mean it was "private to the dead pid"? The mount is not created inside a mount namespace, hence you can use it again.

> You need to mount the basepath again, and then recover the per-job mounts too for the same reason.

I don't think that's true as long as you don't bind-mount the basepath on top of itself.
Comment 28 Carlos Tripiana Montes 2021-04-13 01:41:37 MDT
Hi Felix,

Sorry if anything was not clear enough. I'll do my best to explain it and cover your questions.

> > I have fully tested this extent to be false when the slurmd gets a signal that we control and slurmd shutdowns properly.
> 
> Can you clarify what is "false" here? Thanks

The complete answer was:

"I have fully tested this extent to be false when the slurmd gets a signal that we control and slurmd shutdowns properly. But if it get SIGKILL, fini is not called and the basepath isn't umounted.

The key is that this only umount can umount all the dependant child mountpoints. It will leak the basepath mount resource if any child leaks too, like if its signaled with SIGKILL.

We can check this mounts and resources easily with a command like:

# For slurmd:
findmnt -M /home/tripi/slurm/master_11093/inst/run/slurm/n1 --output-all
# For a job:
findmnt -M /home/tripi/slurm/master_11093/inst/run/slurm/n1/63/.ns --output-all"

So, here the points are:

* If slurmd gets a signal, one of 2 things can happen:
  1. It catches the signal and handles a graceful shutdown.
  2. It does not catch the signal and dies without cleaning up.

* In scenario 1, fini is called not only for this plugin, but for all loaded plugins. So, this plugin will execute the umount of the basepath.

* In scenario 2, fini is not executed. It doesn't call umount, and the base NS bind gets leaked with that alive mount. The same applies for every active .ns bind mount bounded to running jobs.

> > It will leak the basepath mount resource if any child leaks too, like if its signaled with SIGKILL.
> 
> If the job gets a SIGKILL? I haven't seen a problem in this situation. Can
> you describe what you are seeing?

I was talking about the slurmd process itself, not about job commands or steps. The KILL signal can't be caught by any process: it makes the process die without the opportunity to execute any handler that would let it shut down gracefully. In this situation, scenario 2 explained above applies.

> > It will leak the mount forever and ever until HW reboot, and it doesn't fix the issue of recovering mounts either.
> 
> I would argue that leaking a single mount consumes way less resources than
> leaking the storage of a job, which can consume an unbounded amount of
> memory or disk.

I didn't mean to say we want to keep the files on disk. A patched slurmd should check whether it can unmount a namespace or not (because of some leak), but also delete all the files afterwards, no matter whether the umount(s) failed or not. It can also happen that, if there's a leak for a .ns mount, that .ns file can't be deleted (EBUSY errno), and the parent folder can't be deleted of course. You can just ignore this file and the parent folder, like the leaked bind mount, until the next fresh slurmd start (after a HW reboot). Then slurmd will be able to delete this .ns and the parent folder.

Note that here I'm explaining what Slurm should do after the patch; currently this is not implemented.
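
As an illustration only, a sketch of that "tolerate failures and keep going" cleanup (the paths are assumed, the real cleanup would remove the whole per-job tree, and this is not the actual patch):

        /* Hypothetical sketch: try to unmount and delete a job's .ns file and
         * its directory, but treat failures (e.g. EBUSY from a leaked mount)
         * as non-fatal and leave the rest for the next clean slurmd start. */
        #include <errno.h>
        #include <limits.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mount.h>
        #include <unistd.h>

        static void best_effort_job_cleanup(const char *basepath, uint32_t job_id)
        {
                char ns_path[PATH_MAX], job_dir[PATH_MAX];

                snprintf(ns_path, sizeof(ns_path), "%s/%u/.ns", basepath, job_id);
                snprintf(job_dir, sizeof(job_dir), "%s/%u", basepath, job_id);

                if (umount2(ns_path, 0))
                        fprintf(stderr, "umount2(%s): %s (ignored)\n", ns_path, strerror(errno));
                if (unlink(ns_path))
                        fprintf(stderr, "unlink(%s): %s (ignored)\n", ns_path, strerror(errno));
                /* rmdir() only succeeds once the per-job files are gone; a real
                 * implementation would remove the whole tree first. */
                if (rmdir(job_dir))
                        fprintf(stderr, "rmdir(%s): %s (ignored)\n", job_dir, strerror(errno));
        }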
 
> And what do you mean that it doesn't fix the issue of recovering mounts? It
> should work fine *as long as you don't over-mount the basepath*.

The key here, I think, is that we want to keep the conservative behaviour in which slurmd tries to umount everything before shutdown and mount it again on start, adding the ability to handle those problems with leaks. Reasons below.

> > That's not correct. The bind-mount can have been leaked by the previous dead slurmd, but the next one can't use this mount since it was private to the dead pid.
> 
> What do you mean it was "private to the dead pid"? The mount is not created
> inside a mount namespace, hence you can use it again.
> 
> > You need to mount the basepath again, and then recover the per-job mounts too for the same reason.
> 
> I don't think that's true as long as you don't bind-mount the basepath on
> top of itself.

Overall your approach is right, but it has 2 side problems after a slurmd restart:

1. If we don't handle umount, they leak and we don't treat them. Despite not being very problematic at first sight, a node can be alive for days and days holding huge amounts of different jobs. You can reach to the point of leaking too much mounts. Note that each mount has file descriptors assoc, and FDs have a limited number. We want to avoid leaks as much as possible in Slurm, no matter the type. So we are preparing a patch that minimizes leaks to only those the kernel doesn't let us clean in any way (those generated in the event of a slurmd crash: fini not called).

2. We can go in your direction and avoid the umounts, and then let p_delete be the only one in charge of removing the files for a job. We found that in some situations p_delete may not be called either, so we need a way to check on startup for dead jobs whose files need to be deleted from disk as well.


As a summary, our patch is going in the conservative direction of:

1. At shutdown: try to umount, and avoid kernel bind mount leaks.
2. At startup: remount everything again and check for dead jobs: free the kernel resources and delete the files.
3. Both at shutdown and at startup, avoid failing on those umounts or file deletions that fail due to previous leaks.

With this approach, we can keep the system as close to 100% clean as possible in every scenario.
Comment 31 Felix Abecassis 2021-04-13 09:45:29 MDT
> It doesn't call umount, and the base NS bind gets leaked with that alive mount. The same applies for every active .ns bind mount bounded to running jobs.

It's not necessarily a leak, as in: it is not gone forever. On restart Slurmd can recover all the bind mounts: the basepath and the namespace files. This is what we were discussing initially on this thread.

> It can also happen that, if there's a leak for a .ns mount, that .ns file can't be deleted (EBUSY errno), and the parent folder can't be deleted of course.

Shouldn't that only happen if the job terminates when slurmd is down, right? Is that possible?
If that's the case and you have a way to detect it, then it's safe to unmount the namespace file and perform normal cleanup, since this should be the last reference to the mount namespace of the job.

> If we don't handle umount, they leak and we don't treat them. Despite not being very problematic at first sight, a node can be alive for days and days holding huge amounts of different jobs. You can reach to the point of leaking too much mounts.

Sorry, but I don't see how that can happen as long as slurmd is restarted. Again, it's not necessarily a leak as long as slurmd is given the chance to cleanup each job that finished.

> Note that each mount has file descriptors assoc, and FDs have a limited number.

I don't think mounts are related to file descriptors here.
Comment 32 Aditi Gaur 2021-04-13 10:49:34 MDT
> I didn't mean to say we want to keep the files on disk. A patched slurmd should check whether it can unmount a namespace or not (because of some leak), but also delete all the files afterwards, no matter whether the umount(s) failed or not.

Just want to say that when umount fails, file deletion also mostly fails. I thought about not exiting after a umount failure in container_p_delete, but generally it becomes tough to delete anything after that point. Just something to keep in mind.
Comment 33 Felix Abecassis 2021-04-13 14:37:35 MDT
I would also like to clarify that no one suggested to remove the umount in container_p_delete. Obviously that would cause a massive leak (for no good reason, since the job is finished).

The suggestion was to remove the umount in fini(), e.g. the patch from Josko. There is no risk of a persistent leak in this situation. Only the basepath mount will persist, but per-job mounts will not leak (unless container_p_delete is not called, obviously, but that looks like a separate flaw in the job_container plugin API).
Comment 34 Carlos Tripiana Montes 2021-04-14 01:07:59 MDT
Hi,

Thanks all for the interesting comments and discoveries. They were very helpful to spot all the possible culprits and problems related to the issue. We have now a patchset proposal going through review process that will cover all the problems related to (including proper cleanup):

1. Safe restart
2. Crash + Start
3. Per job crashes
4. Spurious leaks

It also includes some other minor improvements. Once the patchset has made its way through review and been committed, we will notify you.

Regards.
Comment 48 Carlos Tripiana Montes 2021-04-22 00:47:19 MDT
Hi everybody,

Commits a3557b4cc1a..5e4fd8ebe4ff0 have been pushed both to 20.11 and master branches.

20.11.6 is going to be released soon, and it will include these fixes too.

We are going to mark this bug as resolved for now. Please reopen it if necessary.

Regards.
Comment 50 Felix Abecassis 2021-05-21 10:54:29 MDT
As a FYI for the long list of folks CC'ed here, I've filed two new bugs related to slurmd restart (from the current Slurm master branch):
https://bugs.schedmd.com/show_bug.cgi?id=11673
https://bugs.schedmd.com/show_bug.cgi?id=11674

They are probably less impactful than this one, since it seems it could only cause *some* jobs to fail, instead of persistently leaking storage on a compute node.