Ticket 6474

Summary: A single job that starts out seeing only one of the GPU cards (nvidia-smi) suddenly starts seeing both cards.
Product: Slurm    Reporter: Magnus Jonsson <magnus>
Component: slurmstepd    Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: SNIC
SNIC sites: HPC2N
Linux Distro: Ubuntu
Machine Name: kebnekaise
Attachments: logs and configuration files.

Description Magnus Jonsson 2019-02-11 03:00:36 MST
Created attachment 9116 [details]
logs and configuration files.

A user is running two (2) jobs at the same time on one node, each with one GPU.

After a while (several days), the jobs start seeing both GPUs.

The attached files include output from the user's jobs, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
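For concreteness, the monitoring loop was presumably something along these lines (a sketch only; the actual script is in attachment 9116):

```
# Run nvidia-smi every 4 hours from inside the batch job, so the output
# reflects the device restrictions of the job's cgroup.
while true; do
    date
    nvidia-smi
    sleep 4h
done
```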
Comment 1 Michael Hinton 2019-02-11 10:30:15 MST
Hey Magnus,

I'll go ahead and take a look at it.

Thanks,
Michael
Comment 2 Michael Hinton 2019-02-11 18:25:47 MST
Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting Feb 8, sometime between 5:37 and 9:18. After that point, the jobs start seeing each other's processes in the nvidia-smi output. This would seem to indicate that the cgroups got messed up somehow, since I believe nvidia-smi simply shows the NVIDIA GPU devices it can see in /dev/.
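One way to check that theory (a sketch; Slurm's cgroup v1 layout is assumed here, and <jobid> is a placeholder):

```
# On the compute node, see which device nodes the job's cgroup allows:
cat /sys/fs/cgroup/devices/slurm/uid_*/job_<jobid>/devices.list

# From inside the job itself, list the GPU device nodes it can see:
srun --jobid=<jobid> ls -l /dev/nvidia*
```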

So here is my current theory of what is going wrong:

systemd might be resetting cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without that, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061.
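As a sketch (standard systemd tooling; the unit name matches your slurmd.service, but adjust paths for your packaging), you can check the current delegation setting and add `Delegate=yes` via a drop-in:

```
# Check whether systemd is delegating cgroup control to slurmd:
systemctl show slurmd --property=Delegate

# Add Delegate=yes with a drop-in instead of editing the packaged unit.
# `systemctl edit slurmd` opens an editor; add:
#   [Service]
#   Delegate=yes
sudo systemctl edit slurmd
sudo systemctl daemon-reload
sudo systemctl restart slurmd
```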

The commit above fixed the slurmd.service file for 17.11.8+. It’s possible you haven’t updated your copy since then. 

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt ran some package operations, which could have triggered a `systemctl daemon-reload`. It's possible the cgroup device restrictions got wiped out then.
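To confirm (a sketch; standard Ubuntu log paths assumed), you could look for reload and package activity around that time:

```
# Look for daemon-reload activity in the current and rotated syslogs:
grep -i 'daemon-reload' /var/log/syslog /var/log/syslog.1

# See what apt installed or upgraded on Feb 8:
grep -A5 'Start-Date: 2019-02-08' /var/log/apt/history.log
```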

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks,
Michael
Comment 3 Magnus Jonsson 2019-02-12 01:09:47 MST
Seems to have solved the issue.

Thanks.
Comment 4 Michael Hinton 2019-02-12 08:40:49 MST
Ok great. Feel free to reopen this if the issue reappears.

Thanks,
Michael

*** This ticket has been marked as a duplicate of ticket 5292 ***