Bug 6474 - A single job that starts out seeing only one of the GPU cards (nvidia-smi) suddenly starts seeing both cards.
Status: RESOLVED DUPLICATE of bug 5292
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 18.08.4
Hardware: Linux Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-11 03:00 MST by Magnus Jonsson
Modified: 2019-02-12 08:40 MST
CC List: 0 users

See Also:
Site: SNIC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: HPC2N
Linux Distro: Ubuntu
Machine Name: kebnekaise
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
logs and configuration files. (13.45 KB, application/gzip)
2019-02-11 03:00 MST, Magnus Jonsson

Description Magnus Jonsson 2019-02-11 03:00:36 MST
Created attachment 9116
logs and configuration files.

A user is running two (2) jobs on the same node at the same time, with one GPU each.

After a while (several days) the jobs start seeing both GPUs.

The attached files include output from the user's job, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
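
The loop is essentially of this form (a simplified sketch only; the real script and its output are in the attachment):

    #!/usr/bin/env python3
    # Simplified sketch: run nvidia-smi every 4 hours from inside the batch
    # step (so it inherits the slurm.batch cgroup) and append the output,
    # with a timestamp, to a log file.
    import subprocess
    import time

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        with open("nvidia-smi.log", "a") as log:
            log.write(f"=== {stamp} ===\n{result.stdout}\n")
        time.sleep(4 * 3600)  # 4 hours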
Comment 1 Michael Hinton 2019-02-11 10:30:15 MST
Hey Magnus,

I'll go ahead and take a look at it.

Thanks,
Michael
Comment 2 Michael Hinton 2019-02-11 18:25:47 MST
Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting Feb 8, sometime between 5:37 and 9:18. After that point, the jobs start seeing each other's processes in the nvidia-smi output. This suggests the cgroups got disturbed somehow, since I believe nvidia-smi simply reports whichever NVIDIA GPU devices under /dev/ it can access.
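
One quick way to confirm that from inside a job step is something along these lines (a rough sketch, not a supported tool; it just tries to open each /dev/nvidia* node and reports which ones the devices cgroup blocks):

    #!/usr/bin/env python3
    # Rough sketch: from inside a job step, try to open each GPU device node.
    # A GPU that the devices cgroup correctly constrains should fail with
    # EPERM (Operation not permitted); an unconstrained GPU opens fine.
    import errno
    import glob
    import os

    for dev in sorted(glob.glob("/dev/nvidia[0-9]*")):
        try:
            fd = os.open(dev, os.O_RDWR)
            os.close(fd)
            print(f"{dev}: accessible")
        except OSError as e:
            if e.errno == errno.EPERM:
                print(f"{dev}: blocked by cgroup (EPERM)")
            else:
                print(f"{dev}: {e.strerror}")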

So here is my current theory of what is going wrong:

systemd might be resetting cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without that, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061.
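
If you want to double-check the current state, something like this rough sketch might help (it assumes cgroup v1 with ConstrainDevices=yes, so the Slurm hierarchy lives under /sys/fs/cgroup/devices/slurm; adjust the paths for your setup and run it as root):

    #!/usr/bin/env python3
    # Rough sketch (assumes cgroup v1 and ConstrainDevices=yes): show whether
    # systemd delegates cgroup control to slurmd, then dump the device
    # whitelist for each Slurm step so you can spot "c 195:*" (all NVIDIA
    # GPUs) reappearing after a daemon-reload.
    import glob
    import subprocess

    # Delegate= setting systemd currently has for the slurmd unit.
    out = subprocess.run(["systemctl", "show", "slurmd", "--property=Delegate"],
                         capture_output=True, text=True)
    print(out.stdout.strip())  # e.g. "Delegate=yes"

    # Device whitelists for Slurm-managed cgroups (cgroup v1 path assumed).
    pattern = "/sys/fs/cgroup/devices/slurm/uid_*/job_*/step_*/devices.list"
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rules = f.read().strip().replace("\n", "; ")
        print(f"{path}: {rules}")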

The commit above fixed the slurmd.service file for 17.11.8+. It’s possible you haven’t updated your copy since then. 

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt ran some package operations. It's possible that triggered a daemon-reload and reset the cgroups at that point.

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks,
Michael
Comment 3 Magnus Jonsson 2019-02-12 01:09:47 MST
Seems to have solved the issue.

Thanks.
Comment 4 Michael Hinton 2019-02-12 08:40:49 MST
Ok great. Feel free to reopen this if the issue reappears.

Thanks,
Michael

*** This bug has been marked as a duplicate of bug 5292 ***