Bug 6474 - A single job that starts out seeing only one of the GPU cards (nvidia-smi) suddenly starts seeing both cards.
Status: RESOLVED DUPLICATE of bug 5292
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 18.08.4
Hardware: Linux Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-11 03:00 MST by Magnus Jonsson
Modified: 2019-02-12 08:40 MST
CC List: 0 users

See Also:
Site: SNIC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: HPC2N
Linux Distro: Ubuntu
Machine Name: kebnekaise
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
logs and configuration files. (13.45 KB, application/gzip)
2019-02-11 03:00 MST, Magnus Jonsson

Description Magnus Jonsson 2019-02-11 03:00:36 MST
Created attachment 9116
logs and configuration files.

A user is running two (2) jobs on the same node at the same time, with one GPU each.

After a while (several days) the jobs start seeing both GPUs.

The attached files include output from the user's job, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
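
The loop is essentially of this form (a simplified sketch only; the real script and its output are in the attachment):

    #!/usr/bin/env python3
    # Simplified sketch: run nvidia-smi every 4 hours from inside the batch
    # step (so it inherits the slurm.batch cgroup) and append the output,
    # with a timestamp, to a log file.
    import subprocess
    import time

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        with open("nvidia-smi.log", "a") as log:
            log.write(f"=== {stamp} ===\n{result.stdout}\n")
        time.sleep(4 * 3600)  # 4 hours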
Comment 1 Michael Hinton 2019-02-11 10:30:15 MST
Hey Magnus,

I'll go ahead and take a look at it.

Thanks,
Michael
Comment 2 Michael Hinton 2019-02-11 18:25:47 MST
Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting Feb 8, sometime between 5:37 and 9:18. After that point, the jobs start seeing each other's processes in the nvidia-smi output. This suggests the cgroups got disturbed somehow, since I believe nvidia-smi simply reports whichever NVIDIA GPU devices under /dev/ it can access.
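
One quick way to confirm that from inside a job step is something along these lines (a rough sketch, not a supported tool; it just tries to open each /dev/nvidia* node and reports which ones the devices cgroup blocks):

    #!/usr/bin/env python3
    # Rough sketch: from inside a job step, try to open each GPU device node.
    # A GPU that the devices cgroup correctly constrains should fail with
    # EPERM (Operation not permitted); an unconstrained GPU opens fine.
    import errno
    import glob
    import os

    for dev in sorted(glob.glob("/dev/nvidia[0-9]*")):
        try:
            fd = os.open(dev, os.O_RDWR)
            os.close(fd)
            print(f"{dev}: accessible")
        except OSError as e:
            if e.errno == errno.EPERM:
                print(f"{dev}: blocked by cgroup (EPERM)")
            else:
                print(f"{dev}: {e.strerror}")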

So here is my current theory of what is going wrong:

systemd might be resetting cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without that, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061.
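
If you want to double-check the current state, something like this rough sketch might help (it assumes cgroup v1 with ConstrainDevices=yes, so the Slurm hierarchy lives under /sys/fs/cgroup/devices/slurm; adjust the paths for your setup and run it as root):

    #!/usr/bin/env python3
    # Rough sketch (assumes cgroup v1 and ConstrainDevices=yes): show whether
    # systemd delegates cgroup control to slurmd, then dump the device
    # whitelist for each Slurm step so you can spot "c 195:*" (all NVIDIA
    # GPUs) reappearing after a daemon-reload.
    import glob
    import subprocess

    # Delegate= setting systemd currently has for the slurmd unit.
    out = subprocess.run(["systemctl", "show", "slurmd", "--property=Delegate"],
                         capture_output=True, text=True)
    print(out.stdout.strip())  # e.g. "Delegate=yes"

    # Device whitelists for Slurm-managed cgroups (cgroup v1 path assumed).
    pattern = "/sys/fs/cgroup/devices/slurm/uid_*/job_*/step_*/devices.list"
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rules = f.read().strip().replace("\n", "; ")
        print(f"{path}: {rules}")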

The commit above fixed the slurmd.service file for 17.11.8+. It’s possible you haven’t updated your copy since then. 

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt ran some package operations. It's possible that triggered a daemon-reload and reset the cgroups at that point.

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks,
Michael
Comment 3 Magnus Jonsson 2019-02-12 01:09:47 MST
Seems to have solved the issue.

Thanks.
Comment 4 Michael Hinton 2019-02-12 08:40:49 MST
Ok great. Feel free to reopen this if the issue reappears.

Thanks,
Michael

*** This bug has been marked as a duplicate of bug 5292 ***