Ticket 6474

Summary: A single job that starts out seeing only one of the GPU cards (nvidia-smi) suddenly starts seeing both cards.
Product: Slurm    Reporter: Magnus Jonsson <magnus>
Component: slurmstepd    Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: SNIC
SNIC sites: HPC2N
Linux Distro: Ubuntu
Machine Name: kebnekaise
Attachments: logs and configuration files.

Description Magnus Jonsson 2019-02-11 03:00:36 MST
Created attachment 9116 [details]
logs and configuration files.

A user is running two (2) jobs at the same time on one node, each with one GPU.

After a while (several days), the jobs start seeing both GPUs.

The attached files include output from the user's jobs, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
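For concreteness, the monitoring loop was presumably something along these lines (a sketch only; the actual script is in attachment 9116):

```
# Run nvidia-smi every 4 hours from inside the batch job, so the output
# reflects the device restrictions of the job's cgroup.
while true; do
    date
    nvidia-smi
    sleep 4h
done
```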
Comment 1 Michael Hinton 2019-02-11 10:30:15 MST
Hey Magnus,

I'll go ahead and take a look at it.

Thanks,
Michael
Comment 2 Michael Hinton 2019-02-11 18:25:47 MST
Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting Feb 8, sometime between 5:37 and 9:18. After that point, the jobs start seeing each other's processes in the nvidia-smi output. This would seem to indicate that the cgroups got messed up somehow, since I believe nvidia-smi simply shows the NVIDIA GPU devices it can see in /dev/.
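One way to check that theory (a sketch; Slurm's cgroup v1 layout is assumed here, and <jobid> is a placeholder):

```
# On the compute node, see which device nodes the job's cgroup allows:
cat /sys/fs/cgroup/devices/slurm/uid_*/job_<jobid>/devices.list

# From inside the job itself, list the GPU device nodes it can see:
srun --jobid=<jobid> ls -l /dev/nvidia*
```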

So here is my current theory of what is going wrong:

systemd might be resetting cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without that, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061.
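As a sketch (standard systemd tooling; the unit name matches your slurmd.service, but adjust paths for your packaging), you can check the current delegation setting and add `Delegate=yes` via a drop-in:

```
# Check whether systemd is delegating cgroup control to slurmd:
systemctl show slurmd --property=Delegate

# Add Delegate=yes with a drop-in instead of editing the packaged unit.
# `systemctl edit slurmd` opens an editor; add:
#   [Service]
#   Delegate=yes
sudo systemctl edit slurmd
sudo systemctl daemon-reload
sudo systemctl restart slurmd
```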

The commit above fixed the slurmd.service file for 17.11.8+. It’s possible you haven’t updated your copy since then. 

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt ran some package operations, which could have triggered a `systemctl daemon-reload`. It's possible the cgroup device restrictions got wiped out then.
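To confirm (a sketch; standard Ubuntu log paths assumed), you could look for reload and package activity around that time:

```
# Look for daemon-reload activity in the current and rotated syslogs:
grep -i 'daemon-reload' /var/log/syslog /var/log/syslog.1

# See what apt installed or upgraded on Feb 8:
grep -A5 'Start-Date: 2019-02-08' /var/log/apt/history.log
```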

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks,
Michael
Comment 3 Magnus Jonsson 2019-02-12 01:09:47 MST
Seems to have solved the issue.

Thanks.
Comment 4 Michael Hinton 2019-02-12 08:40:49 MST
Ok great. Feel free to reopen this if the issue reappears.

Thanks,
Michael

*** This ticket has been marked as a duplicate of ticket 5292 ***