Summary: | A single job that starts out seeing only one of the GPU cards (nvidia-smi) suddenly starts seeing both cards. | | |
---|---|---|---|
Product: | Slurm | Reporter: | Magnus Jonsson <magnus> |
Component: | slurmstepd | Assignee: | Director of Support <support> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | | |
Priority: | --- | | |
Version: | 18.08.4 | | |
Hardware: | Linux | | |
OS: | Linux | | |
Site: | SNIC | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave Sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | HPC2N | Linux Distro: | Ubuntu |
Machine Name: | kebnekaise | CLE Version: | |
Version Fixed: | | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | logs and configuration files. | | |
Hey Magnus,

I'll go ahead and take a look at it.

Thanks,
Michael

Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting on Feb 8, sometime between 5:37 and 9:18. After that time, the jobs start seeing each other's processes in the nvidia-smi output. This would seem to indicate that the cgroups got messed up somehow, since I believe nvidia-smi simply shows the NVIDIA GPU devices it sees in /dev/.

So here is my current theory of what is going wrong: systemd might be resetting the cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without that, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out the cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061.

The commit above fixed the slurmd.service file for 17.11.8+. It's possible you haven't updated your copy since then.

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt performed some package operations. It's possible the cgroups got reset then.

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks,
Michael

Seems to have solved the issue. Thanks.

Ok, great. Feel free to reopen this if the issue reappears.

Thanks,
Michael

*** This ticket has been marked as a duplicate of ticket 5292 ***
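For reference, the suggested fix amounts to one line in the systemd unit. A minimal sketch of a slurmd.service with `Delegate=yes` follows; the `ExecStart` path, `PIDFile`, and unit ordering shown here are illustrative of a typical install, not taken from this site's configuration:

```ini
[Unit]
Description=Slurm node daemon
After=network.target munge.service

[Service]
Type=forking
ExecStart=/usr/sbin/slurmd
PIDFile=/var/run/slurmd.pid
# Delegate=yes tells systemd this unit manages its own cgroup subtree.
# Without it, a `systemctl daemon-reload` (triggered e.g. by apt) can
# rewrite the device cgroups slurmstepd created for running jobs, letting
# those jobs see all GPUs again.
Delegate=yes

[Install]
WantedBy=multi-user.target
```

After editing the unit file, `systemctl daemon-reload` followed by `systemctl restart slurmd` makes the change take effect.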
Created attachment 9116 [details]
logs and configuration files.

A user is running two (2) jobs on one node at the same time, with one GPU each. After a while (several days) the jobs start seeing both GPUs.

The attached files include output from the user's jobs, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
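The user's exact monitoring script is not included in this ticket, but the loop described above could be sketched roughly as follows; `GPU_CMD` and the 4-hour `INTERVAL` are illustrative assumptions, not details from the attachment:

```shell
#!/bin/sh
# Hypothetical sketch of the nvidia-smi monitoring loop described above.
GPU_CMD="${GPU_CMD:-nvidia-smi -L}"   # lists the GPUs this process can see
INTERVAL="${INTERVAL:-4h}"            # every 4 hours, as in the report

# Print a timestamp followed by the GPUs currently visible to this cgroup.
log_gpus_once() {
    date
    $GPU_CMD
}

# Uncomment to run continuously inside the job, as the user did:
# while :; do log_gpus_once; sleep "$INTERVAL"; done
```

A sudden change in the `nvidia-smi -L` output from one device to two would pinpoint when the job's device cgroup restriction was lost.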