Created attachment 9116 [details] logs and configuration files. A user is running two (2) jobs on one node at the same time, with one GPU each. After a while (several days) the jobs start seeing both GPUs. The attached files include output from the user's job, with an nvidia-smi loop running in the slurm.batch cgroup (every 4 hours).
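For reference, a minimal sketch of the kind of monitoring loop described above (the user's actual script is in the attachment; this reconstruction is hypothetical, including the log file name):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1

# Log which GPUs are visible from inside this job's cgroup every 4 hours.
# If the device cgroup is intact, only the one allocated GPU should appear.
while true; do
    date
    nvidia-smi
    sleep 4h
done >> gpu-visibility.log 2>&1
```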
Hey Magnus, I'll go ahead and take a look at it. Thanks, Michael
Hey Magnus,

The sbatch logs clearly show coupling between the jobs starting Feb 8, sometime between 5:37 and 9:18. After that time, the jobs start seeing each other's processes in the nvidia-smi output. This would seem to indicate that the cgroups got messed up somehow, since I believe nvidia-smi simply shows the NVIDIA GPU devices it can see in /dev/.

So here is my current theory of what is going wrong: systemd might be resetting the cgroups for the jobs. Does your slurmd.service file have `Delegate=yes`? Without it, if `systemctl daemon-reload` gets called (e.g. by apt), it may wipe out the cgroup device restrictions. See https://github.com/SchedMD/slurm/commit/cecb39ff0 for a complete explanation. See also bug 5292, bug 5300, and bug 5061. The commit above fixed the slurmd.service file for 17.11.8+; it's possible you haven't updated your copy since then.

The only thing that stands out from the logs is in syslog-20190208.b-cn1503: on Feb 8 at 6:16, apt performed some package operations, and it's possible the cgroups got disturbed at that point.

Let me know if adding `Delegate=yes` to slurmd.service solves the problem. If not, we'll probably need to gather more information.

Thanks, Michael
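For reference, here is roughly what the relevant part of slurmd.service should look like after the fix (a minimal sketch; your ExecStart path, PIDFile location, and other directives may differ from this):

```ini
[Service]
Type=forking
ExecStart=/usr/sbin/slurmd
PIDFile=/var/run/slurmd.pid
# Delegate=yes tells systemd that this service manages its own cgroup
# subtree, so a `systemctl daemon-reload` (e.g. one triggered by apt)
# should no longer rewrite the device cgroup restrictions Slurm sets up
# for each job.
Delegate=yes
```

After editing the unit, run `systemctl daemon-reload` and restart slurmd; you can confirm the setting took effect with `systemctl show slurmd -p Delegate`, which should print `Delegate=yes`.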
Seems to have solved the issue. Thanks.
Ok great. Feel free to reopen this if the issue reappears. Thanks, Michael *** This bug has been marked as a duplicate of bug 5292 ***