Created attachment 15120 [details]
gres.conf file

I am seeing the following issue with my GPU nodes:

error: _slurm_rpc_node_registration node=n0000.es1: Invalid argument

When trying to restart slurmd on the node I got the following messages:

[2020-07-21T13:36:56.694] agent/is_node_resp: node:n0000.es1 RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-07-21T13:36:57.032] error: slurm_receive_msgs: Socket timed out on send/recv operation
Created attachment 15121 [details] slurm.conf
Hi Jackie,

It looks like the first error message you sent is from the slurmctld log and the second messages are from the slurmd log. Is that right? It looks like slurmd is sending a request to terminate a job. Was there a job running when slurmd was stopped on the node (that you're aware of)? Can you send the full slurmctld.log and slurmd.log (for this node) files for me to review?

Thanks,
Ben
Right now we are in shutdown mode. I just upgraded Slurm yesterday. There were no jobs running, nor are any showing in the queue as running. Everything was shut down completely before the upgrade. I can send the logs, but I don't think they will say much. All of the errors I have seen were in slurmctld.log, not slurmd.log. The node is not reporting anything, and even stopping and restarting slurmd on the node complains of the RPC issue.

Jackie
Ok, if that's the case I would like to see you try to start slurmd in verbose debug mode (slurmd -Dvvv) on n0000.es1. Send the output that generates, along with the slurmctld logs covering the time that you tried to start slurmd this way.

Thanks,
Ben
I think I may have found the problem job:

scontrol completing
JobId=25432802 EndTime=2020-07-21T12:28:13 CompletingTime=02:10:21 Nodes(COMPLETING)=n0000.es1

This job was a test job from the testing period. It is stuck in CG state.

squeue --partition=es1
   JOBID PARTITION  NAME     USER ST  TIME NODES NODELIST(REASON)
25432802       es1  bash  kmuriki CG  0:48     1 n0000.es1

Could this be the troubled job?

Jackie
Created attachment 15123 [details]
slurmd-Dvvv.out

Here is a copy of the output from slurmd -Dvvv. It might have to do with a SPANK plugin we have implemented. Just check and see what you find.

Thanks,
Jackie
This looks like it might be an issue with the node configuration matching up with the resources as they are detected. Can you try adding "SlurmdParameters=config_overrides" to your slurm.conf, restarting the controller, and then restarting slurmd on this node? If that lets it come up correctly then I would like to have you run 'slurmd -C' on the node to see how it shows the resources and whether there are differences with what you have in your slurm.conf. Thanks, Ben
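Comparing the `slurmd -C` output against the node's slurm.conf definition can also be scripted. Below is a minimal sketch (not a SchedMD tool): it assumes both sides are space-separated `Key=Value` strings, as in `slurmd -C` output and a `NodeName=` line; the mismatched RealMemory value in the sample is hypothetical, for illustration only.

```python
# Sketch: diff the resources slurmd detects against the slurm.conf definition.
# Both inputs are space-separated Key=Value strings.

def parse_kv(line):
    """Parse 'Key=Value Key=Value ...' into a dict of strings."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

def config_mismatches(detected_line, conf_line):
    """Return {key: (detected, configured)} for keys present on both sides that disagree."""
    detected = parse_kv(detected_line)
    configured = parse_kv(conf_line)
    return {k: (detected[k], configured[k])
            for k in detected.keys() & configured.keys()
            if detected[k] != configured[k]}

# Detected side taken from this ticket; conf side deliberately wrong to show a hit.
detected = ("NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 "
            "CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319")
conf = ("NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 "
        "CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64000")

print(config_mismatches(detected, conf))  # → {'RealMemory': ('64319', '64000')}
```

Run once per node against the matching `NodeName=` line; an empty dict means the definition should register cleanly without config_overrides.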
Created attachment 15124 [details]
all.realmem

I ran slurmd -C this morning on all of the nodes, grabbed the output, and put it into the slurm.conf file. I am sending you a copy of all of the slurmd -C output so you can compare it against the slurm.conf file.

Jackie
Created attachment 15125 [details] slurm.conf
The parameter did allow the node to come up without any complaints. I just sent you a copy of the slurm.conf file and all of the nodes' output from slurmd -C.

By the way, why is RealMemory needed for each of the nodes? Why was this a change to the configuration? In the past the nodes would know their memory and would just report it; now we have to include it in slurm.conf. I had to add 151 entries today because of that.

Thanks,
Jackie
Here is also what I see on the nodes and master:

scontrol show node n0001.es1
NodeName=n0001.es1 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=8 CPULoad=0.01
   AvailableFeatures=es1_1080ti,es1
   ActiveFeatures=es1_1080ti,es1
   Gres=gpu:GTX1080TI:4
   NodeAddr=10.0.43.1 NodeHostName=n0001.es1 Version=20.02.3
   OS=Linux 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 10:32:27 CDT 2020
   RealMemory=64319 AllocMem=0 FreeMem=59178 Sockets=2 Boards=1
   State=MAINT+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=es1
   BootTime=2020-07-20T12:16:14 SlurmdStartTime=2020-07-21T13:38:58
   CfgTRES=cpu=8,mem=64319M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu count reported lower than configured (0 < 4) [slurm@2020-07-21T13:42:45]

[root@master ~]# ssh n0000.es1 slurmd -C
NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319
UpTime=0-01:00:24

Is the gres.conf file set up correctly for 20.02?

Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4

nvidia-smi (Tue Jul 21 15:42:15 2020; driver 440.44, CUDA 10.2) shows all four GPUs present and idle:

GPU 0: GeForce GTX 1080 Ti, 00000000:02:00.0, 0MiB / 11178MiB, 0% util
GPU 1: GeForce GTX 1080 Ti, 00000000:03:00.0, 0MiB / 11178MiB, 0% util
GPU 2: GeForce GTX 1080 Ti, 00000000:81:00.0, 0MiB / 11178MiB, 0% util
GPU 3: GeForce GTX 1080 Ti, 00000000:82:00.0, 0MiB / 11178MiB, 2% util
No running processes found.

Jackie
Ok. Time is going by and I wanted to know if you wanted to set up another Zoom session to get this working?

Jackie
Ok, adding that parameter has allowed the node to stay up and running. Will I be able to keep that in the configuration file? I really would like to keep it.

Things are looking better. After restarting slurmd on the nodes and adding those parameters, restarting slurmctld is actually working.

Jackie
I think we have everything working now. That change allowed us to continue on, and everything else is working. Just curious whether the SPANK messages were just informational or an issue.

Thanks,
Jackie Scoggins
Hi Jackie,

My apologies. It was getting late for me, so when I saw your message that you were able to get things up with the config_overrides parameter I signed off for the day. I should have sent a note that I was doing so.

Specifying RealMemory has always been a requirement, but in the past you were able to specify something close and get around the requirement to have it match by using the FastSchedule=0 parameter. This has been deprecated, as you're aware, which means the RealMemory specification now has to match the detected value on the system. You can leave the RealMemory specification undefined, but then it will set the memory on the node to 1, preventing you from requesting memory on the node.

You can leave the config_overrides parameter, but it is better to have the node configuration match and come up without it. Looking at the file you sent with the RealMemory you got from slurmd -C, it does look like it matches. It's possible there is some other setting it doesn't like. Can you send the full output of slurmd -C for node n0000.es1 so I can compare it to your slurm.conf?

Thanks,
Ben
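For reference, a node definition consistent with what this cluster's es1 nodes report would look roughly like the line below. This is a sketch: the CPU/memory values are the ones shown earlier in this ticket, and the exact attribute set for each node should come from that node's own `slurmd -C` output rather than this example.

```
NodeName=n0000.es1 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319 Gres=gpu:GTX1080TI:4
```

With RealMemory matching the detected value like this, the node should register without needing SlurmdParameters=config_overrides.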
ssh n0000.es1 slurmd -C
NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319
UpTime=0-18:58:40

Jackie
Thanks for confirming that. It looks like the other settings do line up.

I looked again at your gres.conf file and I think there might be a couple of problems contributing to this. In your gres.conf you have an extra 'G' at the end of the count of your GPUs:
Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4G

I would recommend removing the 'G' from the count to make it match what's in your slurm.conf (Gres=gpu:GTX1080TI:4). I would also recommend defining the device files associated with your GPUs in the gres.conf. That would look something like this:
File=/dev/nvidia[0-3]

You can read a more thorough description of this parameter in the documentation here:
https://slurm.schedmd.com/gres.conf.html#OPT_File

With those changes you could try removing the config_override parameter and restarting to see if it comes up correctly.

Thanks,
Ben
Hi Jackie, I wanted to follow up and make sure my recommendation helped. Were you able to get the gres definitions working without the config_override parameter? Thanks, Ben
No, I did not. It is still broken. When I remove that variable, the gres breaks. Any recommendations?

Thanks,
Jackie
Hi Jackie,

Yes, I think this is due to the fact that you have an extra 'G' appended to the count value in your gres.conf:
Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4G

I've tested the same scenario, where I add a 'G' to the count of my GRES, and the nodes fail to come up unless I add the config_override parameter. The log entry I see in the slurmd logs is this:
[2020-08-05T09:20:17.179] fatal: Invalid GRES record for gpu, count does not match File value

My recommendation is to modify your gres.conf file so the entries have a count value that matches what is in your slurm.conf. For the example entry I used above, it should look like this:
Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4

I would also recommend defining the device files associated with your GPUs in the gres.conf. That would look something like this:
File=/dev/nvidia[0-3]

I don't think this is causing a failure, but it is beneficial to have set. You can read a more thorough description of this parameter in the documentation here:
https://slurm.schedmd.com/gres.conf.html#OPT_File

Let me know if that helps.

Thanks,
Ben
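This failure mode can be caught before a restart with a small sanity check. The sketch below is not part of Slurm; it only assumes gres.conf lines of the form shown above and a slurm.conf Gres= token of the form gpu:&lt;Type&gt;:&lt;count&gt;.

```python
import re

# Sketch: verify each gres.conf Count is a plain integer and matches the
# count embedded in the corresponding slurm.conf Gres= token.

def gres_count(gres_conf_line):
    """Extract the Count value from a gres.conf line; reject non-integer counts."""
    m = re.search(r"\bCount=(\S+)", gres_conf_line)
    value = m.group(1)
    if not value.isdigit():
        raise ValueError(f"Count={value!r} is not a plain integer "
                         "(a suffix like 'G' makes slurmd reject the GRES record)")
    return int(value)

def slurm_gres_count(gres_token, gres_type):
    """Extract the count for a given Type from a token like 'gpu:GTX1080TI:4'."""
    name, typ, count = gres_token.split(":")
    assert typ == gres_type, f"type mismatch: {typ} vs {gres_type}"
    return int(count)

good = "Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4"
assert gres_count(good) == slurm_gres_count("gpu:GTX1080TI:4", "GTX1080TI")

bad = "Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4G"
try:
    gres_count(bad)
except ValueError as e:
    print(e)  # the stray 'G' is exactly what the slurmd 'fatal' message complains about
```

Looping this over each gres.conf line and its matching node definition would flag the bad entry without taking any node offline.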
Created attachment 15323 [details]
gres.conf

Hello Ben,

I don't have a 'G' at the end of my gres.conf file. I'm not sure where you see that. I will share my file with you. I will consider the File= suggestion.

Jackie
Ok, the gres.conf file you sent on Jul 21 was where I saw that the counts had a 'G' on the end. It looks like you've removed it since then, which is good. Have you tried starting without the config_override parameter since making that change? I understand if you can't try it outside of a maintenance period. Thanks, Ben
Yes, I did. And it did not work. I commented it out and restarted slurmctld and slurmd, and immediately the gres servers in es1 went offline again due to the count value.

Jackie
Do you happen to have the slurmd logs from one of the nodes from the last attempt to start it without config_override? If not when do you anticipate being able to try it again? Thanks, Ben
Hi Jackie, I just wanted to check in with you and see if you still have the slurmd logs from the last time you tried starting them without config_override. Let me know if you're planning on trying to remove this setting at some point. Thanks, Ben
No, I did not. I tried to find the logs but was unable to pinpoint them. If you'd like, I could do a test on Monday or Tuesday. If you're available, let me know and you can watch me test it via Zoom. What day works best for you? I will schedule a Zoom meeting.

Thanks,
Jackie
Hi Jackie,

This should be an issue we can work on without scheduling a Zoom meeting. You sent a copy of your gres.conf file on Aug 05. Does the file still look like it did then? The main difference between that file and the previous version is that you removed the 'G' from the end of each of the counts, which is what I would expect. We did discuss in comment 17 specifying a File for the 'es' nodes that have GPUs. Nvidia uses a special device file at /dev/nvidia[0-?] for the GPUs. I would recommend specifying the file in your gres.conf for the GPU nodes. For example, n0000.es[1] had 4 GPUs configured, so it would look like this:

Nodename=n0000.es[1] Type=GTX1080TI Name=gpu Count=4 File=/dev/nvidia[0-3]

When you are able to try without the 'config_override' flag, if you still have a failure I would like to have you tar up and send the following:
slurm.conf
gres.conf
slurmd.log (from a node that had a failure)
slurmctld.log

Thanks,
Ben
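Before restarting slurmd with the File= line above, it can help to confirm that the number of /dev/nvidia device files on the node actually matches the Count= in gres.conf. A minimal sketch of such a check; count_devs is a hypothetical helper (not Slurm tooling), and the demonstration runs against a throwaway directory standing in for /dev:

```shell
# Count device files in a directory matching an extended-regex pattern.
# On a real GPU node you would call: count_devs /dev 'nvidia[0-9]+'
# and compare the result to the Count= value in gres.conf.
count_devs() {
  ls "$1" 2>/dev/null | grep -Ec "^$2$"
}

# Demonstration: fake /dev with 4 GPU device files plus control files
# (nvidiactl and nvidia-uvm must not be counted as GPUs).
tmp=$(mktemp -d)
touch "$tmp"/nvidia0 "$tmp"/nvidia1 "$tmp"/nvidia2 "$tmp"/nvidia3 \
      "$tmp"/nvidiactl "$tmp"/nvidia-uvm
n=$(count_devs "$tmp" 'nvidia[0-9]+')
echo "detected $n GPU device files"
rm -rf "$tmp"
```

If the number printed disagrees with Count=4, slurmd will register fewer (or more) GPUs than the controller expects, which is exactly the kind of mismatch that produces the "Invalid argument" registration errors.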
Got it, thanks! I will try and schedule something for next week.

Jackie
I am also seeing similar behavior after upgrading to v20 from v18. When the node's detected config doesn't match what's expected (i.e., lower than expected memory or lower than expected number of GPUs), Slurm kicks out these invalid argument entries. I agree that the config should match, but it would be better to throw one error in the log and report the node through sinfo -R as being damaged, rather than incessantly throwing errors in the logs every second.
Hi Jackie, I'm just checking back in to see if you have been able to try removing the 'config_override' flag. Thanks, Ben
I have tried removing it. When I did, the nodes once again started to change state, so I put it back. Is there any issue with leaving it on? If you'd like to dig into the reason this keeps happening, I'd be happy to schedule a downtime to do this. How much time would you need, and what information would you like me to gather during the test period?

Thanks,
Jackie Scoggins
Hi Jackie,

I'm sorry I didn't respond to you sooner; I let your update fall through the cracks. It should be OK as it is. If you would like to try updating this at some point in the future, please let us know and we can work with you to narrow down what's causing the nodes to fail without 'config_override' set.

Thanks,
Ben