Bug 9441 - slurm_rpc_node_registration
Summary: slurm_rpc_node_registration
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.3
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-07-21 14:40 MDT by Wei Feinstein
Modified: 2020-10-08 15:19 MDT
CC List: 1 user

See Also:
Site: LBNL - Lawrence Berkeley National Laboratory
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gres.conf file (7.77 KB, text/plain)
2020-07-21 14:40 MDT, Wei Feinstein
Details
slurm.conf (35.97 KB, text/plain)
2020-07-21 14:41 MDT, Wei Feinstein
Details
slurmd-Dvvv.out (51.84 KB, application/octet-stream)
2020-07-21 15:50 MDT, Wei Feinstein
Details
all.realmem (128.88 KB, application/octet-stream)
2020-07-21 16:32 MDT, Wei Feinstein
Details
slurm.conf (35.97 KB, application/octet-stream)
2020-07-21 16:32 MDT, Wei Feinstein
Details
gres.conf (7.64 KB, application/octet-stream)
2020-08-05 11:38 MDT, Wei Feinstein
Details

Description Wei Feinstein 2020-07-21 14:40:40 MDT
Created attachment 15120 [details]
gres.conf file

I am seeing the following issue with my GPU nodes -
error: _slurm_rpc_node_registration node=n0000.es1: Invalid argument

When trying to restart the slurmd on the node I got the following message:
[2020-07-21T13:36:56.694] agent/is_node_resp: node:n0000.es1 RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-07-21T13:36:57.032] error: slurm_receive_msgs: Socket timed out on send/recv operation
Comment 1 Wei Feinstein 2020-07-21 14:41:03 MDT
Created attachment 15121 [details]
slurm.conf
Comment 2 Ben Roberts 2020-07-21 14:56:35 MDT
Hi Jackie,

It looks like the first error message you sent is from the slurmctld logs and the second messages are from the slurmd logs.  Is that right?  It looks like slurmd is sending a request to terminate a job.  Was there a job running when slurmd was stopped on the node (that you're aware of)?  Can you send the full slurmctld.log and slurmd.log (for this node) files for me to review?

Thanks,
Ben
Comment 3 Wei Feinstein 2020-07-21 15:13:42 MDT
Right now we are in shutdown mode. I just upgraded Slurm yesterday. There
were no jobs running, nor were any showing in the queue as running.
Everything was shut down completely before the upgrade. I can send the
logs, but I don't think they will say much. All of the errors I have seen
were in slurmctld.log, not slurmd.log. The node is not reporting anything,
and even stopping and restarting slurmd on the node complains of the RPC
issue.

Jackie

Comment 4 Ben Roberts 2020-07-21 15:38:45 MDT
Ok, if that's the case I would like to see you try to start slurmd in verbose debug mode (slurmd -Dvvv) on n0000.es1.  Send the output it generates, along with the slurmctld logs covering the time you tried to start slurmd this way.
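For reference, a minimal sketch of capturing that output (the output file path here is just an example):

slurmd -Dvvv > /tmp/slurmd-Dvvv.out 2>&1

With -D, slurmd stays in the foreground and logs to the terminal, so Ctrl-C will stop it once the registration error has been reproduced.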

Thanks,
Ben
Comment 5 Wei Feinstein 2020-07-21 15:40:30 MDT
I think I may have found the problem job -

scontrol completing

JobId=25432802 EndTime=2020-07-21T12:28:13 CompletingTime=02:10:21
Nodes(COMPLETING)=n0000.es1


This job was a test job from the testing period. It is stuck in the CG state.


squeue --partition=es1

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          25432802       es1     bash  kmuriki CG       0:48      1 n0000.es1


Could this be the troubled job?


Jackie

Comment 6 Wei Feinstein 2020-07-21 15:50:07 MDT
Created attachment 15123 [details]
slurmd-Dvvv.out

Here is a copy of the output from slurmd -Dvvv

It might have to do with a SPANK plugin we have implemented. Please take a
look and see what you find.

Thanks

Jackie

Comment 7 Ben Roberts 2020-07-21 16:20:58 MDT
This looks like it might be an issue with the node configuration matching up with the resources as they are detected.  Can you try adding "SlurmdParameters=config_overrides" to your slurm.conf, restarting the controller, and then restarting slurmd on this node?  If that lets it come up correctly then I would like to have you run 'slurmd -C' on the node to see how it shows the resources and whether there are differences with what you have in your slurm.conf.
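For reference, a minimal sketch of that change (assuming slurm.conf is at /etc/slurm/slurm.conf and the daemons run under systemd):

# add to /etc/slurm/slurm.conf
SlurmdParameters=config_overrides

# on the controller
systemctl restart slurmctld
# on n0000.es1
systemctl restart slurmd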

Thanks,
Ben
Comment 8 Wei Feinstein 2020-07-21 16:32:29 MDT
Created attachment 15124 [details]
all.realmem

I just ran slurmd -C this morning on all of the nodes, grabbed the output,
and put the values in the slurm.conf file.

I will send you a copy of all of the slurmd -C output so you can compare it
with the slurm.conf file.

Jackie


Comment 9 Wei Feinstein 2020-07-21 16:32:29 MDT
Created attachment 15125 [details]
slurm.conf
Comment 10 Wei Feinstein 2020-07-21 16:34:34 MDT
The parameter did allow the node to come up without any complaints.

I just sent you a copy of the slurm.conf file and all of the nodes' output
from slurmd -C.

Btw, why is RealMemory needed for each of the nodes? Why was this changed
in the configuration? In the past each node would detect its memory and
simply report it; now we have to include it in slurm.conf. I had to add 151
entries today because of that.

Thanks

Jackie

Comment 11 Wei Feinstein 2020-07-21 16:43:14 MDT
Here is also what I see on the nodes and master

scontrol show node n0001.es1

NodeName=n0001.es1 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=8 CPULoad=0.01
   AvailableFeatures=es1_1080ti,es1
   ActiveFeatures=es1_1080ti,es1
   Gres=gpu:GTX1080TI:4
   NodeAddr=10.0.43.1 NodeHostName=n0001.es1 Version=20.02.3
   OS=Linux 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 10:32:27 CDT 2020
   RealMemory=64319 AllocMem=0 FreeMem=59178 Sockets=2 Boards=1
   State=MAINT+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=es1
   BootTime=2020-07-20T12:16:14 SlurmdStartTime=2020-07-21T13:38:58
   CfgTRES=cpu=8,mem=64319M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu count reported lower than configured (0 < 4) [slurm@2020-07-21T13:42:45]

[root@master ~]# ssh n0000.es1 slurmd -C

NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319
UpTime=0-01:00:24

Is the gres.conf file setup correctly with 20.02?

Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4

nvidia-smi

Tue Jul 21 15:42:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   50C    P0    63W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   39C    P0    60W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
| 25%   39C    P0    55W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 21%   37C    P0    60W / 250W |      0MiB / 11178MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Jackie

Comment 12 Wei Feinstein 2020-07-21 17:08:20 MDT
OK, time is going by and I wanted to know if you would like to set up
another Zoom session to get this working.

Jackie

Comment 13 Wei Feinstein 2020-07-21 17:28:00 MDT
Ok adding that parameter has allowed for the node to stay up and running.
Will I be able to keep that in the configuration file. I really would like
to be able to keep it.

Things are looking better. After restarting slurmd on the nodes and adding
those parameters, restarting slurmctld actually is working.


Jackie

Comment 14 Wei Feinstein 2020-07-21 19:40:54 MDT
I think we have everything working now. That change allowed us to continue
on. Everything else is working. Just curious whether the SPANK messages
were purely informational or an actual issue.

Thanks

Jackie Scoggins

Comment 15 Ben Roberts 2020-07-22 08:54:33 MDT
Hi Jackie,

My apologies.  It was getting late for me, so when I saw your message that you were able to get things up with the config_overrides parameter I signed off for the day.  I should have sent a note that I was doing so.

Specifying the RealMemory has always been a requirement, but in the past you were able to specify something close and get around the requirement to have it match by using the FastSchedule=0 parameter.  This has been deprecated, as you're aware, so now the RealMemory specification has to match the value detected on the system.  You can leave the RealMemory specification undefined, but then it will set the memory on the node to 1, preventing you from requesting memory on the node.  
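As an illustration only, using the values reported elsewhere in this ticket, a node definition that lines up with slurmd -C would look roughly like this (the Gres value is taken from the scontrol show node output earlier in the ticket, not from slurmd -C):

NodeName=n0000.es1 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319 Gres=gpu:GTX1080TI:4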

You can leave the config_overrides parameter, but it is better to have the node configuration match and come up without it.  Looking at the file you sent with the RealMemory you got from slurmd -C it does look like it matches.  It's possible there is some other setting it doesn't like.  Can you send the full output of slurmd -C for node n0000.es1 so I can compare that to your slurm.conf?

Thanks,
Ben
Comment 16 Wei Feinstein 2020-07-22 11:12:42 MDT
ssh n0000.es1 slurmd -C

NodeName=n0000 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64319
UpTime=0-18:58:40


Jackie

Comment 17 Ben Roberts 2020-07-22 12:45:28 MDT
Thanks for confirming that.  It looks like the other settings do line up.  

I looked again at your gres.conf file and I think there might be a couple problems contributing to this.  In your gres.conf you have an extra 'G' at the end of the count of your gpus:
Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4G

I would recommend removing the 'G' from the count to make it match what's in your slurm.conf (Gres=gpu:GTX1080TI:4).  

I would also recommend defining the device files associated with your GPUs in the gres.conf.  That would look something like this:
File=/dev/nvidia[0-3]
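Putting the two suggestions together, the entry for this node would look something like this (the device paths assume the standard NVIDIA naming under /dev):

Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4 File=/dev/nvidia[0-3]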

You can read a more thorough description of this parameter in the documentation here:
https://slurm.schedmd.com/gres.conf.html#OPT_File

With those changes you could try removing the config_override parameter and restarting to see if it comes up correctly.

Thanks,
Ben
Comment 18 Ben Roberts 2020-08-04 13:09:55 MDT
Hi Jackie,

I wanted to follow up and make sure my recommendation helped.  Were you able to get the gres definitions working without the config_override parameter?  

Thanks,
Ben
Comment 19 Wei Feinstein 2020-08-04 14:13:54 MDT
No, I did not.

It is still broken. When I remove that variable the gres breaks.

Any recommendations?

Thanks

Jackie

Comment 20 Ben Roberts 2020-08-05 08:30:48 MDT
Hi Jackie,

Yes, I think this is due to the fact that you have an extra 'G' appended to the count value in your gres.conf:
Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4G

I've tested the same scenario, where I add a 'G' to the count of my GRES, and the nodes fail to come up unless I add the config_override parameter.  The log entry I see in the slurmd logs is this:
[2020-08-05T09:20:17.179] fatal: Invalid GRES record for gpu, count does not match File value

My recommendation is to modify your gres.conf file so the entries have a count value that matches what is in your slurm.conf.  For the example entry I used above, it should look like this:
Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4

I would also recommend defining the device files associated with your GPUs in the gres.conf.  That would look something like this:
File=/dev/nvidia[0-3]

I don't think this is causing a failure, but is beneficial to have set.  You can read a more thorough description of this parameter in the documentation here:
https://slurm.schedmd.com/gres.conf.html#OPT_File
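As a quick sanity check before restarting, here is a hypothetical one-liner (assuming gres.conf lives at /etc/slurm/gres.conf) that prints any count still carrying a unit suffix such as 'G':

grep -in 'count=[0-9]*[a-z]' /etc/slurm/gres.conf

Any line it prints still needs the suffix removed.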

Let me know if that helps.

Thanks,
Ben
Comment 21 Wei Feinstein 2020-08-05 11:38:30 MDT
Created attachment 15323 [details]
gres.conf

Hello Ben,

I don't have a 'G' at the end of my gres.conf file. I'm not sure where you
see that.

I will share my file with you. I will consider the File= suggestion.

Jackie

Comment 22 Ben Roberts 2020-08-05 14:11:08 MDT
Ok, the gres.conf file you sent on Jul 21 was where I saw that the counts had a 'G' on the end.  It looks like you've removed it since then, which is good.  Have you tried starting without the config_override parameter since making that change?  I understand if you can't try it outside of a maintenance period.  

Thanks,
Ben
Comment 23 Wei Feinstein 2020-08-05 14:57:32 MDT
Yes, I did, and it did not work. I commented it out and restarted slurmctld
and slurmd, and immediately the gres servers in es1 went offline again due
to the count value.


Jackie

Comment 24 Ben Roberts 2020-08-06 08:47:21 MDT
Do you happen to have the slurmd logs from one of the nodes from the last attempt to start it without config_override?  If not when do you anticipate being able to try it again?

Thanks,
Ben
Comment 25 Ben Roberts 2020-08-20 08:13:49 MDT
Hi Jackie,

I just wanted to check in with you and see if you still have the slurmd logs from the last time you tried starting them without config_override.  Let me know if you're planning on trying to remove this setting at some point.

Thanks,
Ben
Comment 26 Wei Feinstein 2020-08-20 23:15:55 MDT
No, I did not. I tried to find the logs but was unable to pinpoint them.
If you'd like I could do a test on Monday or Tuesday. If you're available
let me know and you can watch me test it via zoom.  What day works best for
you?  I will schedule a zoom meeting.

Thanks

Jackie

Comment 27 Ben Roberts 2020-08-21 10:33:18 MDT
Hi Jackie,

This should be an issue we can work on without scheduling a zoom meeting.  You sent a copy of your gres.conf file on Aug 05.  Does the file still look like it did then?  The main difference between that file and the previous version is that you removed the 'G' from the end of each of the counts, which is what I would expect.  We did discuss in comment 17 specifying a File for the 'es' nodes that have GPUs.  Nvidia uses a special file at /dev/nvidia[0-?] for the GPUs.  I would recommend specifying the file in your gres.conf for the gpu nodes.  For example n0000.es[1] had 4 GPUs configured, so it would look like this:
Nodename=n0000.es[1]  Type=GTX1080TI Name=gpu Count=4 File=/dev/nvidia[0-3]

When you are able to try without the 'config_override' flag, if you still have a failure I would like to have you tar up and send the following things:
slurm.conf
gres.conf
slurmd.log (from a node that had a failure)
slurmctld.log
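A minimal sketch of bundling those up (the paths are assumptions; adjust them to where your configuration and log files actually live):

tar czf bug9441-debug.tar.gz /etc/slurm/slurm.conf /etc/slurm/gres.conf /var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log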

Thanks,
Ben
Comment 28 Wei Feinstein 2020-08-21 19:37:56 MDT
Got it, thanks! I will try and schedule something for next week.

Jackie

Comment 29 Michael DiDomenico 2020-08-26 09:19:51 MDT
I am also seeing similar behavior after upgrading from v18 to v20.  When a mismatched node config (i.e., lower than expected memory or a lower than expected number of GPUs) is detected on a node, slurm kicks out these "Invalid argument" entries.

I agree that the config should match, but it would be better to throw one error in the log and report the node through sinfo -R as being damaged, rather than incessantly throwing errors in the logs every second.
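For reference, a minimal example of the reporting command mentioned above; sinfo -R lists the nodes Slurm has drained or downed along with their reason strings:

sinfo -R -p es1

(The -p filter is just an example using the partition from this ticket.)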
Comment 30 Ben Roberts 2020-09-21 11:22:37 MDT
Hi Jackie,

I'm just checking back in to see if you have been able to try removing the 'config_override' flag.  

Thanks,
Ben
Comment 31 Wei Feinstein 2020-09-22 05:52:30 MDT
I have, but I put it back. I tried removing it once and the nodes once
again started to change state. Is there any issue with leaving it on? If
you'd like to dig into the reason this keeps happening, I'd be happy to
schedule a downtime to do this. How much time would you need, and what
information would you like me to gather during the test period?

Thanks

Jackie Scoggins

Comment 32 Ben Roberts 2020-10-08 15:19:18 MDT
Hi Jackie,

I'm sorry I didn't respond to you sooner; I let your update fall through the cracks.

It should be ok as it is.  If you would like to try and update this at some point in the future please let us know and we can work with you to narrow down what's causing the nodes to fail without 'config_override' set.  

Thanks,
Ben