Hello,

We plan to upgrade to 20.02.3 next week. Today we were testing the upgrade on our test cluster. Whenever I issue a SLURM command, I get the following error for each node in the configuration file:

scontrol: error: NodeNames=cn110-22-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

Before the upgrade, the structure of our nodes.conf file was as follows:

NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
NodeName=cn110-22-l
NodeName=cn110-23-l
...

Is that a change introduced in the new version?

The weird thing is that the node specs appear correctly when I use "scontrol show node":

# scontrol show node cn110-22-l
scontrol: error: NodeNames=cn110-22-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=cn110-23-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=dgpu502-17-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=dgpu502-17-r CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
NodeName=cn110-22-l CoresPerSocket=64
   CPUAlloc=0 CPUTot=128 CPULoad=N/A
   AvailableFeatures=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G
   ActiveFeatures=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G
   Gres=(null)
   NodeAddr=cn110-22-l NodeHostName=cn110-22-l
   RealMemory=510000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=ALL,batch,users
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=128,mem=510000M,billing=128
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=NO NETWORK ADDRESS FOUND [root@2020-03-17T16:15:38]

I edited the nodes.conf file, adding "CPUs=<count>" to each record, and the error disappeared. That's easy for a small cluster, but do I have to do the same for each record in our files when we upgrade next week?
Hi Ahmed,

It is not happening on my side.
Can you upload your configuration files here so I can try to reproduce it?

Thanks
(In reply to Felip Moll from comment #1)
> Hi Ahmed,
>
> It is not happening on my side.
> Can you upload your configuration files here to try to reproduce it?
>
> Thanks

Dear Felip,

Here is our nodes.conf:

#
# New Rome test nodes
#
NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
NodeName=cn110-22-l
NodeName=cn110-23-l

#
# GPU nodes
#
NodeName=DEFAULT Gres=gpu:tesla_k40m:8 Feature=ibex2017,nolmem,cpu_intel_e5_2670,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_tesla_k40m,tesla_k40m RealMemory=252800 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=5000
NodeName=dgpu502-17-l
NodeName=dgpu502-17-r

Best regards,
Ahmed
Hi,

In bug 9241 I discovered that one must replace "Boards=1 SocketsPerBoard=2" with "Sockets=2". This solved the issue for me.

/Ole
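
For sites with many node records, the substitution described above could be scripted. What follows is only a sketch in Python and is not part of any Slurm tooling: the function name `boards_to_sockets` is hypothetical, and computing Sockets as Boards times SocketsPerBoard is an assumption for any Boards value other than 1 (this report only covers Boards=1). Comment lines and lines without SocketsPerBoard are left unchanged.

```python
import re

# Match the two tokens used in the affected NodeName lines.
_BOARDS = re.compile(r"\bBoards=(\d+)[ \t]*", re.IGNORECASE)
_SOCKETS_PER_BOARD = re.compile(r"\bSocketsPerBoard=(\d+)", re.IGNORECASE)


def boards_to_sockets(line: str) -> str:
    """Rewrite 'Boards=B SocketsPerBoard=S' on one config line as 'Sockets=B*S'.

    Hypothetical helper: for Boards=1 this is the workaround from the report;
    for other values the product is an assumption.
    """
    # Leave comment lines untouched.
    if line.lstrip().startswith("#"):
        return line
    spb = _SOCKETS_PER_BOARD.search(line)
    if spb is None:
        return line
    boards = _BOARDS.search(line)
    n_boards = int(boards.group(1)) if boards else 1
    total_sockets = n_boards * int(spb.group(1))
    # Replace SocketsPerBoard first, then drop the Boards token.
    line = _SOCKETS_PER_BOARD.sub(f"Sockets={total_sockets}", line, count=1)
    line = _BOARDS.sub("", line, count=1)
    return line


if __name__ == "__main__":
    sample = ("NodeName=DEFAULT RealMemory=510000 Boards=1 SocketsPerBoard=2 "
              "CoresPerSocket=64 ThreadsPerCore=1 Weight=100")
    print(boards_to_sockets(sample))
```

Running the rewritten file through a diff against the original before deploying it would be a sensible precaution.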
*** Bug 9241 has been marked as a duplicate of this bug. ***
I see the issue and I am working on it.

There were some changes in commit 60c6a1f8f88 which are reflected in the 20.02 RELEASE_NOTES and NEWS:

Release notes:
NOTE: Slurmctld is now set to fatal in case of computing node configured with CPUs == #Sockets. CPUs has to be either total number of cores or threads.

News:
 -- NodeName configurations with CPUs != Sockets*Cores or Sockets*Cores*Threads will be rejected with fatal.

But I don't think what you're seeing is entirely correct; at the very least it would need clarification. I will come back when I've figured it out.
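
The rule quoted from NEWS can be illustrated with a small check. This is a sketch of the rule as worded there, not the actual slurmctld code; the function name `cpus_is_consistent` and its parameters are hypothetical. It uses the topology of the cn110 nodes from this report (2 sockets, 64 cores per socket, 1 thread per core).

```python
def cpus_is_consistent(cpus: int, sockets: int, cores_per_socket: int,
                       threads_per_core: int = 1) -> bool:
    """Return True if CPUs equals the total core count or the total thread count.

    Hypothetical helper mirroring the NEWS entry:
    CPUs must be Sockets*Cores or Sockets*Cores*Threads.
    """
    total_cores = sockets * cores_per_socket
    total_threads = total_cores * threads_per_core
    return cpus in (total_cores, total_threads)


# cn110 nodes: 2 sockets * 64 cores * 1 thread = 128 CPUs.
print(cpus_is_consistent(128, 2, 64, 1))  # True
# CPUs=1, the value shown in the error message, does not match either product.
print(cpus_is_consistent(1, 2, 64, 1))    # False
```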
Hi,

This is fixed in commit 73ff1b200776, which will be available in 20.02.4.

The workaround is what Ole described in comment 3: for the moment, don't use Boards and SocketsPerBoard; use Sockets instead.

Marking the bug as fixed.

Thanks for reporting.
(In reply to Felip Moll from comment #12)
> Hi,
>
> This is fixed in commit 73ff1b200776 which will be available in 20.02.4.
>
> The workaround is what Ole commented in comment 3, for the moment just don't
> use Boards and SocketsPerBoard and use Sockets instead.
>
> Marking the bug as fixed.
>
> Thanks for reporting.

Thanks for the bug fix, Felip!

/Ole