Bug 9233

Summary: Nodes file structure after upgrade to 20.02.3
Product: Slurm Reporter: Ahmed Essam ElMazaty <ahmed.mazaty>
Component: Configuration    Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: --- CC: cinek, felip.moll, Ole.H.Nielsen
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8713
https://bugs.schedmd.com/show_bug.cgi?id=7295
Site: KAUST
Version Fixed: 20.02.4

Description Ahmed Essam ElMazaty 2020-06-14 06:32:06 MDT
Hello,
We plan to upgrade to 20.02.3 next week, and today we were testing the upgrade on our test cluster.
I kept getting this error whenever I issued a Slurm command:
scontrol: error: NodeNames=cn110-22-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
one for each node in the configuration file.

Before the upgrade, the structure of our nodes.conf file was as follows:
NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
NodeName=cn110-22-l
NodeName=cn110-23-l
...

Is that a change introduced in the new version?

The weird thing is that the node specs appear correctly when I use "scontrol show node":


# scontrol show node cn110-22-l
scontrol: error: NodeNames=cn110-22-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=cn110-23-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=dgpu502-17-l CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
scontrol: error: NodeNames=dgpu502-17-r CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
NodeName=cn110-22-l CoresPerSocket=64 
   CPUAlloc=0 CPUTot=128 CPULoad=N/A
   AvailableFeatures=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G
   ActiveFeatures=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G
   Gres=(null)
   NodeAddr=cn110-22-l NodeHostName=cn110-22-l 
   RealMemory=510000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=ALL,batch,users 
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=128,mem=510000M,billing=128
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=NO NETWORK ADDRESS FOUND [root@2020-03-17T16:15:38]


I've edited the nodes.conf file, adding "CPUs=<count>" to each record, and the error disappeared. That's easy for a small cluster, but do I have to do the same for every record in our files when upgrading next week?
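
For reference, this is roughly what the edit looks like (a sketch of the idea rather than the literal diff; 128 = 2 sockets * 64 cores * 1 thread from the DEFAULT line above):

# Workaround: state CPUs explicitly on each node record
NodeName=cn110-22-l CPUs=128
NodeName=cn110-23-l CPUs=128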
Comment 1 Felip Moll 2020-06-15 07:01:22 MDT
Hi Ahmed,

It is not happening on my side.
Can you upload your configuration files here so I can try to reproduce it?

Thanks
Comment 2 Ahmed Essam ElMazaty 2020-06-15 07:06:36 MDT
(In reply to Felip Moll from comment #1)
> Hi Ahmed,
> 
> It is not happening on my side.
> Can you upload your configuration files here so I can try to reproduce it?
> 
> Thanks

Dear Felip,
Here is our nodes.conf

#
# New Rome test nodes
#
NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
NodeName=cn110-22-l
NodeName=cn110-23-l
#
# GPU nodes
#
NodeName=DEFAULT Gres=gpu:tesla_k40m:8 Feature=ibex2017,nolmem,cpu_intel_e5_2670,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_tesla_k40m,tesla_k40m RealMemory=252800 Boards=1 SocketsPerBoard=2 CoresPerSocket=8  ThreadsPerCore=1 Weight=5000
NodeName=dgpu502-17-l
NodeName=dgpu502-17-r

Best regards,
Ahmed
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2020-06-16 02:41:20 MDT
Hi,

In bug 9241 I discovered that one must replace "Boards=1 SocketsPerBoard=2" with "Sockets=2".  This solved the issue for me.
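
Applied to the Rome DEFAULT line from comment 2, the change would look like this (a sketch; only the topology keywords change, and the GPU DEFAULT line gets the same treatment):

# Workaround: drop Boards/SocketsPerBoard, declare Sockets directly
NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100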

/Ole
Comment 4 Felip Moll 2020-06-16 09:29:02 MDT
*** Bug 9241 has been marked as a duplicate of this bug. ***
Comment 6 Felip Moll 2020-06-16 11:15:05 MDT
I see the issue and I am working on it.

There were some changes in commit 60c6a1f8f88, which are reflected in the 20.02 RELEASE_NOTES and NEWS:

Release notes:
NOTE: Slurmctld is now set to fatal in case of computing node configured with
      CPUs == #Sockets. CPUs has to be either total number of cores or threads.


News:
 -- NodeName configurations with CPUs != Sockets*Cores or
    Sockets*Cores*Threads will be rejected with fatal.
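
To make the check concrete with the numbers from this report (my illustration, not lines from the commit): with Sockets=2, CoresPerSocket=64 and ThreadsPerCore=1, the only accepted value is CPUs=128, since 2*64 = 128 total cores and 2*64*1 = 128 total threads. Two hypothetical alternatives, not both at once:

# OK in 20.02: CPUs equals Sockets*Cores (and Sockets*Cores*Threads)
NodeName=cn110-22-l CPUs=128 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1
# Rejected with fatal: CPUs == #Sockets, matching neither cores nor threads
NodeName=cn110-22-l CPUs=2 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1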


But I don't think the behavior you're seeing is entirely correct; at the very least it needs clarification.

I will come back when I've figured it out.
Comment 12 Felip Moll 2020-06-29 02:59:54 MDT
Hi,

This is fixed in commit 73ff1b200776 which will be available in 20.02.4.

The workaround is what Ole suggested in comment 3: for the moment, don't use Boards and SocketsPerBoard; use Sockets instead.

Marking the bug as fixed.

Thanks for reporting.
Comment 13 Ole.H.Nielsen@fysik.dtu.dk 2020-06-29 03:10:54 MDT
(In reply to Felip Moll from comment #12)
> Hi,
> 
> This is fixed in commit 73ff1b200776 which will be available in 20.02.4.
> 
> The workaround is what Ole suggested in comment 3: for the moment, don't
> use Boards and SocketsPerBoard; use Sockets instead.
> 
> Marking the bug as fixed.
> 
> Thanks for reporting.

Thanks for the bug fix, Felip!

/Ole