Bug 9241

Summary: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Configuration
Assignee: Felip Moll <felip.moll>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: felip.moll
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
Site: DTU Physics
Attachments: slurm.conf

Description Ole.H.Nielsen@fysik.dtu.dk 2020-06-16 00:44:52 MDT
Created attachment 14683
slurm.conf

Today we upgraded the controller node from 19.05 to 20.02.3, and immediately all Slurm commands (on the controller node) give error messages:

# sinfo --version
sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[001-019] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[021-022] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[023-068] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[001-021],g[024-078] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[079-084],g[089-110] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[085-088] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=h[001-002] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=i[004-030] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=i[031-050] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=x[001-168],x[181-192] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=x[169-180] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=c[001-196] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=b[001-012] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
slurm 20.02.3
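
Read literally, the error text says that the CPUs value was compared against three candidate counts and matched none of them. The following is a minimal Python sketch of that comparison, based only on the wording of the message and not on Slurm's source code; the function name `cpus_match_topology` and the sample values are hypothetical.

```python
def cpus_match_topology(cpus, sockets, cores_per_socket, threads_per_core):
    """Return True if cpus equals one of the three counts named in the message.

    Hypothetical helper: it mirrors the wording of the error text only.
    """
    candidates = {
        sockets,                                        # Sockets
        sockets * cores_per_socket,                     # Sockets*CoresPerSocket
        sockets * cores_per_socket * threads_per_core,  # ...*ThreadsPerCore
    }
    return cpus in candidates


# Illustrative node: 2 sockets, 20 cores per socket, 1 thread per core.
print(cpus_match_topology(1, 2, 20, 1))   # CPUs=1, as in the error message
print(cpus_match_topology(40, 2, 20, 1))  # CPUs equal to the full product
```

With these sample values the first call reports no match and the second reports a match, which is consistent with CPUs=1 being rejected for a multi-socket node.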

In slurm.conf we have defined NodeName with Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1.  According to the slurm.conf manual the CPUs should then be calculated automatically:

"If CPUs is omitted, its default will be set equal to the product of Boards, Sockets, CoresPerSocket, and ThreadsPerCore."

and:

"Boards and CPUs are mutually exclusive."
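
As a sanity check on the quoted rule, the expected default can be worked out by hand. This is a small Python sketch, not Slurm code; the helper name `default_cpus` is hypothetical, and it assumes that the total socket count is Boards times SocketsPerBoard.

```python
def default_cpus(boards, sockets_per_board, cores_per_socket, threads_per_core):
    """Product rule from the slurm.conf manual quote (hypothetical helper).

    Assumption: total sockets = boards * sockets_per_board.
    """
    return boards * sockets_per_board * cores_per_socket * threads_per_core


# Values from the NodeName definition described above.
expected = default_cpus(
    boards=1, sockets_per_board=2, cores_per_socket=20, threads_per_core=1
)
print(expected)
```

For this definition the product is 40, so a reported value of CPUs=1 does not agree with the documented default.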

The error message "CPUs=1 match no Sockets" would seem to be a bug.  It may be the same issue as in bug 9233.

Can you please help with a workaround?

Thanks,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2020-06-16 02:40:14 MDT
When I change slurm.conf lines:

NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 ...

into

NodeName=a[001-140] Weight=10001 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 ...

then the errors do not show up!  

It appears that Boards=1 SocketsPerBoard=2 is not supported correctly in 20.02.3, and that one must instead use Sockets=2.
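
For a slurm.conf with many NodeName lines, the same substitution could be applied mechanically. The Python sketch below is one possible way to do that, under stated assumptions: the function name `boards_to_sockets` is hypothetical, the parameter names are matched with the exact capitalization used in this report, and the total socket count is taken to be Boards times SocketsPerBoard. Review any rewritten file before deploying it.

```python
import re


def boards_to_sockets(line):
    """Rewrite 'Boards=B SocketsPerBoard=S' on a NodeName line as 'Sockets=B*S'.

    Hypothetical helper; lines that are not NodeName definitions, or that
    lack either parameter, are returned unchanged.
    """
    if not line.lstrip().startswith("NodeName="):
        return line
    boards = re.search(r"\bBoards=(\d+)", line)
    per_board = re.search(r"\bSocketsPerBoard=(\d+)", line)
    if not (boards and per_board):
        return line
    total = int(boards.group(1)) * int(per_board.group(1))
    # Swap SocketsPerBoard for the total Sockets count, then drop Boards.
    line = re.sub(r"\bSocketsPerBoard=\d+", f"Sockets={total}", line)
    line = re.sub(r"\bBoards=\d+\s*", "", line)
    return line


before = ("NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 "
          "CoresPerSocket=4 ThreadsPerCore=1")
print(boards_to_sockets(before))
```

Applied to the a[001-140] line above, this yields the Sockets=2 form that made the errors disappear.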

Question: Should Boards and SocketsPerBoard be working correctly, or should these parameters be deprecated?

Thanks,
Ole
Comment 2 Felip Moll 2020-06-16 09:29:02 MDT
Thanks Ole,

As you guessed, this is a duplicate of bug 9233.
I am investigating the regression right now.

I am marking this bug as a duplicate, so please follow the other one from now on.

Thank you for your help.

*** This bug has been marked as a duplicate of bug 9233 ***