Bug 9241 - CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
Status: RESOLVED DUPLICATE of bug 9233
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.02.3
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Felip Moll
Reported: 2020-06-16 00:44 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2020-06-16 09:29 MDT (History)

Site: DTU Physics


Attachments
slurm.conf (5.80 KB, text/plain)
2020-06-16 00:44 MDT, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2020-06-16 00:44:52 MDT
Created attachment 14683
slurm.conf

Today we upgraded the controller node from 19.05 to 20.02.3, and all Slurm commands on the controller node immediately started printing error messages:

# sinfo --version
sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[001-019] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[021-022] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=d[023-068] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[001-021],g[024-078] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[079-084],g[089-110] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=g[085-088] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=h[001-002] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=i[004-030] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=i[031-050] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=x[001-168],x[181-192] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=x[169-180] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=c[001-196] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
sinfo: error: NodeNames=b[001-012] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
slurm 20.02.3

In slurm.conf we have defined NodeName with, for example, Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1. According to the slurm.conf manual, CPUs should then be calculated automatically:

"If CPUs is omitted, its default will be set equal to the product of Boards, Sockets, CoresPerSocket, and ThreadsPerCore."

and:

"Boards and CPUs are mutually exclusive."

The error message "CPUs=1 match no Sockets" therefore appears to be a bug. It may be the same issue as bug 9233.
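The default rule quoted from the manual, together with the consistency check spelled out in the error message, can be sketched as below. This is a hypothetical Python model, not Slurm's own code: NodeSpec and effective_cpus are illustrative names, and resetting CPUs to the full Sockets*CoresPerSocket*ThreadsPerCore product is an assumption about what "Resetting CPUs" does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeSpec:
    """Hypothetical model of the topology fields on one NodeName line."""
    boards: int = 1
    sockets_per_board: int = 1
    cores_per_socket: int = 1
    threads_per_core: int = 1
    cpus: Optional[int] = None  # None means CPUs= was omitted

    @property
    def sockets(self) -> int:
        # Total sockets on the node: Boards * SocketsPerBoard.
        return self.boards * self.sockets_per_board


def effective_cpus(node: NodeSpec) -> int:
    """Return the CPU count implied by the documented default and check."""
    full = node.sockets * node.cores_per_socket * node.threads_per_core
    if node.cpus is None:
        # Manual: if CPUs is omitted, default to the product of Boards,
        # Sockets, CoresPerSocket and ThreadsPerCore.
        return full
    allowed = {node.sockets, node.sockets * node.cores_per_socket, full}
    if node.cpus not in allowed:
        # Same condition as the error message; resetting to the full
        # product is an assumption made for this sketch.
        print(f"CPUs={node.cpus} match no Sockets, Sockets*CoresPerSocket "
              "or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.")
        return full
    return node.cpus
```

Under this reading, the layout quoted above (Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1) gives 1*2*20*1 = 40 CPUs when CPUs is omitted, so the CPUs=1 in the messages does not follow from the documented default.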

Can you please help with a workaround?

Thanks,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2020-06-16 02:40:14 MDT
When I change slurm.conf lines such as:

NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 ...

into

NodeName=a[001-140] Weight=10001 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 ...

then the errors do not show up!  

It appears that Boards=1 SocketsPerBoard=2 is not handled correctly in 20.02.3, and that one must instead use Sockets=2.
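A minimal sketch of applying this workaround mechanically, assuming each Key=Value token on a NodeName line is separated by whitespace and spelled exactly as in the lines above (matching is case-sensitive here, although slurm.conf keywords are not). The function name boards_to_sockets is hypothetical. It only rewrites lines with Boards=1, because with a single board the node's total socket count equals SocketsPerBoard.

```python
import re

# Whitespace-delimited tokens; case-sensitive matching is an assumption.
_BOARDS_ONE = re.compile(r"(?<!\S)Boards=1(?!\S) ?")
_SOCKETS_PER_BOARD = re.compile(r"(?<!\S)SocketsPerBoard=(\d+)(?!\S)")


def boards_to_sockets(line: str) -> str:
    """Rewrite 'Boards=1 ... SocketsPerBoard=N' as 'Sockets=N' on a NodeName line.

    Non-NodeName lines, and NodeName lines lacking either token
    (including any Boards value other than 1), are returned unchanged.
    """
    if not line.lstrip().startswith("NodeName="):
        return line
    if not _BOARDS_ONE.search(line) or not _SOCKETS_PER_BOARD.search(line):
        return line
    # Drop the Boards=1 token plus one trailing space, then rename the key.
    line = _BOARDS_ONE.sub("", line, count=1)
    return _SOCKETS_PER_BOARD.sub(r"Sockets=\1", line, count=1)
```

Applied to the a[001-140] line above, it produces the Sockets=2 form shown. Lines that do not start with NodeName= pass through untouched, so the output of a full slurm.conf should still be reviewed before it is deployed.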

Question: Are Boards and SocketsPerBoard supposed to work correctly, or should these parameters be considered deprecated?

Thanks,
Ole
Comment 2 Felip Moll 2020-06-16 09:29:02 MDT
Thanks Ole,

As you guessed, this is a duplicate of bug 9233.
I am investigating the regression right now.

I am marking this bug as a duplicate, so please follow the other one from now on.

Thank you for your help.

*** This bug has been marked as a duplicate of bug 9233 ***