Summary: | Nodes file structure after upgrade to 20.02.3 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ahmed Essam ElMazaty <ahmed.mazaty> |
Component: | Configuration | Assignee: | Felip Moll <felip.moll> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | cinek, felip.moll, Ole.H.Nielsen |
Version: | 20.02.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8713 https://bugs.schedmd.com/show_bug.cgi?id=7295 | ||
Site: | KAUST | Alineos Sites: | --- |
Bull/Atos Sites: | --- | Confidential Site: | --- |
Cray Sites: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: | 20.02.4 |
Target Release: | --- | DevPrio: | --- |
Description
Ahmed Essam ElMazaty
2020-06-14 06:32:06 MDT
Hi Ahmed,

It is not happening on my side.
Can you upload your configuration files here to try to reproduce it?

Thanks

(In reply to Felip Moll from comment #1)
> Hi Ahmed,
>
> It is not happening on my side.
> Can you upload your configuration files here to try to reproduce it?
>
> Thanks

Dear Felip,

Here is our nodes.conf

    #
    # New Rome test nodes
    #
    NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
    NodeName=cn110-22-l
    NodeName=cn110-23-l
    #
    # GPU nodes
    #
    NodeName=DEFAULT Gres=gpu:tesla_k40m:8 Feature=ibex2017,nolmem,cpu_intel_e5_2670,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_tesla_k40m,tesla_k40m RealMemory=252800 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=5000
    NodeName=dgpu502-17-l
    NodeName=dgpu502-17-r

Best regards,
Ahmed

Hi,

In bug 9241 I have discovered that one must replace "Boards=1 SocketsPerBoard=2" by "Sockets=2". This solved the issue for me.

/Ole

*** Bug 9241 has been marked as a duplicate of this bug. ***

I see the issue and I am working on it.

There were some changes in commit 60c6a1f8f88 which are reflected in 20.02 RELEASE_NOTES and NEWS:

Release notes:
NOTE: Slurmctld is now set to fatal in case of computing node configured with CPUs == #Sockets. CPUs has to be either total number of cores or threads.

News:
-- NodeName configurations with CPUs != Sockets*Cores or Sockets*Cores*Threads will be rejected with fatal.

But I don't think what you're seeing is entirely correct, at least it would need clarification.

I will come back when I've figured it out.

Hi,

This is fixed in commit 73ff1b200776 which will be available in 20.02.4.

The workaround is what Ole commented in comment 3, for the moment just don't use Boards and SocketsPerBoard and use Sockets instead.

Marking the bug as fixed.

Thanks for reporting.

(In reply to Felip Moll from comment #12)
> Hi,
>
> This is fixed in commit 73ff1b200776 which will be available in 20.02.4.
>
> The workaround is what Ole commented in comment 3, for the moment just don't
> use Boards and SocketsPerBoard and use Sockets instead.
>
> Marking the bug as fixed.
>
> Thanks for reporting.

Thanks for the bug fix, Felip!

/Ole
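
For readers still on 20.02.3, here is a sketch of the workaround from comment 3 applied to the first block of the nodes.conf posted above. The only change is the substitution of "Sockets=2" for "Boards=1 SocketsPerBoard=2"; every other value is copied from the reporter's file. It has not been re-tested here, so treat it as an illustration rather than a verified configuration.

```
#
# New Rome test nodes
#
# Workaround for 20.02.3 (fixed in 20.02.4): give the socket count with
# Sockets instead of Boards and SocketsPerBoard.
# Was: Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1
NodeName=DEFAULT Gres="" Feature=raven,cpu_amd_epyc_7702,amd,ibex2019,nogpu,nolmem,local_200G,local_400G,local_500G,local_950G RealMemory=510000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Weight=100
NodeName=cn110-22-l
NodeName=cn110-23-l
```

The same substitution would apply to the GPU nodes block. According to comment 12, the original Boards and SocketsPerBoard form works again from 20.02.4 on.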