Bug 14270 - sinfo does not print cloud nodes
Summary: sinfo does not print cloud nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 21.08.8
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
 
Reported: 2022-06-08 02:39 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2022-06-17 06:26 MDT
Site: DTU Physics

Attachments
slurm.conf (3.21 KB, text/plain)
2022-06-08 02:39 MDT, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 02:39:46 MDT
Created attachment 25405
slurm.conf

Our test cluster has 2 on-premise nodes and 2 Azure cloud nodes defined in slurm.conf:

NodeName=test[001-002] Weight=10313 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23000 TmpDisk=100000 Feature=xeonx5570
NodeName=camd[001-002] Weight=10005 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=CLOUD RealMemory=26000 TmpDisk=10000 Feature=xeon8272cl,Azure

I need to use sinfo to query the state of nodes, but this does not work correctly with node range expressions for cloud nodes.

An "sinfo -N" lists only non-cloud nodes:

$ sinfo -N -h -O NODELIST,StateComplete:40 
test001             idle                                    
test002             idle      

Nodes with state CLOUD are not listed:

$ sinfo -N -h -O NODELIST,StateComplete:40 -t CLOUD

If I specify a single cloud node name it DOES get printed:

$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001
camd001             idle+cloud+powered_down                 
$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd002
camd002             down+cloud+powered_down+not_responding  

But if I specify a node range expression nothing gets printed:

$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd[001-002]
$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002

Something seems to be inconsistent with the output from sinfo.  The empty output from node range expressions would seem to be a bug.  I need sinfo to work correctly for my cloud node power up/down scripts.

Can you help clarify what's going on?
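
For power up/down scripts like the ones mentioned above, the sinfo output can be parsed into per-node state flags so that a node missing from the output is detected explicitly instead of being silently skipped. A minimal Python sketch, assuming the two-column output of the -O NODELIST,StateComplete:40 format used above; the function names are hypothetical and not part of Slurm:

```python
import subprocess
from typing import Dict, List, Set


def run_sinfo(nodes: List[str]) -> str:
    """Run the sinfo command from this report for the given node names."""
    cmd = ["sinfo", "-N", "-h", "-O", "NODELIST,StateComplete:40",
           "-n", ",".join(nodes)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_sinfo_states(output: str) -> Dict[str, Set[str]]:
    """Map each node name to its set of lowercase state flags."""
    states: Dict[str, Set[str]] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        node, state = fields
        # StateComplete joins the flags with '+', e.g. idle+cloud+powered_down
        states[node] = set(state.lower().split("+"))
    return states


def missing_nodes(expected: List[str], states: Dict[str, Set[str]]) -> List[str]:
    """Return the expected nodes that sinfo did not print at all."""
    return [node for node in expected if node not in states]
```

A script could call missing_nodes(["camd001", "camd002"], parse_sinfo_states(run_sinfo(["camd001", "camd002"]))) and treat a non-empty result as "not reported" rather than as "no such node".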
Comment 1 Dominik Bartkiewicz 2022-06-08 05:36:58 MDT
Hi

I can't recreate this locally.
Could you send me the output from "scontrol -F show node"

Dominik
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 05:42:15 MDT
(In reply to Dominik Bartkiewicz from comment #1)
> Hi
> 
> I can't recreate this locally.
> Could you send me the output from "scontrol -F show node"

Yes, it doesn't print the cloud nodes:

$ scontrol -F show node
NodeName=test001 Arch=x86_64 CoresPerSocket=4 
   CPUAlloc=0 CPUTot=8 CPULoad=0.00
   AvailableFeatures=xeonx5570
   ActiveFeatures=xeonx5570
   Gres=(null)
   NodeAddr=test001 NodeHostName=test001 Version=21.08.8-2
   OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022 
   RealMemory=23000 AllocMem=0 FreeMem=21589 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=100000 Weight=10313 Owner=N/A MCS_label=N/A
   Partitions=xeon8 
   BootTime=2022-05-29T20:49:32 SlurmdStartTime=2022-06-03T11:02:27
   LastBusyTime=2022-06-08T10:47:33
   CfgTRES=cpu=8,mem=23000M,billing=6
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=test002 Arch=x86_64 CoresPerSocket=4 
   CPUAlloc=0 CPUTot=8 CPULoad=0.00
   AvailableFeatures=xeonx5570
   ActiveFeatures=xeonx5570
   Gres=(null)
   NodeAddr=test002 NodeHostName=test002 Version=21.08.8-2
   OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022 
   RealMemory=23000 AllocMem=0 FreeMem=21598 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=100000 Weight=10313 Owner=N/A MCS_label=N/A
   Partitions=xeon8 
   BootTime=2022-05-29T20:49:28 SlurmdStartTime=2022-06-03T11:02:27
   LastBusyTime=2022-06-08T10:47:33
   CfgTRES=cpu=8,mem=23000M,billing=6
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 08:13:09 MDT
I was running a job on an Azure cloud node, and while the node was being powered down, the sinfo command actually printed the node:

$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002
camd001             idle+cloud+powering_down            

After the node was powered down, sinfo again prints empty output.
Comment 4 Dominik Bartkiewicz 2022-06-08 08:23:16 MDT
Hi

You should set PrivateData=cloud in slurm.conf.
Let me know if this will solve this issue.

Dominik
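
Before relying on sinfo to list powered-down cloud nodes, a script can check whether slurm.conf already sets PrivateData with the cloud value. A rough Python sketch with hypothetical function names; it assumes a single slurm.conf text, does not follow Include directives, and keeps the last PrivateData line if several appear (that last point is an assumption about duplicates, not documented Slurm behavior):

```python
from typing import List


def private_data_values(conf_text: str) -> List[str]:
    """Return the lowercase values of PrivateData found in slurm.conf text."""
    values: List[str] = []
    for raw in conf_text.splitlines():
        # Text following '#' is a comment in slurm.conf
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        # Parameter names in slurm.conf are case insensitive
        if sep and key.strip().lower() == "privatedata":
            values = [v.strip().lower() for v in value.split(",") if v.strip()]
    return values


def cloud_nodes_visible(conf_text: str) -> bool:
    """True if PrivateData includes 'cloud' in the given slurm.conf text."""
    return "cloud" in private_data_values(conf_text)
```

For example, cloud_nodes_visible(open("/etc/slurm/slurm.conf").read()) could be used as a sanity check, where the path is only an example and depends on the installation.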
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 08:49:48 MDT
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #4)
> You should set PrivateData=cloud in slurm.conf.
> Let me know if this will solve this issue.

Wow, that fixes the issue!  I ran this as the root user:

$ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002
camd001             idle+cloud+powered_down                 
camd002             down+cloud+powered_down+not_responding  

This is, however, not consistent with the slurm.conf man-page:

PrivateData
This controls what type of information is hidden from regular users.  By default, all information is visible to all users.  User SlurmUser and root can always view all information.

cloud  Powered down nodes in the cloud are visible.

The "cloud" parameter description is highly ambiguous, since the slurm and root users should always see the "cloud" data.

Can we agree that the lack of cloud node display for the slurm and root users is a bug?  It would certainly be convenient if we didn't have to configure PrivateData in this case!

Thanks,
Ole
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 08:51:52 MDT
In comment #0 I also found that a single cloud node name DOES get printed, whereas a node range expression prints nothing.  This seems really buggy.
Comment 7 Dominik Bartkiewicz 2022-06-08 09:47:54 MDT
Hi

The documentation is certainly not precise,
and I agree that PrivateData isn't the best place for this option.

But I don't think we want to change this default behavior.
We will fix the documentation and internally discuss if we can move this option to a more appropriate place in 23.03.

Dominik
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2022-06-08 10:07:28 MDT
(In reply to Dominik Bartkiewicz from comment #7)
> The documentation is certainly not precise,
> and I agree that PrivateData isn't the best place for this option.
> 
> But I don't think we want to change this default behavior.
> We will fix the documentation and internally discuss if we can move this
> option to a more appropriate place in 23.03.

Thanks for your analysis and suggested resolution.  I still believe that PrivateData=cloud should *NOT* be required, because the users slurm and root should by default see all cloud nodes (normal users should not see them).

IMHO, there is a bug in sinfo (and other tools?) causing cloud nodes to not be printed by default for the slurm and root users.  Will SchedMD accept this argument and work towards a bug fix?

Thanks,
Ole
Comment 9 Dominik Bartkiewicz 2022-06-09 05:31:58 MDT
> Thanks for your analysis and suggested resolution.  I still believe that
> PrivateData=cloud should *NOT* be required, because the users slurm and root
> should by default see all cloud nodes (normal users should not see them).

Hiding powered-down cloud nodes from all users has been the default behavior since CLOUD nodes were first introduced.
In slurm-14-11-0 we added an option in PrivateData to print them.
This never was, and was never supposed to be, private data, because all of this information is available in slurm.conf.

> 
> IMHO, there is a bug in sinfo (and other tools?) causing cloud nodes to not
> be printed by default for the slurm and root users.  Will SchedMD accept
> this argument and work towards a bug fix?

This is a documentation bug, and we will fix it.
Perhaps the default behavior isn't the best, but it is now too late to change it as a mere bug fix.
In my opinion, this option shouldn't be available in PrivateData because it is not private data. Instead, we can move this to SlurmctldParameters or add show_flags to allow tools to control this behavior and then deprecate cloud in PrivateData. I will discuss this with the team internally and let you know what we propose.

Dominik
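
The default behavior described in this comment can be summarized as a small predicate. This is only a Python sketch of the rule as stated in this thread, not Slurm's actual implementation; the function name and arguments are hypothetical, and it does not model the single-node -n query from comment #0, which printed the node even without PrivateData=cloud:

```python
from typing import Set


def listed_by_default(state_flags: Set[str], private_data_cloud: bool) -> bool:
    """Approximate whether a node appears in a multi-node sinfo listing.

    state_flags: lowercase flags as printed by StateComplete,
                 e.g. {"idle", "cloud", "powered_down"}.
    private_data_cloud: True if PrivateData includes 'cloud' in slurm.conf.
    """
    # Per this thread, only cloud nodes that are powered down are hidden.
    hidden = {"cloud", "powered_down"} <= state_flags
    return private_data_cloud or not hidden
```

This matches the observations above: a node in state idle+cloud+powering_down was listed, while idle+cloud+powered_down was listed only after PrivateData=cloud was set.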
Comment 16 Dominik Bartkiewicz 2022-06-17 05:22:47 MDT
Hi

This commit updates documentation:
https://github.com/SchedMD/slurm/commit/d889d317fddff

We already have an internal ticket (bug 4751) to track this issue.
I believe that in the next major release, we will implement this option in a better way. Please let me know if you need anything else.

Dominik
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2022-06-17 05:47:40 MDT
(In reply to Dominik Bartkiewicz from comment #16)
> This commit updates documentation:
> https://github.com/SchedMD/slurm/commit/d889d317fddff
> 
> We already have an internal ticket (bug 4751) to track this issue.
> I believe that in the next major release, we will implement this option in a
> better way. Please let me know if you need anything else.

Thanks very much for documenting this!  I see that it's already in the 22.05.2 docs.

Best regards,
Ole
Comment 18 Dominik Bartkiewicz 2022-06-17 06:26:54 MDT
I'm closing this bug as infogiven. Please let us know if you have any other
questions or concerns.

Dominik