Bug 1614 - strange reason code with entire partition reservation
Summary: strange reason code with entire partition reservation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.5
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks: 1920
Reported: 2015-04-22 06:11 MDT by wettstein
Modified: 2015-09-07 20:29 MDT
CC: 2 users

See Also:
Site: University of Chicago
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.11.8
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (9.12 KB, text/plain) - 2015-04-23 02:40 MDT, wettstein
print host status when setting node list reason (1.56 KB, patch) - 2015-04-28 09:34 MDT, David Bigagli
slurmctld.log (800.14 KB, application/x-gzip) - 2015-05-11 08:04 MDT, wettstein
new instrumentation (1.60 KB, patch) - 2015-05-11 10:43 MDT, David Bigagli
slurmctld.log (187.79 KB, application/x-gzip) - 2015-05-14 06:44 MDT, wettstein

Description wettstein 2015-04-22 06:11:08 MDT
We noticed after making a reservation that had all the nodes in a partition that the reason code for a pending job was a little strange. It gives something like Reason=ReqNodeNotAvail(Unavailable:midway617) where the node mentioned is not in that partition. 

I can reproduce by creating a reservation on all the nodes (2 total) in our mic partition:

scontrol create res=mic nodecnt=2 start=now duration=1-00:00:00 users=user1 partition=mic

ReservationName=mic StartTime=2015-04-22T12:58:07 EndTime=2015-04-23T12:58:07 Duration=1-00:00:00
   Nodes=midway-mic[01-02] NodeCnt=2 CoreCnt=32 Features=(null) PartitionName=mic Flags=
   Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE

I am not in the user list for that reservation. I then submitted a job to the mic partition. My job was pending with Reason=ReqNodeNotAvail(Unavailable:midway617). midway617 is not in that partition. After I removed the reservation, the job did start running. I think this may only be a cosmetic issue, but it is a little confusing.
Comment 1 David Bigagli 2015-04-22 10:01:10 MDT
Hi, we are trying to reproduce the problem in house using your steps.
Could you please attach your configuration? Can you always reproduce this?

Thanks,
David
Comment 2 wettstein 2015-04-23 02:40:30 MDT
Created attachment 1838 [details]
slurm.conf
Comment 3 wettstein 2015-04-23 02:45:57 MDT
It is reproducible every time I've tried so far. The list of nodes it says are unavailable does change. This morning the reason was this:

Reason=ReqNodeNotAvail(Unavailable:midway[230-232,493-494,569-572,617],midway-l34-[01-04])

Those are nodes from three different partitions: midway[230-232,493-494] and midway-l34-[01-04] are from our gpu partition, and all of those nodes except midway-l34-04 are in another active reservation right now. midway[569-572] is from our ivyb partition; there is no reservation on those nodes. midway617 is from our depablo partition, and that node is currently down.
Comment 4 David Bigagli 2015-04-24 08:21:22 MDT
Hi,
   it turns out that this works as designed. The ReqNodeNotAvail reason indicates nodes that are not available for scheduling, i.e. nodes in state DOWN, DRAINING/DRAINED, FAILING, or NO_RESPOND. Can you verify that this is indeed the case on your cluster?

David
Comment 5 wettstein 2015-04-24 08:26:48 MDT
From my last comment with Reason=ReqNodeNotAvail(Unavailable:midway[230-232,493-494,569-572,617],midway-l34-[01-04])

midway[230-232,493-494],midway-l34-[01-03] were up and had a reservation
midway[569-572],midway-l34-04 were up and had no reservation
midway617 was down
Comment 6 David Bigagli 2015-04-24 08:33:51 MDT
We cannot reproduce this case even using your configuration. When this happens
again, can you send us the output of 'scontrol show job' and 'sinfo -n <list of hosts>'
for the hosts in the ReqNodeNotAvail string, captured at the same time?

David
Comment 7 Moe Jette 2015-04-24 08:36:45 MDT
Did you check your slurmctld log for nodes that are being flagged as not responding on an intermittent basis? Search for "not responding" in the file. It might only be apparent from checking the log.
Comment 8 wettstein 2015-04-24 08:38:24 MDT
JobId=14189627 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=165701 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway617,midway-bigmem[01-03]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-04-24T15:35:25 EligibleTime=2015-04-24T15:35:25
   StartTime=2015-04-25T15:35:21 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login1:14691
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=midway-mic02
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/redmine/pubsw/software/modulefiles/comsol
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null

#sinfo -n midway617,midway-bigmem[01-03]
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
cron            up   infinite      0    n/a
westmere        up   infinite      0    n/a
sandyb*         up   infinite      0    n/a
gpu             up   infinite      0    n/a
viz             up   infinite      0    n/a
bigmem          up   infinite      3  alloc midway-bigmem[01-03]
mic             up   infinite      0    n/a
sepalmer        up   infinite      0    n/a
kicp            up   infinite      0    n/a
kicp-long       up   infinite      0    n/a
kicp-ht         up   infinite      0    n/a
surph           up   infinite      0    n/a
surph-large     up   infinite      0    n/a
roux            up   infinite      0    n/a
rouxgpu         up   infinite      0    n/a
weare-dinner    up   infinite      0    n/a
depablo         up   infinite      1  down* midway617
gagalli         up   infinite      0    n/a
svaikunt        up   infinite      0    n/a
fabrycky        up   infinite      0    n/a
gavoth          up   infinite      0    n/a
tokmakoff       up   infinite      0    n/a
taddy           up   infinite      0    n/a
mkolar          up   infinite      0    n/a
kite            up   infinite      0    n/a
scidmz          up   infinite      0    n/a
amd             up   infinite      0    n/a
ivyb            up   infinite      0    n/a
neiman          up   infinite      0    n/a
Comment 9 David Bigagli 2015-04-24 09:15:16 MDT
As Moe suggested, could you check whether the slurmctld.log contains messages like
"error: Nodes midway-bigmem0 not responding" for nodes midway-bigmem[01-03]?

David
Comment 10 Moe Jette 2015-04-24 09:27:40 MDT
(In reply to David Bigagli from comment #9)

Just check for the "not responding" because the node names could be grouped into various regular expressions.
Comment 11 wettstein 2015-04-27 02:21:42 MDT
There are no messages for "not responding" for the nodes listed in the
reason field.
Comment 12 David Bigagli 2015-04-28 08:37:59 MDT
I would like to prepare an instrumentation patch to explore where the issue is. The idea is to print the hosts and their state right where the job reason is set.

Would you be able to apply and run the patched code?

David
Comment 13 wettstein 2015-04-28 08:38:58 MDT
Yes, I can do that.

Comment 14 David Bigagli 2015-04-28 09:34:42 MDT
Created attachment 1847 [details]
print host status when setting node list reason
Comment 15 David Bigagli 2015-04-28 09:36:24 MDT
The patch is attached. It will generate this kind of output in the slurmctld.log:

Apr 28 14:34:53.195548  slurmctld: _print_node_state: print direct map
Apr 28 14:34:53.195564  slurmctld: _print_node_state: node dario state 0x22
Apr 28 14:34:53.195576  slurmctld: _print_node_state: node prometeo state 0x22
Apr 28 14:34:53.195587  slurmctld: _print_node_state: print reverse map

The messages are logged at info level.

David
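
For reference, the hex values in these messages can be decoded with a small helper like the sketch below. It assumes the base-state numbering of enum node_states in Slurm's slurm.h (the low nibble is the base state; the upper bits are flags such as DRAIN or NO_RESPOND, which the sketch leaves uninterpreted), and the exact values may differ between Slurm versions.

/* decode_state.c - minimal sketch, not part of the patch.
 * The base-state numbering is assumed to mirror enum node_states in
 * slurm.h; only the low nibble is decoded, the upper bits are printed
 * as raw flag bits. */
#include <stdio.h>
#include <stdlib.h>

#define NODE_STATE_BASE 0x000f  /* assumed mask for the base state */

static const char *base_state_str(unsigned int state)
{
    switch (state & NODE_STATE_BASE) {
    case 0:  return "UNKNOWN";
    case 1:  return "DOWN";
    case 2:  return "IDLE";
    case 3:  return "ALLOCATED";
    case 4:  return "ERROR";
    case 5:  return "MIXED";
    case 6:  return "FUTURE";
    default: return "?";
    }
}

int main(int argc, char **argv)
{
    /* usage: ./decode_state 0x22 0x3 */
    for (int i = 1; i < argc; i++) {
        unsigned int state = (unsigned int) strtoul(argv[i], NULL, 0);
        printf("%s: base=%s flags=0x%x\n", argv[i],
               base_state_str(state), state & ~NODE_STATE_BASE);
    }
    return 0;
}

Under that assumed numbering, the 0x22 lines above decode to base state IDLE with a flag bit set, and the 0x3 seen later in this thread to ALLOCATED.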
Comment 16 wettstein 2015-04-29 08:35:21 MDT
Well, after restarting slurm the problem isn't happening. I
unfortunately didn't do a check right before I restarted so I don't know
if just a restart cleared up the problem or if it had already cleared
up. I've changed the log level on the patch to debug so I can leave it
enabled. I will test again over the next couple of days to see if this
starts happening again.
Comment 17 David Bigagli 2015-04-29 08:41:44 MDT
Ah a heisenbug perhaps. Let's wait.

David
Comment 18 wettstein 2015-05-07 06:59:23 MDT
Ok. This has been intermittent, but I have a job now naming a very large
number of nodes (only midway617 is down):

$ scontrol show jobid=14280035
JobId=14280035 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=163163 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway[033-036,069-073,077-112,193-226,249-262,304-338,341-376,379-414,417-452,455-488,617],midway-bigmem[01-03]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-05-07T13:50:44 EligibleTime=2015-05-07T13:50:44
   StartTime=2015-05-08T11:13:48 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login2:19375
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=midway-mic01
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/redmine/pubsw/userguide
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null


$ sinfo -n midway[033-036,069-073,077-112,193-226,249-262,304-338,341-376,379-414,417-452,455-488,617],midway-bigmem[01-03]
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
cron            up   infinite      0    n/a
westmere        up   infinite      0    n/a
sandyb*         up   infinite     43    mix midway[036,081-083,086-088,096,101,104-105,110,193,196,202-203,207-208,210-211,218,262,309,312,317-318,329,334,341,351,353,380-382,389,396,417,433-434,436,459,470,484]
sandyb*         up   infinite    227  alloc midway[033-035,069-073,077-080,084-085,089-095,097-100,102-103,106-109,111-112,194-195,197-201,204-206,209,212-217,219-226,249-261,304-308,310-311,313-316,319-328,330-333,335-338,342-350,352,354-376,379,383-388,390-395,397-414,418-432,435,437-452,455-458,460-469,471-483,485-488]
gpu             up   infinite      0    n/a
viz             up   infinite      0    n/a
bigmem          up   infinite      1    mix midway-bigmem02
bigmem          up   infinite      2  alloc midway-bigmem[01,03]
mic             up   infinite      0    n/a
sepalmer        up   infinite      0    n/a
kicp            up   infinite      0    n/a
kicp-long       up   infinite      0    n/a
kicp-ht         up   infinite      0    n/a
surph           up   infinite      0    n/a
surph-large     up   infinite      0    n/a
roux            up   infinite      0    n/a
rouxgpu         up   infinite      0    n/a
weare-dinner    up   infinite      0    n/a
depablo         up   infinite      1  drain midway617
gagalli         up   infinite      0    n/a
svaikunt        up   infinite      0    n/a
fabrycky        up   infinite      0    n/a
gavoth          up   infinite      0    n/a
tokmakoff       up   infinite      0    n/a
taddy           up   infinite      0    n/a
mkolar          up   infinite      0    n/a
kite            up   infinite      0    n/a
scidmz          up   infinite      0    n/a
amd             up   infinite      0    n/a
ivyb            up   infinite      0    n/a
neiman          up   infinite      0    n/a

For all of the nodes I checked the additional logging is similar to this:

[2015-05-07T13:50:38.324] debug:  _print_node_state: node midway488 state 0x3

All the nodes had state 0x3 and it did not change before or after the job
submission.
Comment 19 David Bigagli 2015-05-11 08:00:02 MDT
Could you please attach the entire log? The instrumentation prints the direct bitmap and the reverse bitmap after the bits were cleaned.

David
Comment 20 David Bigagli 2015-05-11 08:03:10 MDT
Or I should say after the bits were reversed rather than cleaned.
Comment 21 wettstein 2015-05-11 08:04:23 MDT
Created attachment 1873 [details]
slurmctld.log

This is the full log. I turned on the debugging shortly before I submitted that job.
Comment 22 David Bigagli 2015-05-11 10:43:24 MDT
Created attachment 1874 [details]
new instrumentation
Comment 23 David Bigagli 2015-05-11 10:46:00 MDT
Indeed, state 0x3 is NODE_STATE_ALLOCATED and it should not be displayed in this context. Can you reproduce this without having a reservation?
We are reviewing the code for possible errors. We have also modified the instrumentation a bit; could you please update your code to include it?

David
Comment 24 wettstein 2015-05-14 06:43:29 MDT
I've only seen it with a reservation in place. I reproduced it today
with the new patch. I'll attach the log file as well.

# scontrol show jobid=14354779
JobId=14354779 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=165128 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway[569-572,617]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-05-14T13:39:03 EligibleTime=2015-05-14T13:39:03
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login2:19375
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/spark
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null

# sinfo -n midway[569-572,617]
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
cron            up   infinite      0    n/a
westmere        up   infinite      0    n/a
sandyb*         up   infinite      0    n/a
gpu             up   infinite      0    n/a
viz             up   infinite      0    n/a
bigmem          up   infinite      0    n/a
mic             up   infinite      0    n/a
sepalmer        up   infinite      0    n/a
kicp            up   infinite      0    n/a
kicp-long       up   infinite      0    n/a
kicp-ht         up   infinite      0    n/a
surph           up   infinite      0    n/a
surph-large     up   infinite      0    n/a
roux            up   infinite      0    n/a
rouxgpu         up   infinite      0    n/a
weare-dinner    up   infinite      0    n/a
depablo         up   infinite      1  drain midway617
gagalli         up   infinite      0    n/a
svaikunt        up   infinite      0    n/a
fabrycky        up   infinite      0    n/a
gavoth          up   infinite      0    n/a
tokmakoff       up   infinite      0    n/a
taddy           up   infinite      0    n/a
mkolar          up   infinite      0    n/a
kite            up   infinite      0    n/a
scidmz          up   infinite      0    n/a
amd             up   infinite      0    n/a
ivyb            up   infinite      4  alloc midway[569-572]
neiman          up   infinite      0    n/a
Comment 25 wettstein 2015-05-14 06:44:12 MDT
Created attachment 1878 [details]
slurmctld.log
Comment 26 David Bigagli 2015-05-14 06:49:39 MDT
Thanks. Were the nodes midway[569-572,617] in any reservation, or were they just allocated?

David
Comment 27 wettstein 2015-05-14 06:59:11 MDT
Those nodes are not in a reservation.

Comment 28 David Bigagli 2015-05-19 05:08:18 MDT
The log file shows that some nodes which are in the allocated state are not listed
as available. This should not happen, as the available node bitmap should be
populated with all of these nodes. At this point I think we can trace when nodes are added to and removed from the bitmap to hopefully see what is happening.
I will attach a new instrumentation patch.

David
Comment 29 David Bigagli 2015-05-19 06:08:53 MDT
Bug reassigned to the scheduler development team.

David
Comment 30 Moe Jette 2015-05-26 04:59:25 MDT
I see what is happening. The main scheduling function, _schedule() in src/slurmctld/job_scheduler.c, makes a copy of the available nodes (save_avail_node_bitmap) and then removes nodes from the pool available to pending jobs as it goes along, clearing bits from avail_node_bitmap based upon the resource requirements of higher priority jobs. The avail_node_bitmap is what is reported to the user as the nodes not currently available. In a sense that is correct, but the nodes are not necessarily DOWN; they could just be reserved for higher priority jobs. We'll need to restructure the code to address this properly.
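
An illustrative sketch of that mechanism (not the actual _schedule() code; the tiny bitmap and node names are made up) shows how building the reason string from the scheduler's working copy of the availability bitmap, rather than from the saved copy, pulls allocated or reserved nodes into the "Unavailable" list:

/* sched_reason_sketch.c - illustrative only, not slurmctld code.
 * Shows why clearing bits in the working availability bitmap for nodes
 * claimed by higher-priority jobs, and then reporting that same bitmap,
 * makes healthy allocated/reserved nodes appear as "Unavailable". */
#include <stdio.h>
#include <stdint.h>

#define NUM_NODES 8

static void print_clear_bits(const char *label, uint8_t bitmap)
{
    printf("%s:", label);
    for (int i = 0; i < NUM_NODES; i++)
        if (!(bitmap & (1u << i)))
            printf(" node%d", i);
    printf("\n");
}

int main(void)
{
    uint8_t avail_node_bitmap = 0xfe;          /* node0 is DOWN, rest are up */
    uint8_t save_avail_node_bitmap = avail_node_bitmap;

    /* as the scheduling pass walks higher-priority jobs and reservations,
     * it clears the bits of the nodes they claim in the working copy */
    avail_node_bitmap &= (uint8_t) ~0x3e;      /* nodes 1-5 now claimed */

    /* building the pending job's reason from the working copy reports the
     * claimed nodes too, even though they are merely allocated/reserved */
    print_clear_bits("reason built from working copy", avail_node_bitmap);

    /* building it from the saved copy reports only the nodes that are
     * truly unavailable (DOWN/DRAINED/not responding) */
    print_clear_bits("reason built from saved copy", save_avail_node_bitmap);
    return 0;
}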
Comment 31 Moe Jette 2015-05-26 06:20:23 MDT
Thank you for your help debugging the problem. Your logs were very helpful.

The bug will be fixed in the 14.11.8 release, likely in late June:
https://github.com/SchedMD/slurm/commit/dd6d5ddc79650f696a7fc1b6ee8191befc31f19b

The job's reason will be more verbose to more clearly explain the situation, something like this:
ReqNodeNotAvail, May be reserved for other job, UnavailableNodes: tux10

"ReqNodeNotAvail" is to indicate that one or more nodes required to start the job are not available right now
"May be reserved for other job" is to indicate the the nodes may be reserved for another job (or in use, in a reservation, or DOWN, or .....), but I'm trying to keep it relatively short
"UnavailableNodes: " will list the nodes which are DOWN, DRAINED, or not responding, in other words, likely to require the attention of an administrator to resolve
Comment 32 Moe Jette 2015-05-26 06:21:24 MDT
You probably want to remove David's patch as it will generate a lot of log messages.