After making a reservation that included all the nodes in a partition, we noticed that the reason code for a pending job was a little strange. It gives something like Reason=ReqNodeNotAvail(Unavailable:midway617), where the node mentioned is not in that partition.

I can reproduce by creating a reservation on all the nodes (2 total) in our mic partition:

scontrol create res=mic nodecnt=2 start=now duration=1-00:00:00 users=user1 partition=mic

ReservationName=mic StartTime=2015-04-22T12:58:07 EndTime=2015-04-23T12:58:07 Duration=1-00:00:00
   Nodes=midway-mic[01-02] NodeCnt=2 CoreCnt=32 Features=(null) PartitionName=mic Flags=
   Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE

I am not in the user list for that reservation. I then submitted a job to the mic partition. My job was pending with Reason=ReqNodeNotAvail(Unavailable:midway617). midway617 is not in that partition. After I removed the reservation, the job did start running. I think this may only be a cosmetic issue, but it is a little confusing.
Hi, we are trying to reproduce the problem in house using your steps. Could you please append your configuration? Can you always reproduce this? Thanks, David
Created attachment 1838 [details] slurm.conf
It is reproducible every time I've tried so far. The list of nodes it says are unavailable does change. This morning the reason was this:

Reason=ReqNodeNotAvail(Unavailable:midway[230-232,493-494,569-572,617],midway-l34-[01-04])

Those are nodes from 3 different partitions:
- midway[230-232,493-494],midway-l34-[01-04] are from our gpu partition. All of those nodes except midway-l34-04 are in another active reservation right now.
- midway[569-572] is from our ivyb partition. There is no reservation on those nodes.
- midway617 is from our depablo partition and that node is currently down.
Hi, it turns out that this works as designed. ReqNodeNotAvail indicates nodes that are not available for scheduling because they are in state DOWN, DRAINING/DRAINED, FAILING or NO_RESPOND. Can you verify that this is indeed the case in your cluster?

David
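For reference, a minimal, self-contained sketch of the rule described above; the enum and its values are invented for illustration and are not the definitions in slurm.h:

#include <stdbool.h>
#include <stdio.h>

/* Illustration only: invented state names, not the slurm.h values. */
enum node_state {
        NODE_IDLE, NODE_ALLOCATED, NODE_MIXED,
        NODE_DOWN, NODE_DRAINING, NODE_DRAINED,
        NODE_FAILING, NODE_NO_RESPOND
};

/* A node should appear in the "Unavailable:" list only when it is in
 * one of the states listed in the comment above. */
static bool node_unavailable(enum node_state s)
{
        switch (s) {
        case NODE_DOWN:
        case NODE_DRAINING:
        case NODE_DRAINED:
        case NODE_FAILING:
        case NODE_NO_RESPOND:
                return true;
        default:
                return false;   /* IDLE, ALLOCATED, MIXED, ... */
        }
}

int main(void)
{
        printf("allocated node unavailable? %d\n", node_unavailable(NODE_ALLOCATED));
        printf("down node unavailable?      %d\n", node_unavailable(NODE_DOWN));
        return 0;
}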
From my last comment with Reason=ReqNodeNotAvail(Unavailable:midway[230-232,493-494,569-572,617],midway-l34-[01-04]):

- midway[230-232,493-494],midway-l34-[01-03] were up and had a reservation
- midway[569-572],midway-l34-04 were up and had no reservation
- midway617 was down
We cannot reproduce this case even using your configuration. When this happens again, can you send us the output of 'scontrol show job' and 'sinfo -n <list of hosts>' for the hosts in the ReqNodeNotAvail string, captured at the same time?

David
Did you check your slurmctld log for nodes that are being flagged as not responding on an intermittent basis? Search for "not responding" in the file. It might only be apparent from checking the log.
JobId=14189627 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=165701 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway617,midway-bigmem[01-03]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-04-24T15:35:25 EligibleTime=2015-04-24T15:35:25
   StartTime=2015-04-25T15:35:21 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login1:14691
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=midway-mic02
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/redmine/pubsw/software/modulefiles/comsol
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null

#sinfo -n midway617,midway-bigmem[01-03]
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cron             up   infinite      0    n/a
westmere         up   infinite      0    n/a
sandyb*          up   infinite      0    n/a
gpu              up   infinite      0    n/a
viz              up   infinite      0    n/a
bigmem           up   infinite      3  alloc  midway-bigmem[01-03]
mic              up   infinite      0    n/a
sepalmer         up   infinite      0    n/a
kicp             up   infinite      0    n/a
kicp-long        up   infinite      0    n/a
kicp-ht          up   infinite      0    n/a
surph            up   infinite      0    n/a
surph-large      up   infinite      0    n/a
roux             up   infinite      0    n/a
rouxgpu          up   infinite      0    n/a
weare-dinner     up   infinite      0    n/a
depablo          up   infinite      1  down*  midway617
gagalli          up   infinite      0    n/a
svaikunt         up   infinite      0    n/a
fabrycky         up   infinite      0    n/a
gavoth           up   infinite      0    n/a
tokmakoff        up   infinite      0    n/a
taddy            up   infinite      0    n/a
mkolar           up   infinite      0    n/a
kite             up   infinite      0    n/a
scidmz           up   infinite      0    n/a
amd              up   infinite      0    n/a
ivyb             up   infinite      0    n/a
neiman           up   infinite      0    n/a
As Moe suggested, could you check if the slurmctld.log contains a message like "error: Nodes midway-bigmem0 not responding" for nodes midway-bigmem[01-03]?

David
(In reply to David Bigagli from comment #9)
Just check for the "not responding" because the node names could be grouped into various regular expressions.
There are no messages for "not responding" for the nodes listed in the reason field.
I would like to prepare an instrumentation patch to explore where the issue is. The idea is to print the hosts and their state right where the job reason is set. Would you be able to apply and run the patched code?

David
Yes, I can do that.
Created attachment 1847 [details] print host status when setting node list reason
The patch is attached. It will generate this kind of output in the slurmctld.log:

Apr 28 14:34:53.195548 slurmctld: _print_node_state: print direct map
Apr 28 14:34:53.195564 slurmctld: _print_node_state: node dario state 0x22
Apr 28 14:34:53.195576 slurmctld: _print_node_state: node prometeo state 0x22
Apr 28 14:34:53.195587 slurmctld: _print_node_state: print reverse map

The messages are logged at info level.

David
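For readers without the attachment, a rough standalone illustration of what the instrumentation does; all types and the sample node table here are invented, and the real patch walks slurmctld's internal node table instead:

#include <stdint.h>
#include <stdio.h>

/* Stand-in node record for illustration only. */
struct node_rec {
        const char *name;
        uint32_t    state;      /* packed base state plus flag bits */
};

/* At the point where the job reason is set, log every node's name and
 * raw state word, mimicking the log lines shown above. */
static void print_node_state(const struct node_rec *nodes, int count)
{
        for (int i = 0; i < count; i++)
                printf("_print_node_state: node %s state 0x%x\n",
                       nodes[i].name, (unsigned) nodes[i].state);
}

int main(void)
{
        struct node_rec nodes[] = {
                { "dario",    0x22 },
                { "prometeo", 0x22 },
        };
        print_node_state(nodes, 2);
        return 0;
}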
Well, after restarting slurm the problem isn't happening. I unfortunately didn't do a check right before I restarted so I don't know if just a restart cleared up the problem or if it had already cleared up. I've changed the log level on the patch to debug so I can leave it enabled. I will test again over the next couple of days to see if this starts happening again.
Ah a heisenbug perhaps. Let's wait. David
Ok. This has been intermittent, but I have a job now naming a very large number of nodes (only midway617 is down):

$ scontrol show jobid=14280035
JobId=14280035 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=163163 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway[033-036,069-073,077-112,193-226,249-262,304-338,341-376,379-414,417-452,455-488,617],midway-bigmem[01-03]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-05-07T13:50:44 EligibleTime=2015-05-07T13:50:44
   StartTime=2015-05-08T11:13:48 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login2:19375
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=midway-mic01
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/redmine/pubsw/userguide
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null

$ sinfo -n midway[033-036,069-073,077-112,193-226,249-262,304-338,341-376,379-414,417-452,455-488,617],midway-bigmem[01-03]
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cron             up   infinite      0    n/a
westmere         up   infinite      0    n/a
sandyb*          up   infinite     43    mix  midway[036,081-083,086-088,096,101,104-105,110,193,196,202-203,207-208,210-211,218,262,309,312,317-318,329,334,341,351,353,380-382,389,396,417,433-434,436,459,470,484]
sandyb*          up   infinite    227  alloc  midway[033-035,069-073,077-080,084-085,089-095,097-100,102-103,106-109,111-112,194-195,197-201,204-206,209,212-217,219-226,249-261,304-308,310-311,313-316,319-328,330-333,335-338,342-350,352,354-376,379,383-388,390-395,397-414,418-432,435,437-452,455-458,460-469,471-483,485-488]
gpu              up   infinite      0    n/a
viz              up   infinite      0    n/a
bigmem           up   infinite      1    mix  midway-bigmem02
bigmem           up   infinite      2  alloc  midway-bigmem[01,03]
mic              up   infinite      0    n/a
sepalmer         up   infinite      0    n/a
kicp             up   infinite      0    n/a
kicp-long        up   infinite      0    n/a
kicp-ht          up   infinite      0    n/a
surph            up   infinite      0    n/a
surph-large      up   infinite      0    n/a
roux             up   infinite      0    n/a
rouxgpu          up   infinite      0    n/a
weare-dinner     up   infinite      0    n/a
depablo          up   infinite      1  drain  midway617
gagalli          up   infinite      0    n/a
svaikunt         up   infinite      0    n/a
fabrycky         up   infinite      0    n/a
gavoth           up   infinite      0    n/a
tokmakoff        up   infinite      0    n/a
taddy            up   infinite      0    n/a
mkolar           up   infinite      0    n/a
kite             up   infinite      0    n/a
scidmz           up   infinite      0    n/a
amd              up   infinite      0    n/a
ivyb             up   infinite      0    n/a
neiman           up   infinite      0    n/a

For all of the nodes I checked the additional logging is similar to this:

[2015-05-07T13:50:38.324] debug: _print_node_state: node midway488 state 0x3

All the nodes had state 0x3 and it did not change before or after the job submission.
Could you please append the entire log. The instrumentation prints the direct bitmap and the reverse bitmap after the bits were cleaned. David
Or I should say after the bits were reversed rather than cleaned.
Created attachment 1873 [details] slurmctld.log This is the full log. I turned on the debugging shortly before I submitted that job.
Created attachment 1874 [details] new instrumentation
Indeed, state 0x3 is NODE_STATE_ALLOCATED and it should not be displayed in this context. Can you reproduce this without having a reservation? We are reviewing the code for possible errors. We have also modified the instrumentation a bit; could you please update your code to include it?

David
I've only seen it with a reservation in place. I reproduced it today with the new patch. I'll attach the log file as well.

# scontrol show jobid=14354779
JobId=14354779 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=165128 Nice=0 Account=rcc-staff QOS=mic
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:midway[569-572,617]) Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-05-14T13:39:03 EligibleTime=2015-05-14T13:39:03
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mic AllocNode:Sid=midway-login2:19375
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/spark
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null

# sinfo -n midway[569-572,617]
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cron             up   infinite      0    n/a
westmere         up   infinite      0    n/a
sandyb*          up   infinite      0    n/a
gpu              up   infinite      0    n/a
viz              up   infinite      0    n/a
bigmem           up   infinite      0    n/a
mic              up   infinite      0    n/a
sepalmer         up   infinite      0    n/a
kicp             up   infinite      0    n/a
kicp-long        up   infinite      0    n/a
kicp-ht          up   infinite      0    n/a
surph            up   infinite      0    n/a
surph-large      up   infinite      0    n/a
roux             up   infinite      0    n/a
rouxgpu          up   infinite      0    n/a
weare-dinner     up   infinite      0    n/a
depablo          up   infinite      1  drain  midway617
gagalli          up   infinite      0    n/a
svaikunt         up   infinite      0    n/a
fabrycky         up   infinite      0    n/a
gavoth           up   infinite      0    n/a
tokmakoff        up   infinite      0    n/a
taddy            up   infinite      0    n/a
mkolar           up   infinite      0    n/a
kite             up   infinite      0    n/a
scidmz           up   infinite      0    n/a
amd              up   infinite      0    n/a
ivyb             up   infinite      4  alloc  midway[569-572]
neiman           up   infinite      0    n/a
Created attachment 1878 [details] slurmctld.log
Thanks. Were the nodes midway[569-572,617] in any reservation, or were they just allocated?

David
Those nodes are not in a reservation.
The log file shows that some nodes which are in the allocated state are not listed as available. This should not happen, as the available node bitmap should be populated with all of these nodes. At this point I think we can trace when nodes are added to and removed from the bitmap to hopefully see what is happening. I will attach a new instrumentation patch.

David
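As a sketch of that tracing idea only (the actual instrumentation is in the attached patch; the helper names below and the uint64_t stand-in for Slurm's bitstr_t are invented), every change to the availability bitmap could be funneled through wrappers that log the caller:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for avail_node_bitmap; bit N set means node N is available. */
static uint64_t avail_node_bitmap;

static void node_make_avail(int node, const char *caller)
{
        avail_node_bitmap |= (1ull << node);
        printf("%s: node %d added to avail_node_bitmap\n", caller, node);
}

static void node_make_unavail(int node, const char *caller)
{
        avail_node_bitmap &= ~(1ull << node);
        printf("%s: node %d removed from avail_node_bitmap\n", caller, node);
}

int main(void)
{
        node_make_avail(3, "node_registration");
        node_make_unavail(3, "_schedule");
        return 0;
}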
Bug reassigned to the scheduler development team.

David
I see what is happening. The main scheduling function, _schedule() in src/slurmctld/job_scheduler.c, makes a copy of the available nodes (save_avail_node_bitmap) and then removes nodes from the pool available for pending jobs as it goes along, clearing bits from avail_node_bitmap based upon the resource requirements of higher-priority jobs. The avail_node_bitmap is what is reported to the user as nodes not currently available. In a sense that is correct, but the nodes are not necessarily DOWN; they could just be reserved for higher-priority jobs. We'll need to restructure the code to address this properly.
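A minimal, self-contained sketch of that pattern, using a uint64_t in place of Slurm's bitstr_t and invented node numbers, shows why nodes that are merely claimed by higher-priority jobs end up reported as unavailable:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Eight nodes, all healthy and usable at the start of the
         * scheduling cycle (bit set = available). */
        uint64_t avail_node_bitmap      = 0xffull;
        uint64_t save_avail_node_bitmap = avail_node_bitmap;

        /* Higher-priority pending jobs claim nodes 0-5 during this
         * cycle, so their bits are cleared from the working copy even
         * though none of those nodes is DOWN or DRAINED. */
        uint64_t higher_prio_nodes = 0x3full;
        avail_node_bitmap &= ~higher_prio_nodes;

        /* The reason string is then built from the working copy, so
         * nodes 0-5 are reported as "Unavailable" to the user. */
        uint64_t reported = save_avail_node_bitmap & ~avail_node_bitmap;
        printf("nodes reported unavailable: 0x%llx\n",
               (unsigned long long) reported);
        return 0;
}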
Thank you for your help debugging the problem. Your logs were very helpful. The bug will be fixed in the 14.11.8 release, likely in late June:
https://github.com/SchedMD/slurm/commit/dd6d5ddc79650f696a7fc1b6ee8191befc31f19b

The job's reason will be more verbose to explain the situation more clearly, something like this:

ReqNodeNotAvail, May be reserved for other job, UnavailableNodes: tux10

"ReqNodeNotAvail" indicates that one or more nodes required to start the job are not available right now.
"May be reserved for other job" indicates that the nodes may be reserved for another job (or in use, in a reservation, or DOWN, or .....), but I'm trying to keep it relatively short.
"UnavailableNodes: " will list the nodes which are DOWN, DRAINED, or not responding; in other words, nodes likely to require the attention of an administrator to resolve.
You probably want to remove David's patch as it will generate a lot of log messages.