Ticket 1850 - Bad core_bitmap size error
Summary: Bad core_bitmap size error
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Brian Christiansen
 
Reported: 2015-08-06 04:19 MDT by Paul Edmon
Modified: 2015-08-13 10:30 MDT
CC List: 3 users

Site: Harvard University
Version Fixed: 14.11.9


Attachments
slurm.conf-holy-slurm01 (31.68 KB, text/plain), 2015-08-06 05:01 MDT, Paul Edmon
Bug fix (4.20 KB, patch), 2015-08-13 10:24 MDT, Moe Jette

Description Paul Edmon 2015-08-06 04:19:48 MDT
I recently pulled a bunch of nodes out of slurm.conf that we are moving to a different cluster.  However, when I did, I got:

Aug  6 12:16:37 holy-slurm01 slurmctld[40120]: error: Bad core_bitmap size for reservation (null) (46854 != 45942), ignoring core reservation

I looked at all the reservations and none involve the cores I removed.  Is there a way to fix this?  It's spitting out errors fairly consistently, which makes it hard to read the log.  It also appears to slow the scheduler a bit, since it has to handle the error constantly.

-Paul Edmon-
Comment 1 David Bigagli 2015-08-06 04:22:34 MDT
When you removed the nodes, were there some jobs still running? Did you restart the controller after you pulled those nodes out of slurm.conf?

David
Comment 2 Paul Edmon 2015-08-06 04:24:14 MDT
I made sure there were no jobs running on those nodes and I restarted 
the scheduler after I changed the slurm.conf.

-Paul Edmon-

Comment 3 David Bigagli 2015-08-06 04:29:33 MDT
When changing nodes in slurm.conf, all slurmds have to be restarted as well to keep the configuration in sync. Are there some slurmds running on these hosts?

David
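
For context, a full resync after editing slurm.conf means restarting slurmctld and then every slurmd with the updated file already in place. A minimal sketch, assuming the SysV-style "slurm" init script and pdsh for fan-out (both illustrative; use whatever service manager and parallel shell the site actually runs):

    # Restart the controller on the slurmctld host first, with the new slurm.conf in place.
    service slurm restart
    # Then restart slurmd on every compute node so all daemons agree on the node list.
    pdsh -a 'service slurm restart'
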
Comment 4 Paul Edmon 2015-08-06 04:31:19 MDT
I tried doing that, but I can try again.

-Paul Edmon-

Comment 5 Paul Edmon 2015-08-06 04:36:43 MDT
No, a global restart doesn't remove the error either.

-Paul Edmon-

Comment 6 Moe Jette 2015-08-06 04:58:24 MDT
What do your reservations look like ("scontrol show res")?
Do/did they include any removed nodes?
Please attach your current slurm.conf file and the reservation output.
Comment 7 Paul Edmon 2015-08-06 05:00:54 MDT
Created attachment 2099: slurm.conf-holy-slurm01

[root@holy2a01101 ~]# scontrol show res
ReservationName=dingma StartTime=2015-06-23T09:00:00 EndTime=2015-09-16T09:00:00 Duration=85-00:00:00
    Nodes=hp0103 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=kuang_hp Flags=
    Users=dingma Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=gtorri StartTime=2015-07-06T12:00:00 EndTime=2015-08-11T12:00:00 Duration=36-00:00:00
    Nodes=holy2b0520[1-8] NodeCnt=8 CoreCnt=128 Features=(null) PartitionName=kuang Flags=IGNORE_JOBS,SPEC_NODES
    Users=gtorri Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=zxi StartTime=2015-07-17T09:00:00 EndTime=2015-08-14T09:00:00 Duration=28-00:00:00
    Nodes=holy2a[01204,01208,02102,02201-02204,03102-03103,03203] NodeCnt=10 CoreCnt=640 Features=(null) PartitionName=general Flags=
    Users=zxi Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=kuang3 StartTime=2015-07-30T09:00:00 EndTime=2015-08-27T09:00:00 Duration=28-00:00:00
    Nodes=hp[0101-0102,0104,0201,0203-0204,0301-0303,0401-0404,0601-0604,0701,0703-0704,0801-0804,0901-0904,1001-1004,1101,1103-1104,1202,1301-1304,1401-1402,1502-1504,1603-1604,1701-1704,1801-1804,1901-1904,2001,2003,2101-2103] NodeCnt=64 CoreCnt=768 Features=(null) PartitionName=kuang_hp Flags=
    Users=kuang Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=cyana StartTime=2015-08-06T11:21:27 EndTime=2015-08-20T11:21:27 Duration=14-00:00:00
    Nodes=regal[03,05-13,16-18] NodeCnt=13 CoreCnt=104 Features=(null) PartitionName=regal Flags=IGNORE_JOBS,SPEC_NODES
    Users=syockel,jcuff,dsouza,simai Accounts=(null) Licenses=(null) State=ACTIVE

-Paul Edmon-

Comment 8 Moe Jette 2015-08-06 06:54:47 MDT
It looks like the problem is related to reservation kuang3, which contains node hp1802, and that node no longer exists in slurm.conf.

All of the other nodes in all of the other reservations exist.
Comment 9 Paul Edmon 2015-08-06 06:56:24 MDT
Oh, okay.  I see.  I will have to make sure to remove that.

-Paul Edmon-

Comment 10 Paul Edmon 2015-08-06 06:59:15 MDT
Actually that node does exist:

[root@holy-slurm01 slurm]# scontrol show node hp1802
NodeName=hp1802 Arch=x86_64 CoresPerSocket=6
    CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=12.25 Features=intel
    Gres=(null)
    NodeAddr=hp1802 NodeHostName=hp1802 Version=14.11
    OS=Linux RealMemory=24150 AllocMem=24000 Sockets=2 Boards=1
    State=ALLOCATED ThreadsPerCore=1 TmpDisk=258048 Weight=1
    BootTime=2015-07-08T14:47:17 SlurmdStartTime=2015-08-06T12:35:16
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


# hp. Owned by RC for Kuang
NodeName=hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],\
hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],\
hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],\
hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],\
hp250[1-4],hp260[1-3],hp2702 \
     CPUS=12 RealMemory=24150 Sockets=2 CoresPerSocket=6 \
     ThreadsPerCore=1 TmpDisk=258048 Feature=intel

You can see hp180[1-4] in that list.

-Paul Edmon-


Comment 11 Moe Jette 2015-08-06 07:11:01 MDT
(In reply to Paul Edmon from comment #10)
> Actually that node does exist:

Mea culpa, I was accidentally looking in the node list for partition=priority in your slurm.conf: "...hp180[1,3-4]...", so all of the nodes in all of your reservations are valid.
Comment 12 Paul Edmon 2015-08-06 07:12:37 MDT
Ah, now that is a bug, as in our environment the priority partition should have everything.  I should fix that :).

-Paul Edmon-
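
One hedged illustration of the kind of catch-all partition definition being described, using standard slurm.conf syntax (the options shown are placeholders, not taken from the attached config):

    # Hypothetical catch-all partition: Nodes=ALL always tracks every node defined
    # in slurm.conf, so adding or removing nodes cannot leave this list stale.
    PartitionName=priority Nodes=ALL Default=NO MaxTime=INFINITE State=UP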

Comment 13 Brian Christiansen 2015-08-06 11:26:56 MDT
I'm able to reproduce this. We'll look further into it.
Comment 14 Scott Yockel 2015-08-11 08:27:13 MDT
We are having this issue again.  

Aug 11 14:59:20 holy-slurm01 slurmctld[36344]: find_node_record passed NULL name
Aug 11 14:59:20 holy-slurm01 kernel: slurmctld_sched[57340]: segfault at 68 ip 000000000046ad36 sp 00007f610b9e8c90 error 4 in slurmctld[400000+23f000]
Aug 11 15:00:02 holy-slurm01 purge-binlogs: Purging master logs to binlog.001792

Paul is out of commission at the moment, so I'm trying to get this back into service.  I'm not as seasoned a SLURM admin as he is.  If there's any more info I can provide, I'm all ears.
Comment 15 Brian Christiansen 2015-08-11 08:53:39 MDT
Hey Scott,

This looks different than the original bug. I've created Bug 1854 to track this. I'll move over there.

Thanks,
Brian
Comment 16 Paul Edmon 2015-08-13 04:16:06 MDT
So what's the status on this bug?  At this point I can't even add new nodes to the conf.  This isn't good, as we have a ton of hardware that just landed and we will need to add it to the conf soon (probably next week or so).
Comment 17 Brian Christiansen 2015-08-13 05:49:52 MDT
The best option for now is to delete the old "core based" reservations and recreate them. I've found that if you try to create a new core-based reservation while the existing reservations (the ones created before removing nodes) are still in place, the code will hit an assert when looking at the existing core-based reservations.

We are looking into how to handle this better in 14.11 and future releases.

How many core based reservations do you have?
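
For reference, the delete-and-recreate cycle described above would look roughly like this with scontrol. The reservation shown is dingma from comment 7, but the recreate line should mirror whatever "scontrol show reservation" reports for each reservation being rebuilt; treat this as a sketch, not a verified 14.11.7 recipe:

    # Capture the current definition so it can be recreated with the same parameters.
    scontrol show reservation dingma
    # Delete it (deletion is refused while jobs still reference the reservation)...
    scontrol delete ReservationName=dingma
    # ...then recreate it as a core-based reservation with the same nodes and core count.
    scontrol create reservation ReservationName=dingma Users=dingma \
        StartTime=2015-06-23T09:00:00 EndTime=2015-09-16T09:00:00 \
        Nodes=hp0103 CoreCnt=6 PartitionName=kuang_hp
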
Comment 18 Paul Edmon 2015-08-13 05:57:57 MDT
5 of them.  I think they all have jobs running in them as well.

So can I delete the reservations and recreate them while the jobs keep running?

-Paul Edmon-

Comment 19 Brian Christiansen 2015-08-13 06:45:11 MDT
Good point. You can't delete reservations while there are jobs in the system that have requested the reservation. I'm going to see if we can get the reservations updated through "scontrol update". I'll let you know what I find.
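
If deletion is blocked by running jobs, the update path being investigated would look something like the following; whether this actually clears the stale core_bitmap on 14.11.7 is exactly what was being checked here, so treat it as a sketch rather than a confirmed workaround:

    # Re-specify the core count (or node list) on the existing reservation to make
    # slurmctld re-evaluate it in place, e.g. for the kuang3 reservation above:
    scontrol update ReservationName=kuang3 CoreCnt=768
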
Comment 20 Paul Edmon 2015-08-13 06:46:33 MDT
Thanks.

-Paul Edmon-

Comment 21 Moe Jette 2015-08-13 10:24:33 MDT
Created attachment 2116: Bug fix

This will be fixed in v14.11.9 when released. It rebuilds the core_bitmap associated with an advanced core reservation. You might get different cores, but it makes a best effort (same nodes and same core count, using the same cores as are in use by any jobs running in that reservation). I'll create a ticket to change the save/restore logic so that we can preserve the identical cores, but that will need to wait for a future release (after v15.08). Here's the commit:

https://github.com/SchedMD/slurm/commit/931d18143582aed08c3eea028be6d49713f0c392
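
For sites carrying local builds before 14.11.9 is released, the fix can presumably be picked up by applying the attached patch (or the commit above) to a 14.11 source tree and rebuilding; the patch file name below is illustrative:

    # Option 1: cherry-pick the upstream commit onto a local 14.11 branch.
    git cherry-pick 931d18143582aed08c3eea028be6d49713f0c392
    # Option 2: apply the attachment from this ticket to an unpacked source tree.
    patch -p1 < bug1850-core-bitmap.patch
    # Rebuild and restart slurmctld afterwards.
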
Comment 22 Moe Jette 2015-08-13 10:30:08 MDT
I'm closing this bug. Reopen if the patch doesn't fix this problem for you.

I've opened bug 1864 identifying the data structure changes needed to properly fix this problem.