I recently pulled a bunch of nodes out of slurm.conf that we are moving to a different cluster. However, when I did, I started getting:

Aug 6 12:16:37 holy-slurm01 slurmctld[40120]: error: Bad core_bitmap size for reservation (null) (46854 != 45942), ignoring core reservation

I looked at all the reservations and none involve the cores I removed. Is there a way to fix this? It's spitting out this error fairly consistently, which makes it hard to read the log. It also appears to slow the scheduler a bit, since it has to handle the error constantly.

-Paul Edmon-
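For what it's worth, the change was simply deleting the NodeName entries for the departing hosts and dropping them from the partition node lists; roughly like this (the "moved" hostnames here are placeholders, not the real ones):

# Before: the departing hosts have their own NodeName line and appear in partitions
NodeName=moved[01-16] CPUs=12 RealMemory=24150 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
PartitionName=general Nodes=moved[01-16],holy2a01101 Default=YES MaxTime=INFINITE State=UP

# After: the NodeName line is deleted and the hosts are removed from every
# PartitionName= node list that referenced them
PartitionName=general Nodes=holy2a01101 Default=YES MaxTime=INFINITE State=UP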
When you removed the nodes, were there some jobs still running? Did you restart the controller after you pulled those nodes out of slurm.conf?

David
I made sure there were no jobs running on those nodes, and I restarted the scheduler after I changed the slurm.conf.

-Paul Edmon-
When changing nodes in slurm.conf, all slurmds have to be restarted as well to keep the configuration in sync. Are any slurmds still running on these hosts?

David
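As a rough illustration, a full resync after a node change might look like the following, assuming pdsh is available and the daemons are systemd-managed (older init-script installs would use "service slurm restart" instead); "compute" is a hypothetical pdsh group name:

# 1. Push the updated slurm.conf to every node first; the file must match everywhere.
# 2. Restart the controller, then the slurmds:
systemctl restart slurmctld
pdsh -g compute 'systemctl restart slurmd'
# 3. Spot-check a node afterwards; Version and State should look sane:
scontrol show node hp1802 | grep -E 'Version|State'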
I tried doing that, but I can try again.

-Paul Edmon-
No, a global restart doesn't remove the error either.

-Paul Edmon-
What do your reservations look like ("scontrol show res")? Do/did they include any removed nodes? Please attach your current slurm.conf file and the reservation output.
Created attachment 2099 [details]
slurm.conf-holy-slurm01

[root@holy2a01101 ~]# scontrol show res
ReservationName=dingma StartTime=2015-06-23T09:00:00 EndTime=2015-09-16T09:00:00 Duration=85-00:00:00
   Nodes=hp0103 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=kuang_hp Flags=
   Users=dingma Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=gtorri StartTime=2015-07-06T12:00:00 EndTime=2015-08-11T12:00:00 Duration=36-00:00:00
   Nodes=holy2b0520[1-8] NodeCnt=8 CoreCnt=128 Features=(null) PartitionName=kuang Flags=IGNORE_JOBS,SPEC_NODES
   Users=gtorri Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=zxi StartTime=2015-07-17T09:00:00 EndTime=2015-08-14T09:00:00 Duration=28-00:00:00
   Nodes=holy2a[01204,01208,02102,02201-02204,03102-03103,03203] NodeCnt=10 CoreCnt=640 Features=(null) PartitionName=general Flags=
   Users=zxi Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=kuang3 StartTime=2015-07-30T09:00:00 EndTime=2015-08-27T09:00:00 Duration=28-00:00:00
   Nodes=hp[0101-0102,0104,0201,0203-0204,0301-0303,0401-0404,0601-0604,0701,0703-0704,0801-0804,0901-0904,1001-1004,1101,1103-1104,1202,1301-1304,1401-1402,1502-1504,1603-1604,1701-1704,1801-1804,1901-1904,2001,2003,2101-2103] NodeCnt=64 CoreCnt=768 Features=(null) PartitionName=kuang_hp Flags=
   Users=kuang Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=cyana StartTime=2015-08-06T11:21:27 EndTime=2015-08-20T11:21:27 Duration=14-00:00:00
   Nodes=regal[03,05-13,16-18] NodeCnt=13 CoreCnt=104 Features=(null) PartitionName=regal Flags=IGNORE_JOBS,SPEC_NODES
   Users=syockel,jcuff,dsouza,simai Accounts=(null) Licenses=(null) State=ACTIVE

-Paul Edmon-
It looks like the problem is related to reservation kuang3, which contains node hp1802, and that node no longer exists in slurm.conf. All of the other nodes in all of the other reservations exist.
Oh, okay. I see. I will have to make sure to remove that.

-Paul Edmon-
Actually, that node does exist:

[root@holy-slurm01 slurm]# scontrol show node hp1802
NodeName=hp1802 Arch=x86_64 CoresPerSocket=6 CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=12.25
   Features=intel Gres=(null)
   NodeAddr=hp1802 NodeHostName=hp1802 Version=14.11
   OS=Linux RealMemory=24150 AllocMem=24000 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=258048 Weight=1
   BootTime=2015-07-08T14:47:17 SlurmdStartTime=2015-08-06T12:35:16
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

And from slurm.conf:

# hp. Owned by RC for Kuang
NodeName=hp010[1-4],hp020[1-4],hp030[1-3],hp040[1-4],hp060[1-4],\
hp070[1,3-4],hp080[1-4],hp090[1-4],hp100[1-4],hp110[1,3-4],hp120[1-2],\
hp130[1-4],hp140[1-2],hp150[2-4],hp160[1-4],hp170[1-4],hp180[1-4],\
hp190[1-4],hp200[1,3],hp210[1-4],hp220[3-4],hp230[3-4],hp240[1-2,4],\
hp250[1-4],hp260[1-3],hp2702 \
CPUS=12 RealMemory=24150 Sockets=2 CoresPerSocket=6 \
ThreadsPerCore=1 TmpDisk=258048 Feature=intel

You can see hp180[1-4] there.

-Paul Edmon-
(In reply to Paul Edmon from comment #10)
> Actually that node does exist:

Mea culpa, I was accidentally looking at the node list for partition=priority in your slurm.conf: "...hp180[1,3-4]...", so all of the nodes in all of your reservations are valid.
Ah, now that is a bug, as in our environment priority should have everything. I should fix that :).

-Paul Edmon-
I'm able to reproduce this. We'll look further into it.
We are having this issue again.

Aug 11 14:59:20 holy-slurm01 slurmctld[36344]: find_node_record passed NULL name
Aug 11 14:59:20 holy-slurm01 kernel: slurmctld_sched[57340]: segfault at 68 ip 000000000046ad36 sp 00007f610b9e8c90 error 4 in slurmctld[400000+23f000]
Aug 11 15:00:02 holy-slurm01 purge-binlogs: Purging master logs to binlog.001792

Paul is out of commission at the moment, so I'm trying to get this back into service. I'm not as seasoned a SLURM admin as he is. Any more info that I can provide you, I'm all ears.
Hey Scott,

This looks different from the original bug. I've created Bug 1854 to track this. I'll move over there.

Thanks,
Brian
So what's the status on this bug? At this point I can't even add new nodes to the conf. This isn't good, as we have a ton of hardware that just landed, and we will need to add it to the conf soon (probably next week or so).
The best option for now is to delete the old core-based reservations and recreate them. I've found that if you try to create a new core-based reservation while the existing reservations (ones created before removing nodes) are still in place, the code will hit an assert when looking at those existing core-based reservations.

We are looking into how to handle this better in 14.11 and future releases.

How many core-based reservations do you have?
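To illustrate with one of the reservations from the earlier listing, the delete/recreate cycle would look roughly like this (values copied from the "scontrol show res" output above; the exact CoreCnt/node handling may need adjusting for your select plugin):

# Delete the old core-based reservation (only possible once no jobs reference it):
scontrol delete reservation=kuang3
# Recreate it with equivalent parameters; the controller builds a fresh core_bitmap:
scontrol create reservation reservationname=kuang3 \
    starttime=2015-07-30T09:00:00 endtime=2015-08-27T09:00:00 \
    users=kuang partitionname=kuang_hp nodecnt=64 corecnt=768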
5 of them. I think they all have jobs running in them as well. So can I delete the reservations and recreate them with the jobs going?

-Paul Edmon-
Good point. You can't delete reservations while there are jobs in the system that have requested the reservation. I'm going to see if we can get the reservations updated through "scontrol update". I'll let you know what I find.
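For reference, such an update would be along these lines (illustrative values; whether it actually forces the controller to rebuild the stale core bitmap is what needs verifying):

# Reassert the core count on an existing reservation without deleting it:
scontrol update reservationname=kuang3 corecnt=768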
Thanks.

-Paul Edmon-
Created attachment 2116 [details]
Bug fix

This will be fixed in v14.11.9 when released. It rebuilds the core_bitmap associated with an advanced core reservation. You might possibly get different cores, but it makes a best effort (same nodes and same core count, and it reuses the cores in use by any jobs running in that reservation).

I'll create a ticket to change the save/restore logic so that we can preserve the identical cores, but that will need to wait for a future release (after v15.08).

Here's the commit:
https://github.com/SchedMD/slurm/commit/931d18143582aed08c3eea028be6d49713f0c392
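If you need the fix before v14.11.9 is released, one option is to carry the commit locally (a sketch, assuming you build Slurm from the git sources; the branch name is the usual SchedMD release branch):

git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-14.11
git cherry-pick 931d18143582aed08c3eea028be6d49713f0c392
# then rebuild and restart slurmctld as usual for your installation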
I'm closing this bug. Reopen it if the patch doesn't fix the problem for you. I've opened bug 1864 identifying the data structure changes needed to fix this properly.