Created attachment 5845 [details]
slurm configuration file

Hello,

We just upgraded to slurm 17.11.1-2 from 17.02.9. Overall the upgrade went well, but while doing some testing we realized we have hit a "new" limit of 20 nodes per job: if we request more than 20 nodes, the job stays queued with the reason "AssocGrpBilling" in squeue. Any ideas on what this means? I don't remember imposing this kind of limit before, and if it exists I am unable to locate it: I cannot find any TRES, QOS, or Partition limit that caps the number of nodes at 20.

Our latest slurm.conf is attached; let me know if you need anything else.

Thanks!
Rob
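As a general note for anyone landing here with the same reason code: the usual places a Grp/Max TRES limit can hide are the associations, the QOS definitions, and the partition definitions. A minimal set of checks, assuming nothing beyond stock sacctmgr/scontrol:

    # Association (account/user) limits, including GrpTRES:
    sacctmgr show assoc format=Cluster,Account,User,GrpTRES,MaxTRES

    # QOS limits:
    sacctmgr show qos format=Name,GrpTRES,MaxTRES,MaxTRESPU

    # Partition limits and any TRESBillingWeights:
    scontrol show partition

If none of these show a billing or node cap, the limit may be coming from somewhere unexpected, as turned out to be the case in this ticket.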
Hi Robert, can you show me the output of sacctmgr show tres?
Hi Marshall,

Here is the output of "sacctmgr show tres":

[root@hpc-hn1 ~]# sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing             gpu      5
    gres             gpu      6

Also, I said before that the limit we observed was 20 nodes, but it appears the problem is more general than that. Here is the latest from "squeue" as other users have started submitting jobs:

[root@hpc-hn1 ~]# squeue
JOBID           PARTITION NAME      USER      ST TIME  NODES NODELIST(REASON)
287199          long      C1_sch1_  raltman   PD 0:00  1     (PartitionDown)
280637          longfat   1960s_da  hty       PD 0:00  1     (PartitionDown)
287093          longfat   1980s_se  hty       PD 0:00  1     (PartitionDown)
283292          longfat   mdrun     jharman   PD 0:00  1     (PartitionDown)
283293          longfat   mdrun     jharman   PD 0:00  1     (PartitionDown)
283294          longfat   mdrun     jharman   PD 0:00  1     (PartitionDown)
301036          longfat   1980s_da  hty       PD 0:00  1     (PartitionDown)
325536          long      DarkPion  bostdiek  PD 0:00  1     (PartitionDown)
312295          long      TCFs      ebeyerle  PD 0:00  1     (PartitionDown)
316794          longfat   1970s_da  hty       PD 0:00  1     (PartitionDown)
331856          hiprio    bash      cphoffma  PD 0:00  30    (Resources)
331094_[7-8]    long      Pythia    bostdiek  PD 0:00  1     (PartitionDown)
331873          gpu       bash      dsteck    PD 0:00  1     (AssocGrpBilling)
331904          short     SWorker   riazi     PD 0:00  1     (AssocGrpBilling)
331906          short     SWorker   riazi     PD 0:00  1     (AssocGrpBilling)
331907          short     SWorker   riazi     PD 0:00  1     (AssocGrpBilling)
331414          gpu       align     kkinning  PD 0:00  1     (AssocGrpBilling)
331308          short     iceberg_  dcarroll  PD 0:00  12    (AssocGrpBilling)
331080          fat       g09_CNT   btaber    PD 0:00  1     (AssocGrpBilling)
331884          fat       runM1     rtumblin  PD 0:00  1     (AssocGrpBilling)
331885          longfat   runI2-51  rtumblin  PD 0:00  1     (PartitionDown)
331886          fat       runB2     rtumblin  PD 0:00  1     (AssocGrpBilling)
331889          fat       runB3     rtumblin  PD 0:00  1     (AssocGrpBilling)
331898          fat       runB4     rtumblin  PD 0:00  1     (AssocGrpBilling)
331735          fat       Samples_  jpreston  PD 0:00  1     (AssocGrpBilling)
331875          long      jnitest1  imamura   PD 0:00  1     (PartitionDown)
331876          long      jnitest0  imamura   PD 0:00  1     (PartitionDown)
331877          long      jnirtest  imamura   PD 0:00  1     (PartitionDown)
331887          short     jnitest   imamura   PD 0:00  1     (AssocGrpBilling)
331911_[1-323]  gpu       bptt_mom  tarakaki  PD 0:00  1     (AssocGrpBilling)
331882          short     bash      mchase2   R  35:53 1     n001
331905          short     SPARKMAS  riazi     R  21:34 1     n096
331881          fat       runE1     rtumblin  R  36:04 1     n128
331879          short     X-18699   rtumblin  R  38:48 1     n001
331059_44       gpu       deltaE    rdennis   R  50:12 1     n120
331883          short     jnitest   imamura   R  34:17 1     n001
331880          short     jnitest   imamura   R  38:23 1     n001
331878          short     jnitest   imamura   R  42:56 1     n001
331908          short     bash      oconnor3  R  18:47 1     n096
331911_0        gpu       bptt_mom  tarakaki  R  13:48 1     n120

In some cases, like the fat partition, the resources just might not yet be available, but in the case of job 331887 there are resources available in the short partition that are not being used.

Thanks,
Rob
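As an aside, for a single stuck job like 331887, the scheduler's view of the job and the limits on the submitting user's association can be pulled directly. A sketch using the job ID and user from the output above:

    # Show the pending job, including its Reason= field:
    scontrol show job 331887

    # Show the submitting user's association limits:
    sacctmgr show assoc where user=imamura format=Account,User,GrpTRES,MaxTRES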
Unfortunately you've hit a bug that was found this last week; a fix for it is out today. Unfortunately, you're in a state that requires manual intervention. The details are in the release notes for 17.11.2:

https://github.com/SchedMD/slurm/blob/6f39ef81a1f88247fccce48f2e1ce51230df502b/RELEASE_NOTES#L36

A bug was discovered in MySQL where MySQL stores the auto_increment value in memory and resets the auto_increment seed to max id + 1 upon restarting. MariaDB 10.2 has fixed this by persistently storing the auto_increment value:

https://mariadb.com/kb/en/library/auto_increment/#innodbxtradb

What happened is:

1. While in 17.02, or earlier, the database created a TRES table with the known TRES cpu, mem, energy, and node, with IDs 1-4 respectively.
   (a) When the TRES table is created, its auto_increment number is set to 1001.
   (b) Slurm reserves TRES IDs 1-1000 for internal known TRES types and assigns dynamic TRES types (e.g. gres, license) IDs in the 1001+ range.
2. MySQL was restarted and the stored auto_increment seed number was lost.
   (a) Upon restarting, MySQL set the auto_increment number to the highest existing ID + 1, which is 5.
3. The controller registered with the slurmdbd and added the gres/gpu TRES at ID 5.
   (a) AccountingStorageTRES=gres/gpu would have been added.
4. You upgraded to 17.11.
5. You started the slurmdbd.
   (a) In 17.11 a new reserved TRES type, billing, was added at ID 5, overwriting the previous gres/gpu TRES type.
   (b) This left a billing/gpu TRES type in the database at ID 5.
6. The slurmctld was restarted and registered with the slurmdbd.
   (a) Upon registration the slurmctld told the slurmdbd about the gres/gpu TRES to track.
   (b) The slurmdbd doesn't have a gres/gpu TRES, so it added it at ID 6.

In your case you probably had a limit on gres/gpu, but it is now on billing/gpu (id=5); all references to TRES in other tables are done by ID instead of type/name. The billing limit is now being enforced against the TRESBillingWeights defined on each partition.

There are a couple of choices for how to proceed:

1. Don't worry about the usage that has been recorded under TRES ID 6 since the upgrade, and let the new 17.11.2 fix/convert the TRES table and TRES usage.
2. Convert all the usage under TRES ID 6 back to TRES ID 5, then let the new 17.11.2 fix/convert the TRES table and TRES usage.

Option 1 is the easiest and quickest, but you lose a little bit of data. It involves:

1. Remove TRES ID 6 from the tres_table.
2. Change the type of billing back to the original type (gres):
   (a) delete from tres_table where id=6; update tres_table set type='gres' where id=5;
3. Restart the 17.11.2 slurmdbd, which will finish the conversion; then you are done.

Option 2 requires more work. It involves:

1. Alter all the tables (job, step, resv, usage, etc.) that reference the new TRES ID 6 back to ID 5, which is where the previous usage is stored.
2. Remove TRES ID 6 from the tres_table.
3. Change the type of billing back to the original type (gres):
   (a) delete from tres_table where id=6; update tres_table set type='gres' where id=5;
4. Restart the 17.11.2 slurmdbd, which will finish the conversion; then you are done.

We can help with the SQL commands to do the conversion. What path would you like to take?
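As a side note, the broken state can be confirmed directly in MySQL before choosing an option. A minimal sketch, assuming the default slurm_acct_db accounting database name (adjust if yours differs):

    -- An affected cluster shows billing/gpu at ID 5 and gres/gpu at ID 6:
    select * from tres_table;

    -- The Auto_increment column of the table status shows the current seed;
    -- on an affected cluster it is a small value rather than the reserved 1001+:
    show table status like 'tres_table';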
Hi Marshall,

Thank you for that info. We are early enough in the game here to go ahead with option #1. I will go ahead and build 17.11.2 on the cluster, and yes, I would greatly appreciate help with the necessary SQL commands to start the conversion.

Thanks!
Rob
Fantastic. The mysql commands are in the long block of text in comment 5, but here they are again so you don't have to dig through that:

delete from tres_table where id=6;
update tres_table set type='gres' where id=5;

Then simply start up the 17.11.2 slurmdbd. I highly recommend backing up your database, just in case something goes horribly wrong.

Doing this myself, here's what it looks like right now (before the fix):

mysql> select * from tres_table;
+---------------+---------+----+---------+------+
| creation_time | deleted | id | type    | name |
+---------------+---------+----+---------+------+
|    1515103785 |       0 |  1 | cpu     |      |
|    1515103785 |       0 |  2 | mem     |      |
|    1515103785 |       0 |  3 | energy  |      |
|    1515103785 |       0 |  4 | node    |      |
|    1515103857 |       0 |  5 | billing | gpu  |
|    1515104917 |       0 |  6 | gres    | gpu  |
+---------------+---------+----+---------+------+

Fix the table:

mysql> delete from tres_table where id=6; update tres_table set type='gres' where id=5;
Query OK, 1 row affected (0.01 sec)

Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from tres_table;
+---------------+---------+----+--------+------+
| creation_time | deleted | id | type   | name |
+---------------+---------+----+--------+------+
|    1515103785 |       0 |  1 | cpu    |      |
|    1515103785 |       0 |  2 | mem    |      |
|    1515103785 |       0 |  3 | energy |      |
|    1515103785 |       0 |  4 | node   |      |
|    1515103857 |       0 |  5 | gres   | gpu  |
+---------------+---------+----+--------+------+

Then start up the new slurmdbd. After that, it's fixed. It should look like this:

mysql> select * from tres_table;
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1515103785 |       0 |    1 | cpu            |      |
|    1515103785 |       0 |    2 | mem            |      |
|    1515103785 |       0 |    3 | energy         |      |
|    1515103785 |       0 |    4 | node           |      |
|    1515108426 |       0 |    5 | billing        |      |
|    1515108426 |       1 | 1000 | dynamic_offset |      |
|    1515103857 |       0 | 1001 | gres           | gpu  |
+---------------+---------+------+----------------+------+
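On the backup: a minimal sketch with mysqldump, assuming the default slurm_acct_db database name and a root MySQL user (substitute your own database name and credentials):

    # Dump the accounting database before touching tres_table:
    mysqldump -u root -p slurm_acct_db > slurm_acct_db.backup.sql

    # If anything goes wrong, restore with:
    #   mysql -u root -p slurm_acct_db < slurm_acct_db.backup.sql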
Hi Marshall,

Thanks! The mysql change went without issue and the tables look good, like the ones you showed. The new slurmdbd 17.11.2 started without a hitch and the cluster is being brought back online. Right now things look good, but I'll let you know if anything else comes up.

Rob
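For the record, the post-fix state can also be checked from the Slurm side rather than in SQL. Once the 17.11.2 slurmdbd finishes its conversion, output along these lines is what you would hope to see (IDs taken from this ticket; the deleted dynamic_offset row at ID 1000 is internal and should not appear):

    $ sacctmgr show tres
        Type            Name     ID
    -------- --------------- ------
         cpu                      1
         mem                      2
      energy                      3
        node                      4
     billing                      5
        gres             gpu   1001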
That's great to hear, thanks Robert. I'll close this ticket as resolved/fixed for now; please reopen if you have any further issues.