Bug 4579 - AssocGrpBilling message
Summary: AssocGrpBilling message
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-01-04 12:39 MST by Robert Yelle
Modified: 2018-01-04 17:31 MST

See Also:
Site: University of Oregon
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm configuration file (8.76 KB, text/plain)
2018-01-04 12:39 MST, Robert Yelle
Details

Description Robert Yelle 2018-01-04 12:39:38 MST
Created attachment 5845 [details]
slurm configuration file

Hello,

We just upgraded to Slurm 17.11.1-2 from 17.02.9.  Overall the upgrade went well, but while doing some testing we realized we hit a "new" limit of 20 nodes per job: if we request more than 20 nodes, the job stays queued with an "AssocGrpBilling" reason in squeue.  Any ideas on what this means?  I don't remember imposing this kind of limit before, and if we did, I am unable to locate it - I cannot find any TRES, QOS or Partition limit that caps jobs at 20 nodes.  Our latest slurm.conf is attached; let me know if you need anything else.

Thanks!

Rob
Comment 4 Marshall Garey 2018-01-04 13:37:49 MST
Hi Robert, can you show me the output of sacctmgr show tres?
Comment 5 Robert Yelle 2018-01-04 13:44:52 MST
Hi Marshall,

Here is the output of “sacctmgr show tres”:

[root@hpc-hn1 ~]# sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing             gpu      5
    gres             gpu      6

Also, I said before that the limit we observed was 20 nodes, but the problem appears to be more general than that. Here is the latest from “squeue” as other users have started submitting jobs:

[root@hpc-hn1 ~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            287199      long C1_sch1_  raltman PD       0:00      1 (PartitionDown)
            280637   longfat 1960s_da      hty PD       0:00      1 (PartitionDown)
            287093   longfat 1980s_se      hty PD       0:00      1 (PartitionDown)
            283292   longfat    mdrun  jharman PD       0:00      1 (PartitionDown)
            283293   longfat    mdrun  jharman PD       0:00      1 (PartitionDown)
            283294   longfat    mdrun  jharman PD       0:00      1 (PartitionDown)
            301036   longfat 1980s_da      hty PD       0:00      1 (PartitionDown)
            325536      long DarkPion bostdiek PD       0:00      1 (PartitionDown)
            312295      long     TCFs ebeyerle PD       0:00      1 (PartitionDown)
            316794   longfat 1970s_da      hty PD       0:00      1 (PartitionDown)
            331856    hiprio     bash cphoffma PD       0:00     30 (Resources)
      331094_[7-8]      long   Pythia bostdiek PD       0:00      1 (PartitionDown)
            331873       gpu     bash   dsteck PD       0:00      1 (AssocGrpBilling)
            331904     short  SWorker    riazi PD       0:00      1 (AssocGrpBilling)
            331906     short  SWorker    riazi PD       0:00      1 (AssocGrpBilling)
            331907     short  SWorker    riazi PD       0:00      1 (AssocGrpBilling)
            331414       gpu    align kkinning PD       0:00      1 (AssocGrpBilling)
            331308     short iceberg_ dcarroll PD       0:00     12 (AssocGrpBilling)
            331080       fat  g09_CNT   btaber PD       0:00      1 (AssocGrpBilling)
            331884       fat    runM1 rtumblin PD       0:00      1 (AssocGrpBilling)
            331885   longfat runI2-51 rtumblin PD       0:00      1 (PartitionDown)
            331886       fat    runB2 rtumblin PD       0:00      1 (AssocGrpBilling)
            331889       fat    runB3 rtumblin PD       0:00      1 (AssocGrpBilling)
            331898       fat    runB4 rtumblin PD       0:00      1 (AssocGrpBilling)
            331735       fat Samples_ jpreston PD       0:00      1 (AssocGrpBilling)
            331875      long jnitest1  imamura PD       0:00      1 (PartitionDown)
            331876      long jnitest0  imamura PD       0:00      1 (PartitionDown)
            331877      long jnirtest  imamura PD       0:00      1 (PartitionDown)
            331887     short  jnitest  imamura PD       0:00      1 (AssocGrpBilling)
    331911_[1-323]       gpu bptt_mom tarakaki PD       0:00      1 (AssocGrpBilling)
            331882     short     bash  mchase2  R      35:53      1 n001
            331905     short SPARKMAS    riazi  R      21:34      1 n096
            331881       fat    runE1 rtumblin  R      36:04      1 n128
            331879     short  X-18699 rtumblin  R      38:48      1 n001
         331059_44       gpu   deltaE  rdennis  R      50:12      1 n120
            331883     short  jnitest  imamura  R      34:17      1 n001
            331880     short  jnitest  imamura  R      38:23      1 n001
            331878     short  jnitest  imamura  R      42:56      1 n001
            331908     short     bash oconnor3  R      18:47      1 n096
          331911_0       gpu bptt_mom tarakaki  R      13:48      1 n120


In some cases, like the fat partition, the resources just might not yet be available, but in the case of job 331887, there are resources available in the short partition that are not yet being used.

Thanks,

Rob

Comment 6 Marshall Garey 2018-01-04 15:25:53 MST
Unfortunately you've hit a bug that was found just last week; a fix for it is out today in 17.11.2. However, you're in a state that requires manual intervention. The details are in the release notes for 17.11.2:
https://github.com/SchedMD/slurm/blob/6f39ef81a1f88247fccce48f2e1ce51230df502b/RELEASE_NOTES#L36

The underlying issue is that MySQL (InnoDB) keeps the auto_increment counter in memory only, so on restart it re-seeds the counter to max(id) + 1. MariaDB 10.2 fixed this by persistently storing the auto_increment value:
https://mariadb.com/kb/en/library/auto_increment/#innodbxtradb
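
For illustration, the behavior looks roughly like this (a minimal sketch against a made-up demo table, not the Slurm schema):

mysql> CREATE TABLE ai_demo (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(8)) AUTO_INCREMENT=1001;
mysql> INSERT INTO ai_demo (id, name) VALUES (1,'cpu'),(2,'mem'),(3,'energy'),(4,'node');
-- At this point the next auto-assigned id would be 1001.
-- ... restart mysqld ...
mysql> INSERT INTO ai_demo (name) VALUES ('gpu');
-- Pre-10.2 InnoDB re-seeds the counter to MAX(id)+1 on restart, so this row
-- gets id 5 instead of 1001 -- exactly what happened to your TRES table.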

What happened is this:
1. While on 17.02 or earlier, the database created a TRES table with the known TRES types cpu, mem, energy, and node at IDs 1-4 respectively.
  (a) When the TRES table is created, its auto_increment value is set to 1001.
  (b) Slurm reserves TRES IDs 1-1000 for internally known TRES types and assigns dynamic TRES types (e.g. gres, license) IDs in the 1001+ range.
2. MySQL was restarted and the stored auto_increment seed was lost.
  (a) Upon restarting, MySQL set the auto_increment value to the highest existing ID + 1, which is 5.
3. The controller registered with the slurmdbd and added the gres/gpu TRES at ID 5.
  (a) This is when AccountingStorageTRES=gres/gpu would have been added.
4. You upgraded to 17.11.
5. The slurmdbd was started.
  (a) 17.11 adds a new reserved TRES type, billing, at ID 5, which overwrote the existing gres/gpu TRES type.
  (b) That left a billing/gpu TRES in the database at ID 5.
6. The slurmctld was restarted and registered with the slurmdbd.
  (a) Upon registration the slurmctld told the slurmdbd about the gres/gpu TRES to track.
  (b) The slurmdbd no longer had a gres/gpu TRES, so it added one at ID 6.


In your case you probably had a limit set on gres/gpu, but that limit now applies to billing/gpu (ID 5), because all references to TRES in the other tables are stored by ID rather than by type/name. That billing limit is now being enforced against the TRESBillingWeights defined on each partition.
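
For context, the billing value of a job comes from the TRESBillingWeights on the partition it runs in, and an association-level GrpTRES=billing=N limit then caps the summed billing of that association's running jobs, which is the AssocGrpBilling reason you are seeing. As a purely illustrative example (these names and values are not taken from your attached config):

PartitionName=gpu Nodes=n[120-127] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"

sacctmgr modify account name=someaccount set GrpTRES=billing=200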

There are a couple of choices of how to proceed.

1. Don't worry about the usage that has been recorded under TRES ID 6 since the upgrade, and let the new 17.11.2 code fix/convert the TRES table and TRES usage.
2. Convert all the usage under TRES ID 6 back to TRES ID 5, and then let the new 17.11.2 code fix/convert the TRES table and TRES usage.

Option 1 is easiest and quickest, but you lose a little bit of data. It involves:
1. Remove TRES ID 6 from the tres_table.
2. Change the Type of the billing entry back to the original Type (gres):
  (a) delete from tres_table where id=6; update tres_table set type='gres' where id=5;
3. Restart the 17.11.2 slurmdbd, which will finish the conversion; then you are done.


Option 2 requires more work. It involves:
1. Altering all the tables (job, step, resv, usage, etc.) that reference the new TRES ID 6 back to ID 5, which is where the previous usage is stored (a rough sketch of what this could look like follows below).
2. Remove TRES ID 6 from the tres_table.
3. Change the Type of the billing entry back to the original Type (gres):
  (a) delete from tres_table where id=6; update tres_table set type='gres' where id=5;
4. Restart the 17.11.2 slurmdbd, which will finish the conversion; then you are done.
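
As a rough sketch only of what step 1 could look like -- the table and column names here are hypothetical placeholders, not your actual schema, and the real statements are something we would work out with you:

update example_usage_table set id_tres=5 where id_tres=6;

Any tables that store TRES as comma-separated "id=count" strings would instead need a careful string rewrite of their "6=..." entries to "5=...", so nothing like this should be run without a database backup.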


We can help with the sql commands to do the conversion. What path would you like to take?
Comment 7 Robert Yelle 2018-01-04 16:23:16 MST
Hi Marshall,

Thank you for that info.  We are early enough in the game here to go ahead with option #1.  I will go ahead and build 17.11.2 on the cluster, and yes, I would greatly appreciate help with the necessary sql commands to start the conversion.

Thanks!

Rob


Comment 8 Marshall Garey 2018-01-04 16:30:42 MST
Fantastic. The mysql commands are in the long block of text in comment 6, but here they are again so you don't have to dig through that:

delete from tres_table where id=6; update tres_table set type='gres' where id=5;

Then simply start up the 17.11.2 slurmdbd.
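
If slurmdbd is managed by systemd on your database host (an assumption on my part), that is just:

systemctl start slurmdbd

or, to watch it run the conversion in the foreground: slurmdbd -D -vvv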

I highly recommend backing up your database, just in case something goes horribly wrong.
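
For example, something like the following before making any changes (assuming the default accounting database name of slurm_acct_db; adjust to whatever StorageLoc is set to in your slurmdbd.conf):

mysqldump --single-transaction slurm_acct_db > slurm_acct_db_backup_$(date +%F).sql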

Doing this myself:


Here's what it looks like right now (before the fix):

mysql> select * from tres_table;
+---------------+---------+----+---------+------+
| creation_time | deleted | id | type    | name |
+---------------+---------+----+---------+------+
|    1515103785 |       0 |  1 | cpu     |      |
|    1515103785 |       0 |  2 | mem     |      |
|    1515103785 |       0 |  3 | energy  |      |
|    1515103785 |       0 |  4 | node    |      |
|    1515103857 |       0 |  5 | billing | gpu  |
|    1515104917 |       0 |  6 | gres    | gpu  |
+---------------+---------+----+---------+------+


Fix the table:


mysql> delete from tres_table where id=6; update tres_table set type='gres' where id=5;
Query OK, 1 row affected (0.01 sec)

Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from tres_table;
+---------------+---------+----+--------+------+
| creation_time | deleted | id | type   | name |
+---------------+---------+----+--------+------+
|    1515103785 |       0 |  1 | cpu    |      |
|    1515103785 |       0 |  2 | mem    |      |
|    1515103785 |       0 |  3 | energy |      |
|    1515103785 |       0 |  4 | node   |      |
|    1515103857 |       0 |  5 | gres   | gpu  |
+---------------+---------+----+--------+------+



Then start up the new slurmdbd. After that, it's fixed. It should look like this:

mysql> select * from tres_table;
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1515103785 |       0 |    1 | cpu            |      |
|    1515103785 |       0 |    2 | mem            |      |
|    1515103785 |       0 |    3 | energy         |      |
|    1515103785 |       0 |    4 | node           |      |
|    1515108426 |       0 |    5 | billing        |      |
|    1515108426 |       1 | 1000 | dynamic_offset |      |
|    1515103857 |       0 | 1001 | gres           | gpu  |
+---------------+---------+------+----------------+------+
Comment 9 Robert Yelle 2018-01-04 17:30:17 MST
Hi Marshall,

Thanks!  The mysql change went without issue and the tables look good, matching the ones you showed.  The new slurmdbd 17.11.2 started without a hitch and the cluster is being brought back online.  Right now things look good, but I’ll let you know if anything else comes up.

Rob


Comment 10 Marshall Garey 2018-01-04 17:31:27 MST
That's great to hear, thanks Robert. I'll close this ticket as resolved/fixed for now; please reopen it if you have further issues.