Bug 12947

Summary: Cannot delete accounts/users in 20.11.8
Product: Slurm Reporter: Bas van der Vlies <bas.vandervlies>
Component: slurmdbdAssignee: Chad Vizino <chad>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: jaap.dijkshoorn
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: SURF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: lisa cluster dump

Description Bas van der Vlies 2021-12-02 04:35:54 MST
Hey we have the same problem as described in issue:
 * https://bugs.schedmd.com/show_bug.cgi?id=12558

Is there a fix for this problem?

```
 [2021-12-02T11:38:17.066] error: CONN:8 No error
[2021-12-02T11:40:13.961] error: mysql_query failed: 1364 Field 'max_tres_pj' doesn't have a default value
update "lisa_assoc_table" as t1 set mod_time=1638441610, deleted=1, def_qos_id=DEFAULT, shares=DEFAULT, max_jobs=DEFAULT, max_jobs_accrue=DEFAULT, min_prio_thresh=DEFAULT, max_submit_jobs=DEFAULT, max_wall_pj=DEFAULT, max_tres_pj=DEFAULT, max_tres_pn=DEFAULT, max_tres_mins_pj=DEFAULT, max_tres_run_mins=DEFAULT, grp_jobs=DEFAULT, grp_submit_jobs=DEFAULT, grp_jobs_accrue=DEFAULT, grp_wall=DEFAULT, grp_tres=DEFAULT, grp_tres_mins=DEFAULT, grp_tres_run_mins=DEFAULT, qos=DEFAULT, delta_qos=DEFAULT, priority=DEFAULT where (id_assoc=46778 || id_assoc=46779 || id_assoc=46780 || id_assoc=46781 || id_assoc=46782 || id_assoc=46783 || id_assoc=46784);
[2021-12-02T11:40:15.467] error: CONN:8 No error
```

The deleting of the accounts/users is piling up!
Comment 1 jaap.dijkshoorn@surf.nl 2021-12-02 04:42:13 MST
And can you confirm that this is the fix / workaound:

 * https://bugs.schedmd.com/show_bug.cgi?id=12558#c3
Comment 2 Jason Booth 2021-12-02 09:55:04 MST
Would you be able to provide a database dump without jobs and job steps and attach that information to this bug for us to review?

For example:

>  mysqldump slurm_acct_db --ignore-table=slurm_acct_db.clustername_job_table --ignore-table=slurm_acct_db.clustername_step_table | xz - > db_dump.xz

Please modify "slurm_acct_db.clustername_step_table" to reflect your database.ClusterName_ ...


What we will be looking at are if the left and right values mentioned here.

https://slurm.schedmd.com/slurmdbd.html#OPT_-R[comma-separated-cluster-name-list]
Comment 3 Jason Booth 2021-12-02 10:16:31 MST
I failed to reply to your second commend. Please do still provide the database dump for us to review.


> And can you confirm that this is the fix / workaound:
> * https://bugs.schedmd.com/show_bug.cgi?id=12558#c3

It is unclear at this point, however we have seen a few issues like this in the past. We have not been able to duplicate this (NULL/DEFAULT). We think this has something to do with the database/version exposed from upgrading OS components, specifically the database.
Comment 5 Bas van der Vlies 2021-12-02 10:24:09 MST
Created attachment 22496 [details]
lisa cluster dump
Comment 6 Chad Vizino 2021-12-02 11:14:35 MST
(In reply to Bas van der Vlies from comment #0)
> Hey we have the same problem as described in issue:
>  * https://bugs.schedmd.com/show_bug.cgi?id=12558
> 
> Is there a fix for this problem?
> 
> ```
>  [2021-12-02T11:38:17.066] error: CONN:8 No error
> [2021-12-02T11:40:13.961] error: mysql_query failed: 1364 Field
> 'max_tres_pj' doesn't have a default value
> update "lisa_assoc_table" as t1 set mod_time=1638441610, deleted=1,
> def_qos_id=DEFAULT, shares=DEFAULT, max_jobs=DEFAULT,
> max_jobs_accrue=DEFAULT, min_prio_thresh=DEFAULT, max_submit_jobs=DEFAULT,
> max_wall_pj=DEFAULT, max_tres_pj=DEFAULT, max_tres_pn=DEFAULT,
> max_tres_mins_pj=DEFAULT, max_tres_run_mins=DEFAULT, grp_jobs=DEFAULT,
> grp_submit_jobs=DEFAULT, grp_jobs_accrue=DEFAULT, grp_wall=DEFAULT,
> grp_tres=DEFAULT, grp_tres_mins=DEFAULT, grp_tres_run_mins=DEFAULT,
> qos=DEFAULT, delta_qos=DEFAULT, priority=DEFAULT where (id_assoc=46778 ||
> id_assoc=46779 || id_assoc=46780 || id_assoc=46781 || id_assoc=46782 ||
> id_assoc=46783 || id_assoc=46784);
> [2021-12-02T11:40:15.467] error: CONN:8 No error
> ```
> 
> The deleting of the accounts/users is piling up!
This has to do with the db engine you are using. From the dump it looks like you are using MariaDB 10.3.31. Was it the same one you used when you first installed slurm (when the db tables were created)?

For MariaDB the related doc at https://mariadb.com/kb/en/text/ says:

>Before MariaDB 10.2.1, BLOB and TEXT columns could not be assigned a DEFAULT value. This restriction was lifted in MariaDB 10.2.1.
For another customer who had this issue (before the bug you reference) we suggested adding defaults to the cluster assoc table to resolve this. For your cluster you can try this after after stopping the slurm daemons, running the commands and restarting things:

>ALTER TABLE lisa_assoc_table MODIFY max_tres_pj text DEFAULT ''; 
>ALTER TABLE lisa_assoc_table MODIFY max_tres_pn text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY max_tres_mins_pj text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY max_tres_run_mins text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY grp_tres text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY grp_tres_mins text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY grp_tres_run_mins text DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY qos blob DEFAULT '';
>ALTER TABLE lisa_assoc_table MODIFY delta_qos blob DEFAULT '';
Comment 7 Bas van der Vlies 2021-12-02 11:40:06 MST
Thanks we will try it. I had to search:
 * We started with slurm 17 and debian 9
 * This upgrade started with debian 10.10 --> mariadb 10.3.29
 * upgraded from slurm 20.02 --> 20.11.8
 * upgraded debian 10.11 --> mariadb 10.3.31

So the mariadb version does not differ that much. That is why we are suprised.
Comment 8 Bas van der Vlies 2021-12-02 12:23:45 MST
Thanks Jason/Chad the alter commands works as expected.

Just one question if we upgrade to the next major (21.08.4) and  the table `cluster_assoc_table` has been changed in that release a field is added or removed do we have the same problem and must we alter the table again?
Comment 9 jaap.dijkshoorn@surf.nl 2021-12-02 12:45:52 MST
Another question to this is the 'update' sql command of the assoc done within slurm changed with the upgrade of slurm?
Comment 11 Chad Vizino 2021-12-02 14:55:21 MST
(In reply to Bas van der Vlies from comment #8)
> Thanks Jason/Chad the alter commands works as expected.
> 
> Just one question if we upgrade to the next major (21.08.4) and  the table
> `cluster_assoc_table` has been changed in that release a field is added or
> removed do we have the same problem and must we alter the table again?
Great! You should not need to alter the table again if you're keeping the db (you do not drop/create the table). If you run:
>mysqldump slurm_acct_db lisa_assoc_table
You should now see the the text/blob fields with default values similar to this:

>CREATE TABLE `cluster_assoc_table` (
> ...
> `user` tinytext NOT NULL DEFAULT '',
> ...
Comment 12 Chad Vizino 2021-12-02 15:36:21 MST
(In reply to jaap.dijkshoorn@surf.nl from comment #9)
> Another question to this is the 'update' sql command of the assoc done
> within slurm changed with the upgrade of slurm?

Sorry. I don't understand what you are asking. would you elaborate?
Comment 14 jaap.dijkshoorn@surf.nl 2021-12-03 00:33:04 MST
(In reply to Chad Vizino from comment #12)
> (In reply to jaap.dijkshoorn@surf.nl from comment #9)
> > Another question to this is the 'update' sql command of the assoc done
> > within slurm changed with the upgrade of slurm?
> 
> Sorry. I don't understand what you are asking. would you elaborate?

We want to understand how we end up hitting this 'bug' at this point. Was it the way the tables were created in the past by SLURM in mysql (mariadb) and that deleting stuff in this table is now handled slightly different then before? or something else? Cause the update of Mariadb at this stage was minor. (from 10.3.29 to 10.3.31)
Comment 16 Chad Vizino 2021-12-03 15:40:08 MST
(In reply to jaap.dijkshoorn@surf.nl from comment #14)
> We want to understand how we end up hitting this 'bug' at this point. Was it
> the way the tables were created in the past by SLURM in mysql (mariadb) and
> that deleting stuff in this table is now handled slightly different then
> before? or something else? Cause the update of Mariadb at this stage was
> minor. (from 10.3.29 to 10.3.31)
About the failing query you list in comment 0:

>[2021-12-02T11:40:13.961] error: mysql_query failed: 1364 Field 'max_tres_pj' doesn't have a default value
>update "lisa_assoc_table" as t1 set mod_time=1638441610, deleted=1, def_qos_id=DEFAULT, shares=DEFAULT, max_jobs=DEFAULT, max_jobs_accrue=DEFAULT, min_prio_thresh=DEFAULT, max_submit_jobs=DEFAULT, max_wall_pj=DEFAULT, max_tres_pj=DEFAULT,...
The Slurm code generating this query has not changed from 20.02 to 20.11 (see it in remove_common() in src/plugins/accounting_storage/mysql/accounting_storage_mysql.c).

The minor (.29 to .31) MariaDB upgrade should have had nothing to do with it. From comment 7 you list these changes:

>  * We started with slurm 17 and debian 9
>  * This upgrade started with debian 10.10 --> mariadb 10.3.29
>  * upgraded from slurm 20.02 --> 20.11.8
>  * upgraded debian 10.11 --> mariadb 10.3.31
Debian 9 shipped with MariaDB 10.1:

>https://mariadb.com/resources/blog/mariadb-server-default-in-debian-9/
And, as mentioned in comment 6:

>Before MariaDB 10.2.1, BLOB and TEXT columns could not be assigned a DEFAULT value. This restriction was lifted in MariaDB 10.2.1.
Your cluster assoc table would have been created without defaults (no "default ''") for the text and blob fields with MariaDB 10.1 when you first installed Slurm and we can see that from your dump from comment 5:

>CREATE TABLE `lisa_assoc_table` (
>  ... 
>  `user` tinytext NOT NULL,
>  `acct` tinytext NOT NULL,
>  `partition` tinytext NOT NULL,
>  `parent_acct` tinytext NOT NULL,
>  `max_tres_pj` text NOT NULL,
>  ... 
>  PRIMARY KEY (`id_assoc`),
>  UNIQUE KEY `udex` (`user`(42),`acct`(42),`partition`(42)),
>  KEY `lft` (`lft`),
>  KEY `account` (`acct`(42))
>) ENGINE=InnoDB AUTO_INCREMENT=46995 DEFAULT CHARSET=utf8mb4;
The problem appears to be a MariaDB issue but I can't pinpoint it. Was your slurm accounting db dropped and recreated at any point when doing Slurm/MariaDB/Distro upgrades?
Comment 17 Chad Vizino 2021-12-07 15:41:57 MST
Since the message slurm is reporting is from MariaDB (comment 0), you way want to check with your db admin on the issue.

Also, we've downgraded the severity of the ticket (comment 8).
Comment 18 Chad Vizino 2021-12-16 11:58:09 MST
Any update on this from your end? Would like to close if things are working now.
Comment 19 jaap.dijkshoorn@surf.nl 2021-12-17 00:58:08 MST
(In reply to Chad Vizino from comment #16)
> (In reply to jaap.dijkshoorn@surf.nl from comment #14)
> > We want to understand how we end up hitting this 'bug' at this point. Was it
> > the way the tables were created in the past by SLURM in mysql (mariadb) and
> > that deleting stuff in this table is now handled slightly different then
> > before? or something else? Cause the update of Mariadb at this stage was
> > minor. (from 10.3.29 to 10.3.31)
> About the failing query you list in comment 0:
> 
> >[2021-12-02T11:40:13.961] error: mysql_query failed: 1364 Field 'max_tres_pj' doesn't have a default value
> >update "lisa_assoc_table" as t1 set mod_time=1638441610, deleted=1, def_qos_id=DEFAULT, shares=DEFAULT, max_jobs=DEFAULT, max_jobs_accrue=DEFAULT, min_prio_thresh=DEFAULT, max_submit_jobs=DEFAULT, max_wall_pj=DEFAULT, max_tres_pj=DEFAULT,...
> The Slurm code generating this query has not changed from 20.02 to 20.11
> (see it in remove_common() in
> src/plugins/accounting_storage/mysql/accounting_storage_mysql.c).
> 
> The minor (.29 to .31) MariaDB upgrade should have had nothing to do with
> it. From comment 7 you list these changes:
> 
> >  * We started with slurm 17 and debian 9
> >  * This upgrade started with debian 10.10 --> mariadb 10.3.29
> >  * upgraded from slurm 20.02 --> 20.11.8
> >  * upgraded debian 10.11 --> mariadb 10.3.31
> Debian 9 shipped with MariaDB 10.1:
> 
> >https://mariadb.com/resources/blog/mariadb-server-default-in-debian-9/
> And, as mentioned in comment 6:
> 
> >Before MariaDB 10.2.1, BLOB and TEXT columns could not be assigned a DEFAULT value. This restriction was lifted in MariaDB 10.2.1.
> Your cluster assoc table would have been created without defaults (no
> "default ''") for the text and blob fields with MariaDB 10.1 when you first
> installed Slurm and we can see that from your dump from comment 5:
> 
> >CREATE TABLE `lisa_assoc_table` (
> >  ... 
> >  `user` tinytext NOT NULL,
> >  `acct` tinytext NOT NULL,
> >  `partition` tinytext NOT NULL,
> >  `parent_acct` tinytext NOT NULL,
> >  `max_tres_pj` text NOT NULL,
> >  ... 
> >  PRIMARY KEY (`id_assoc`),
> >  UNIQUE KEY `udex` (`user`(42),`acct`(42),`partition`(42)),
> >  KEY `lft` (`lft`),
> >  KEY `account` (`acct`(42))
> >) ENGINE=InnoDB AUTO_INCREMENT=46995 DEFAULT CHARSET=utf8mb4;
> The problem appears to be a MariaDB issue but I can't pinpoint it. Was your
> slurm accounting db dropped and recreated at any point when doing
> Slurm/MariaDB/Distro upgrades?


Yes it was. We did this during the upgrade of SLURM from 19 to 20.02
Comment 20 Chad Vizino 2022-01-03 14:53:59 MST
Closing since it seems like things are working for you now.
Comment 21 Chad Vizino 2022-03-22 13:37:22 MDT
Please see bug 13562 comment 21 for a more complete analysis of this issue and some additional steps you may want to consider.