Hi Slurm Support,

How are you doing? We have discovered that we lost significant transaction records (as seen via sreport) after these events:

1) https://bugs.schedmd.com/show_bug.cgi?id=8033
2) Upgrade from v18.08.06 to v19.05.04 on 2019-11-20 16:00:54 MST

Some findings: records before 11th Sep 2019 are no longer there.

sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-01-01T00:00:00 - 2019-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name         Used   Energy
--------- --------------- --------- --------------- ------------ --------

Records after 11th Sep 2019:

sreport cluster AccountUtilizationByUser -t hours start=2019-09-11 end=2019-09-12
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-09-11T00:00:00 - 2019-09-11T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name      Used   Energy
--------- --------------- --------- --------------- --------- --------
       m3            root                              83995        0
       m3            p001                               1774        0
    .....           .....                              .....

[damienl@m3-login1 ~]$ which sreport
/opt/slurm-19.05.4/bin/sreport
[damienl@m3-login1 ~]$ sreport -V
slurm 19.05.4

We understand that from bug 8033 there will be some lost records, but we are not sure how much is lost. Or is this a separate bug from the upgrade?

I will be supplying you with more configs and logs. Kindly investigate and advise.

Many Thanks
Damien
Created attachment 12532 [details] Current slurmdbd.conf
Created attachment 12533 [details] Current slurm.conf
Created attachment 12534 [details] slurmdbd logs
Created attachment 12536 [details] More slurmdbd logs
Created attachment 12537 [details] mariadb.log. from sql server
This is taken at 8th Nov 2019:

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
+----------+
| COUNT(*) |
+----------+
| 11701937 |
+----------+

Kindly let us know if you need clarification or more logs.

Thanks
Damien
Hi Damien,

> We understand that from bug 8033, there will be some lost records, but not
> sure how much is lost. Or is this a separate bug from the upgrade ?

The accounting information lost in bug 8033 is limited to the specific runaway jobs detected there. I'm pretty sure that not ALL the jobs were runaway jobs for such a long period (2019-01-01 to 2019-09-10), so something else happened that looks unrelated to bug 8033.

This looks like a problem related to the rollup (the aggregation process that turns job records into the info shown by sreport), but I still need to look into the logs and config that you provided to be certain. Can you get jobs for that period with sacct?

> MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
> +----------+
> | COUNT(*) |
> +----------+
> | 11701937 |

Thanks for that information. Could you also post the output of this other SQL command:

MariaDB [slurm_acct_db_1905]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1546300800 && time_start < 1568073600;

This will tell us if the raw job information is still there, so that we can restore the aggregated info usually shown by sreport.

Regards,
Albert
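In case it helps to double-check the bounds, the two epoch values in that query should correspond to 2019-01-01 and 2019-09-10 (in UTC; this sketch assumes UTC, adjust if you prefer local time). A quick way to verify them with the date command:

$ date -u -d @1546300800
Tue Jan  1 00:00:00 UTC 2019
$ date -u -d @1568073600
Tue Sep 10 00:00:00 UTC 2019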
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
+----------+
| COUNT(*) |
+----------+
| 12082135 |
+----------+
1 row in set (4.75 sec)

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1546300800 && time_start < 1568073600;
+----------+
| COUNT(*) |
+----------+
|  3286576 |
+----------+
1 row in set (21.96 sec)

MariaDB [slurm_acct_db]>
Hi Albert,

We seem to have lost last year's records too. Details:

sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-01-01T00:00:00 - 2018-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

[damienl@m3-login1 etc]$

Cheers
Damien
Hi Damien,

The bad news is that I don't see any error in the logs that you sent me that could explain your problem. It probably happened before the dates covered by those logs. The good news is that, from your comment 9, it looks like we will be able to restore that information.

I will focus on restoring the info, but if you send me more slurmdbd logs from dates when you know it was working and from when it started to fail, I will try to find out why it happened.

Before explaining how to restore it, I would need:

1) The oldest date that you want to recover the sreport info from.
2) The output of this command:

MariaDB [slurm_acct_db_1905]> select * from m3_last_ran_table;

Also some collateral comments:

- I've marked your slurmdbd.conf as private. Remember to remove the StoragePass from that file or mark it as private in the future.
- I've seen that you use DebugLevel=9 for slurmdbd. That's really not necessary.
- I've seen that you use DebugFlags=gres. If you have a specific issue, it can make sense. But in general we recommend using "scontrol setdebugflags +gres" to troubleshoot something, and "scontrol setdebugflags -gres" when you are done. Changing slurm.conf for that is not recommended, because it forces you to reconfigure and because the flag will stay there if you forget about it.

Regards,
Albert
Hi Albert

Thanks for your reply. We kept only 7 days' worth of slurmdbd.log, so it might not be sufficient for this investigation. Nevertheless, I will still cloudsend* these files to albert.gil@schedmd.com as they are quite huge.

Ideally, we want to go back as far as possible (this database started in late 2015), but in this case, backtrack to at least 2018-01-01.

The attached slurmdbd.conf has a fake password.

I will attempt to downgrade both debug settings. Which values would be more suitable? Kindly advise.

Thanks
Damien
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1576284158 |   1576284158 |     1576284158 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>
Hi Albert

I have sent the logs to you via cloudsend*. Please see whether this is sufficient.

Thanks
Damien
Hi Albert

You might also want to refer to the attachment "current slurmdbd.log" in https://bugs.schedmd.com/show_bug.cgi?id=8033. Those logs are from when we had just upgraded to v19.05.04.

I hope this is helpful.

Thanks
Damien
Hi Damien,

> Ideally, we want to go back as much as possible (This database started late 2015), but in this case, backtrack to at least 2018-01-01.

I'm pretty sure that the data is there, because you don't have PurgeJobAfter or PurgeResvAfter set, but just in case, can you run this command to confirm:

MariaDB [slurm_acct_db_1905]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1514764800 && time_start < 1517443200;

This command will return a number close to the number of jobs started in Jan 2018. If the number makes sense (i.e. is not zero), we will restore the information.

BUT, before that we need to fix an IMPORTANT issue in your setup that could obstruct the restore process: I've noticed that your logrotate for slurmdbd is using SIGHUP to signal the daemon. That has not been recommended since 17.11; you should use SIGUSR2 instead:

https://github.com/SchedMD/slurm/blob/slurm-17-11-13-2/RELEASE_NOTES#L221
https://slurm.schedmd.com/slurm.conf.html#SECTION_LOGGING
https://slurm.schedmd.com/SLUG17/FieldNotes.pdf

This is especially important because the rollup process to restore data from 2018-01 will take several days, and a SIGHUP may interfere with it.

Once we know that the job info to restore the sreport data from is in the DB, and that logrotate won't interfere with the rollup process, we will be ready to start the rollup.

Regards,
Albert
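For reference, a minimal postrotate stanza sending SIGUSR2 could look roughly like the sketch below (the log path and the rotation options are only illustrative assumptions; keep your site's existing options and just change the signal):

/mnt/slurm-logs/slurmdbd.log {
    missingok
    notifempty
    rotate 5
    size=5M
    postrotate
        pkill -x --signal SIGUSR2 slurmdbd
    endscript
}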
MariaDB [(none)]>
MariaDB [(none)]>
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1514764800 && time_start < 1517443200;
+----------+
| COUNT(*) |
+----------+
|   123964 |
+----------+
1 row in set (30.55 sec)

MariaDB [slurm_acct_db]>
$ pwd
/etc/logrotate.d
$ cat slurmdb
/mnt/slurm-logs/slurmdbd.log {
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 5
    sharedscripts
    size=5M
    create 640 slurm root
    postrotate
        pkill -x --signal SIGUSR2 slurmdbd
    endscript
}
Hi Damien,

Ok, it seems that we are ready to start the rollup process. These are the steps to follow:

0) Enable DB_USAGE in the DebugFlags of slurmdbd.conf, and leave DebugLevel at Info.

1) Stop the slurmdbd.

2) Update the m3_last_ran_table with the desired <EPOCH_TIME> using the command:

MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>, daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;

Note that <EPOCH_TIME> must be the integer representing the epoch time that you want to recover your sreport info from. You can use epochconverter.com or a similar tool (e.g. the date command [1]) to obtain the desired number. Since you want to restore data from 2018-01-01, the <EPOCH_TIME> should be 1514761200 (in my timezone).

3) Start slurmdbd.

4) Monitor the slurmdbd:
4.1) Watch the slurmdbd log and check the lines with "m3 curr hour is now <EpochTime1>-<EpochTime2>". These lines will give you the current status/date of the rollup.
4.2) With mytop and similar tools, check that everything is working fine at the SQL level.
4.3) With "sdiag" and "sacctmgr show stats" you can also see if the DB is working normally, or if some user is sending a lot of huge queries to it.

For two years of data it may take several days to complete; it will depend on how many jobs per day you have and on the performance of your DB. Keep an eye on the logs to be sure that the process is going well, and let me know how it works.

Regards,
Albert

[1]
$ date "+%s" -d "01/01/2018 00:00:00"
1514761200
$ date -d @1514761200
Mon Jan 1 00:00:00 CET 2018
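A condensed sketch of steps 1) to 3), assuming slurmdbd is managed through systemd as shown elsewhere in this ticket, and that the mysql client can reach the slurm_acct_db database non-interactively (otherwise run the UPDATE from the interactive MariaDB prompt as above):

# 1) stop the daemon
systemctl stop slurmdbd

# 2) point the rollup back to 2018-01-01 (epoch 1514761200 in my timezone; adjust to yours)
mysql slurm_acct_db -e "UPDATE m3_last_ran_table SET hourly_rollup = 1514761200, daily_rollup = 1514761200, monthly_rollup = 1514761200;"

# 3) start the daemon again and watch the rollup progress in the log
systemctl start slurmdbd
tail -f /mnt/slurm-logs/slurmdbd.log | grep "curr hour is now"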
Hi Albert

Thanks for these instructions and advice. Some questions related to this:

1) When doing these record restorations, how will it affect our daily Slurm operations? srun, sbatch, sacctmgr, sacct?

2) If I do this:
MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>, daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
how will it affect the current records in the system? Any effects?

3) If I do the above for start-date=2018-01-01, how will it behave if we then do start-date=2016-01-01 in the future?

4) To what extent are one user's many queries to sacct or squeue going to affect Slurm or its database?

5) What is causing these lost records? Bugs? Database faults? The upgrade procedure?

Kindly advise.

Thanks
Damien
Hi Damien,

> 1) When doing these record restorations, how will it affect our daily Slurm
> operations ? srun, sbatch, sacctmgr, sacct ?

It will have almost no effect. The slurmdbd already runs a rollup process (thread) every hour to update the information that you see in sreport. As it runs every hour, it usually takes just seconds to finish. The main difference is that now we are going to ask slurmdbd to run a rollup from 2018-01-01, and that will take a long time. The only effect you will see is that the sreport info won't be updated until the full, long rollup is completed, unlike in the normal scenario where this information is updated every hour.

> 2) If I do this,
> MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>,
> daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
> How will it affects my current records in the system now ? any effects ?

First of all, to avoid any confusion, please note that you should replace <EPOCH_TIME> with the actual epoch time number representing the date that you want to recover from (e.g. 1514761200). This command won't affect any record in the system, but it will trigger a rollup that recomputes the aggregated sreport information from the actual records of each job (which remain untouched). So only the sreport information is updated.

> 3) If I do this,
> MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>,
> daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
> For start-date=2018-01-01 , How will it reacts if we are doing
> start-date=2016-01-01 in the future ?

The rollup process recomputes/restores sreport info from the date that we set, up to the current time. So if you want to restore the info from 2016-01-01, you should probably do it now: if you do it later, you will recompute the 2018-01-01-to-now range a second time, and that could take quite a lot of extra time (days).

> 4) In what extent when one user's many queries to sacct or squeue is going
> to affect Slurm or its database ?

That's a general issue. The amount of RPC traffic that slurmctld and slurmdbd receive and respond to can be significant. Sometimes users issue such aggressive queries that it leads to a degradation of the responsiveness of the system (i.e. a kind of unintentional DoS attack). In particular, big and constant sacct queries can degrade the responsiveness of slurmdbd (or the SQL backend), and constant squeue queries the responsiveness of slurmctld. That's why sdiag and "sacctmgr show stats" report the RPCs by user: if responsiveness is being affected, admins can detect those aggressive queries and ask the users to avoid them. I only mention it because we want slurmdbd as healthy as possible while running the long rollup; if it is stopped, we would need to restart the long rollup from the beginning. And you won't like it! ;-)

> 5) What is causing these lost records ? bugs? Database faults? Upgrading
> procedure?

That's the big question that we are working to answer. Your issue is strange, but we've seen it more than once, so we are trying to discover what could lead to it. It looks like a corner case of specific signaling done by logrotate while slurmdbd is doing special rollups due to runaway jobs or similar scenarios. But we are still working on it.

Please keep me updated about how the long rollup goes,
Albert
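A simple way to keep an eye on that RPC load while the long rollup runs might be something like the following (the 60-second interval is just an assumption; any similar periodic check of sdiag and "sacctmgr show stats" works):

# refresh controller and dbd statistics every minute
watch -n 60 'sdiag | head -20; echo; sacctmgr show stats | head -15'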
Hi Albert

How do we know when the rollup process has completed (after a few days)? Is there a special string ("Done"?) in slurmdbd.log? What should we look out for?

Cheers
Damien
Hi Damien,

> How do we know the rollup process has completed (after a few days) ? Is
> there a special string (Done ?) at the slurmdbd.log ?
> Or what should we look out ?

The simplest way is to check the slurmdbd.log lines:

m3 curr hour is now <EpochTime1>-<EpochTime2>

If those epoch times reach the current hour, the long rollup has ended and you can check with sreport whether it finished properly. The actual last log line of a long rollup should be:

(as_mysql_usage.c:372) query update "m3_last_ran_table" set hourly_rollup=<CurrentEpochTime>, daily_rollup=<CurrentEpochTime>, monthly_rollup=<CurrentEpochTime>

You can also check the m3_last_ran_table directly in SQL. But my recommendation is to tail slurmdbd.log and keep an eye on the values in the "m3 curr hour is now <EpochTime1>-<EpochTime2>" lines. This is the best way to see the progress and to see when it's done.

Regards,
Albert
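For example, a quick (purely illustrative) way to see how far the rollup has advanced, assuming the log path from your slurmdbd.conf:

# last reported rollup window
grep "curr hour is now" /mnt/slurm-logs/slurmdbd.log | tail -1
# convert an epoch value from that line into a readable date
date -d @1514761200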
Hi Damien,

Has the rollup started?

Have a nice day,
Albert
Hi Albert

Thanks for the follow-up. We are still working on it.

Cheers
Damien
Hi Slurm Support

How are you doing? Happy New Year 2020. I am following up on this. We started the rollup process a few days ago:

MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]>

So far, I have failed to notice any "m3 curr hour is now <EpochTime1>" message inside our slurmdbd.logs.

In addition, I can observe these errors:

cat slurmdbd.log | grep error
[2020-01-02T03:21:53.184] error: We have more time than is possible (7624746+1209600+0)(8834346) > 8712000 for cluster m3(2420) from 2018-05-07T02:00:00 - 2018-05-07T03:00:00 tres 1
[2020-01-02T03:22:04.458] error: We have more time than is possible (8048961+1196640+0)(9245601) > 8712000 for cluster m3(2420) from 2018-05-07T03:00:00 - 2018-05-07T04:00:00 tres 1
[2020-01-02T03:22:15.952] error: We have more time than is possible (8115044+1209600+0)(9324644) > 8712000 for cluster m3(2420) from 2018-05-07T04:00:00 - 2018-05-07T05:00:00 tres 1

Any advice on these issues?

Thanks
Damien
Created attachment 12649 [details] Current slurmdbd.log
Created attachment 12650 [details] yesterday's slurmdbd log
Hi Damien,

> How are you doing ? Happy New Year 2020.

Good, thanks! Hope you are doing well too! :-))

> We have started the rollup process for a few days:
>
> MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup =
> 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
>
> So far, I have fail to notice any "m3 curr hour is now <EpochTime1>" message
> inside our slurmdbd.logs

I think that you skipped step 0) of comment 21:

> 0) Enable DB_USAGE as DebugFlags of slurmdbd.conf, and leave DebugLevel as Info.

The lines "m3 curr hour is now <EpochTime1>" will only appear if DB_USAGE is enabled. Also, I can see a lot of high-debug-level logs, which will be too verbose. Setting DebugLevel to Info will be better.

What is the current output of:

MariaDB [slurm_acct_db_1905]> select * from m3_last_ran_table;

Most probably the rollup is running, but we cannot see it. We can change DebugFlags and DebugLevel and SIGHUP the daemon to see the new output.

> In addition, I can observed these errors:
>
> cat slurmdbd.log |grep error
> [2020-01-02T03:21:53.184] error: We have more time than is possible
> (7624746+1209600+0)(8834346) > 8712000 for cluster m3(2420) from
> 2018-05-07T02:00:00 - 2018-05-07T03:00:00 tres 1
> [2020-01-02T03:22:04.458] error: We have more time than is possible
> (8048961+1196640+0)(9245601) > 8712000 for cluster m3(2420) from
> 2018-05-07T03:00:00 - 2018-05-07T04:00:00 tres 1
> [2020-01-02T03:22:15.952] error: We have more time than is possible
> (8115044+1209600+0)(9324644) > 8712000 for cluster m3(2420) from
> 2018-05-07T04:00:00 - 2018-05-07T05:00:00 tres 1

The dates shown (2018-05) tell me that these errors occurred some time ago and are being shown again because the rollup is re-processing that period. These errors are usually related to reservations, and we are working on them (e.g. bug 6839). At this point, my recommendation is to ignore them while their dates are that old; once the rollup ends, if they still happen, we will focus on them.

Regards,
Albert
MariaDB [(none)]>
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.09 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1568076288 |   1568076288 |     1568076288 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>
Hi Albert,

Thanks for your reply. Should I repeat steps 0-4 again?

Cheers
Damien
Hi Damien,

> MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
> +---------------+--------------+----------------+
> | hourly_rollup | daily_rollup | monthly_rollup |
> +---------------+--------------+----------------+
> | 1568076288 | 1568076288 | 1568076288 |
> +---------------+--------------+----------------+

I'm not sure about this output. I expected either the value that you set (1514725200) or the current time. But maybe it makes sense if slurmdbd was signaled in the middle of the rollup process. Have you signaled the slurmdbd process with a SIGHUP as I mentioned before?

> Should I repeat steps 0-4 again?

If you have done the SIGHUP, the logs should now tell us about the progress of the rollup process. Let me see today's log again and then we can decide. Repeating the steps won't hurt, but maybe the rollup is almost done and we would be restarting it from the beginning.

Regards,
Albert
Hi Albert,

These are today's figures:

MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1568076288 |   1568076288 |     1568076288 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>

It is the same as yesterday.

Cheers
Damien
Hi Damien,

The values of m3_last_ran_table are informative, but not definitive. My recommendation is:

1) Edit your slurmdbd.conf to set DebugFlags=DB_USAGE and DebugLevel=Info
2) Signal the slurmdbd daemon with SIGHUP so that the above changes are loaded, with a command like:

$ kill -s SIGHUP <slurmdbd PID>

3) Wait some minutes to gather some logs with the new configuration, and send me those logs.

With the new logs I will confirm whether the rollup is actually running or not. If it is running, we will only need to wait, and we will be able to monitor it as explained in comment 21. If it is not, we will need to repeat steps 0) to 4) of comment 21.

If you prefer, we can go directly to repeating steps 0) to 4) of comment 21. That would mean restarting the rollup from the beginning. It's not a problem in the sense that data won't be lost, but it may not be necessary and it will take longer to complete. If you follow the three steps recommended above, I will tell you whether it is necessary.

Regards,
Albert
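To avoid looking up the PID by hand, something like this should work for step 2) (the pidof/pkill usage is an assumption based on the logrotate script earlier in this ticket):

# reload slurmdbd's logging configuration without restarting it
kill -s SIGHUP "$(pidof slurmdbd)"
# or, equivalently:
pkill -x --signal SIGHUP slurmdbd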
# cat slurmdbd.conf
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
#DbdAddr=
DbdHost=m3-mgmt2
DbdBackupHost=m3-mgmt1
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
#DefaultQOS=normal,standby
DebugLevel=info
DebugFlags=DB_USAGE
Created attachment 12664 [details] Current slurmdbd logs

Please examine this.

Thanks
Hi Albert,

Should we attempt this task again?

UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;

Cheers
Damien
> Should we attempt this task again ?
>
> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup =
> 1514725200, monthly_rollup = 1514725200;

No. In your new logs I can see that the rollup is almost finished:

[2020-01-04T00:28:24.457] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1575428400-1575432000

That is, the rollup is running and already in Dec-2019 (1575428400). So in some minutes or hours it will be done.

Let's keep an eye on these log lines ("m3 curr hour is now"), and once they reach the current time, let's check whether your sreport is working again.

Regards,
Albert
Hi Damien,

> In your new logs I can see that the rollup is almost finished:
>
> [2020-01-04T00:28:24.457] 0(as_mysql_rollup.c:1072) m3 curr hour is now
> 1575428400-1575432000
>
> That is, the rollup is running and already in Dec-2019 (1575428400).
> So, in some minutes or hours it will be done.
>
> Let's keep an eye on these logs lines "m3 curr hour is now", and once they
> arrive to current time, let's check if your sreport is working again.

Actually, after a closer look at your last logs, I can see this:

[2020-01-04T00:17:45.276] Terminate signal (SIGINT or SIGTERM) received
[2020-01-04T00:18:59.820] slurmdbd version 19.05.4 started
[2020-01-04T00:18:59.822] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1568073600-1568077200

So it seems that slurmdbd was signaled/killed with something other than SIGHUP, right? In that case, I'm not sure the rollup process will complete properly. Let's wait a little bit, and then, if necessary, we will repeat comment 21.

Regards,
Albert
Hi Albert, Thanks for the details. The current rollup is: # grep curr slurmdbd.log .... .... .... [2020-01-04T00:31:00.954] 0(as_mysql_rollup.c:1755) curr day is now 1575464400-1575550800 [2020-01-04T00:31:01.603] 0(as_mysql_rollup.c:1755) curr day is now 1575550800-1575637200 [2020-01-04T00:31:02.251] 0(as_mysql_rollup.c:1755) curr day is now 1575637200-1575723600 [2020-01-04T00:31:02.898] 0(as_mysql_rollup.c:1755) curr day is now 1575723600-1575810000 [2020-01-04T00:31:03.545] 0(as_mysql_rollup.c:1755) curr day is now 1575810000-1575896400 [2020-01-04T00:31:04.193] 0(as_mysql_rollup.c:1755) curr day is now 1575896400-1575982800 [2020-01-04T00:31:04.841] 0(as_mysql_rollup.c:1755) curr day is now 1575982800-1576069200 [2020-01-04T00:31:05.494] 0(as_mysql_rollup.c:1755) curr day is now 1576069200-1576155600 [2020-01-04T00:31:06.145] 0(as_mysql_rollup.c:1755) curr day is now 1576155600-1576242000 [2020-01-04T00:31:06.795] 0(as_mysql_rollup.c:1755) curr day is now 1576242000-1576328400 [2020-01-04T00:31:07.770] 0(as_mysql_rollup.c:1755) curr day is now 1576328400-1576414800 [2020-01-04T00:31:08.417] 0(as_mysql_rollup.c:1755) curr day is now 1576414800-1576501200 [2020-01-04T00:31:09.068] 0(as_mysql_rollup.c:1755) curr day is now 1576501200-1576587600 [2020-01-04T00:31:09.718] 0(as_mysql_rollup.c:1755) curr day is now 1576587600-1576674000 [2020-01-04T00:31:10.371] 0(as_mysql_rollup.c:1755) curr day is now 1576674000-1576760400 [2020-01-04T00:31:11.022] 0(as_mysql_rollup.c:1755) curr day is now 1576760400-1576846800 [2020-01-04T00:31:11.672] 0(as_mysql_rollup.c:1755) curr day is now 1576846800-1576933200 [2020-01-04T00:31:12.320] 0(as_mysql_rollup.c:1755) curr day is now 1576933200-1577019600 [2020-01-04T00:31:12.968] 0(as_mysql_rollup.c:1755) curr day is now 1577019600-1577106000 [2020-01-04T00:31:13.618] 0(as_mysql_rollup.c:1755) curr day is now 1577106000-1577192400 [2020-01-04T00:31:14.266] 0(as_mysql_rollup.c:1755) curr day is now 1577192400-1577278800 [2020-01-04T00:31:14.914] 0(as_mysql_rollup.c:1755) curr day is now 1577278800-1577365200 [2020-01-04T00:31:15.574] 0(as_mysql_rollup.c:1755) curr day is now 1577365200-1577451600 [2020-01-04T00:31:16.222] 0(as_mysql_rollup.c:1755) curr day is now 1577451600-1577538000 [2020-01-04T00:31:16.870] 0(as_mysql_rollup.c:1755) curr day is now 1577538000-1577624400 [2020-01-04T00:31:17.518] 0(as_mysql_rollup.c:1755) curr day is now 1577624400-1577710800 [2020-01-04T00:31:18.167] 0(as_mysql_rollup.c:1755) curr day is now 1577710800-1577797200 [2020-01-04T00:31:18.817] 0(as_mysql_rollup.c:1755) curr day is now 1577797200-1577883600 [2020-01-04T00:31:19.544] 0(as_mysql_rollup.c:1755) curr day is now 1577883600-1577970000 [2020-01-04T00:31:20.198] 0(as_mysql_rollup.c:1755) curr day is now 1577970000-1578056400 [2020-01-04T00:31:20.848] 0(as_mysql_rollup.c:1755) curr month is now 1567260000-1569852000 [2020-01-04T00:31:21.009] 0(as_mysql_rollup.c:1755) curr month is now 1569852000-1572526800 [2020-01-04T00:31:21.072] 0(as_mysql_rollup.c:1755) curr month is now 1572526800-1575118800 [2020-01-04T00:31:21.138] 0(as_mysql_rollup.c:1755) curr month is now 1575118800-1577797200 [2020-01-04T01:00:00.220] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578056400-1578060000 [2020-01-04T02:00:00.328] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578060000-1578063600 Is this going to be alright ? Cheers Damien
Hi Damien,

From the dates of your last logs I can see that the rollup has already finished. But I'm not sure it went well because of the SIGTERM that was used (instead of the SIGHUP).

Could you check whether the sreport info is now what you expect? For example, are these commands returning the right information for you?

$ sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-09-10
$ sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10

Or are they still returning an empty output? If they are still empty, then we will need to repeat steps 0) to 4) of comment 21.

Regards,
Albert
Hi Albert,

Somehow, the rollup isn't progressing at all. See details:

$ sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-01-01T00:00:00 - 2018-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

$ sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-05-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-01-01T00:00:00 - 2019-05-09T23:59:59 (11149200 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

Should I repeat steps 0-4 again?

Cheers
Damien
Created attachment 12670 [details] Yesterday's slurmdbd logs
Hi Albert,

I will re-run steps 0-4 again.

Thanks
Damien
Details:

# systemctl stop slurmdbd

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

# systemctl start slurmdbd
# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-01-05 20:48:55 AEDT; 52s ago
  Process: 23892 ExecStart=/opt/slurm-19.05.4/sbin/slurmdbd (code=exited, status=0/SUCCESS)
 Main PID: 23894 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─23894 /opt/slurm-19.05.4/sbin/slurmdbd

Jan 05 20:48:55 m3-mgmt2 systemd[1]: Starting Slurm DBD accounting daemon...
Jan 05 20:48:55 m3-mgmt2 systemd[1]: PID file /opt/slurm/var/run/slurmdbd.pid not readable (yet?) after start.
Jan 05 20:48:55 m3-mgmt2 systemd[1]: Started Slurm DBD accounting daemon.

# cat slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
#DbdAddr=
DbdHost=m3-mgmt2
DbdBackupHost=m3-mgmt1
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
#DefaultQOS=normal,standby
DebugLevel=info
DebugFlags=DB_USAGE
LogFile=/mnt/slurm-logs/slurmdbd.log
PidFile=/opt/slurm/var/run/slurmdbd.pid
...
...
...
# sdiag ******************************************************* sdiag output at Sun Jan 05 20:53:48 2020 (1578218028) Data since Sun Jan 05 11:00:00 2020 (1578182400) ******************************************************* Server thread count: 3 Agent queue size: 0 Agent count: 0 DBD Agent queue size: 0 Jobs submitted: 7088 Jobs started: 6353 Jobs completed: 5562 Jobs canceled: 1018 Jobs failed: 0 Job states ts: Sun Jan 05 20:53:42 2020 (1578218022) Jobs pending: 1448 Jobs running: 592 Main schedule statistics (microseconds): Last cycle: 27471 Max cycle: 802957 Total cycles: 1959 Mean cycle: 70358 Mean depth cycle: 408 Cycles per minute: 3 Last queue length: 1408 Backfilling stats Total backfilled jobs (since last slurm start): 202497 Total backfilled jobs (since last stats cycle start): 5039 Total backfilled heterogeneous job components: 0 Total cycles: 1019 Last cycle when: Sun Jan 05 20:53:21 2020 (1578218001) Last cycle: 141285 Max cycle: 145525328 Mean cycle: 2192981 Last depth cycle: 1408 Last depth cycle (try sched): 264 Depth Mean: 1719 Depth Mean (try depth): 307 Last queue length: 1408 Queue length mean: 1729 Latency for 1000 calls to gettimeofday(): 27 microseconds Remote Procedure Call statistics by message type REQUEST_FED_INFO ( 2049) count:5285629 ave_time:109 total_time:577229559 REQUEST_JOB_INFO_SINGLE ( 2021) count:5102192 ave_time:385304 total_time:1965900032724 REQUEST_PARTITION_INFO ( 2009) count:4041568 ave_time:155 total_time:628195478 REQUEST_NODE_INFO_SINGLE ( 2040) count:1035813 ave_time:306177 total_time:317142507075 REQUEST_STEP_COMPLETE ( 5016) count:306999 ave_time:99697 total_time:30606993657 MESSAGE_EPILOG_COMPLETE ( 6012) count:303591 ave_time:259603 total_time:78813185056 REQUEST_COMPLETE_PROLOG ( 6018) count:303531 ave_time:358997 total_time:108966800488 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:300354 ave_time:296373 total_time:89017103008 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:291833 ave_time:208149 total_time:60744983368 REQUEST_JOB_USER_INFO ( 2039) count:183380 ave_time:240325 total_time:44070870638 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:155714 ave_time:216273 total_time:33676870292 REQUEST_KILL_JOB ( 5032) count:18667 ave_time:31078 total_time:580139572 REQUEST_CONTROL_STATUS ( 2053) count:16582 ave_time:113 total_time:1879345 REQUEST_JOB_READY ( 4019) count:11394 ave_time:320067 total_time:3646846680 REQUEST_UPDATE_NODE ( 3002) count:11050 ave_time:47152 total_time:521040427 REQUEST_JOB_STEP_CREATE ( 5001) count:3674 ave_time:5399 total_time:19837210 REQUEST_RESOURCE_ALLOCATION ( 4001) count:3213 ave_time:99653 total_time:320187814 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:3197 ave_time:310173 total_time:991624330 REQUEST_JOB_INFO ( 2003) count:2940 ave_time:230417 total_time:677428673 REQUEST_SHARE_INFO ( 2022) count:1720 ave_time:4350 total_time:7482128 REQUEST_NODE_INFO ( 2007) count:732 ave_time:162297 total_time:118801883 REQUEST_RESERVATION_INFO ( 2024) count:681 ave_time:112640 total_time:76707947 REQUEST_BUILD_INFO ( 2001) count:496 ave_time:124063 total_time:61535616 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:484 ave_time:318084 total_time:153953059 REQUEST_PING ( 1008) count:209 ave_time:98 total_time:20516 REQUEST_CANCEL_JOB_STEP ( 5005) count:149 ave_time:166049 total_time:24741416 REQUEST_JOB_REQUEUE ( 5023) count:41 ave_time:320618 total_time:13145362 REQUEST_UPDATE_JOB ( 3001) count:26 ave_time:241148 total_time:6269871 ACCOUNTING_UPDATE_MSG (10001) count:23 ave_time:267997 total_time:6163948 REQUEST_STATS_INFO ( 2035) 
count:19 ave_time:160 total_time:3049 REQUEST_JOB_STEP_INFO ( 2005) count:18 ave_time:79970 total_time:1439470 REQUEST_UPDATE_RESERVATION ( 3009) count:7 ave_time:339883 total_time:2379187 REQUEST_CREATE_RESERVATION ( 3006) count:6 ave_time:584184 total_time:3505106 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:5 ave_time:294182 total_time:1470914 ACCOUNTING_REGISTER_CTLD (10003) count:4 ave_time:88363 total_time:353453 REQUEST_PRIORITY_FACTORS ( 2026) count:2 ave_time:222624 total_time:445249 REQUEST_DELETE_RESERVATION ( 3008) count:2 ave_time:90494 total_time:180988 REQUEST_TRIGGER_SET ( 2017) count:1 ave_time:143 total_time:143 Remote Procedure Call statistics by user swil0005 ( 10264) count:4776587 ave_time:211373 total_time:1009642698670 root ( 0) count:3454922 ave_time:190785 total_time:659149073667 asegal ( 11199) count:602103 ave_time:116767 total_time:70306296721 vpam0001 ( 11208) count:440336 ave_time:103518 total_time:45582950337 hven0001 ( 10350) count:382377 ave_time:121750 total_time:46554575000 claudiak ( 11247) count:374354 ave_time:97038 total_time:36326902148 joshuamhardy ( 10156) count:340457 ave_time:119542 total_time:40699111831 robe0002 ( 12508) count:322189 ave_time:104570 total_time:33691308597 jinmingz ( 12653) count:307693 ave_time:110626 total_time:34038914724 spiper ( 12815) count:273645 ave_time:54378 total_time:14880367316 ychen ( 11648) count:241669 ave_time:87925 total_time:21248959266 ska565 ( 10255) count:237335 ave_time:98843 total_time:23458969424 aazhar ( 11233) count:208767 ave_time:114387 total_time:23880429936 strikm ( 12849) count:204618 ave_time:112867 total_time:23094801719 scho0011 ( 10572) count:203313 ave_time:107091 total_time:21773130513 cpen0001 ( 10450) count:201477 ave_time:147547 total_time:29727373710 qhao ( 12730) count:170165 ave_time:123615 total_time:21034990075 vieth ( 11159) count:169133 ave_time:158804 total_time:26859114073 xiaoliuz ( 10462) count:165261 ave_time:124182 total_time:20522580031 zoey ( 12871) count:155061 ave_time:162552 total_time:25205622325 anirka ( 12061) count:148696 ave_time:120795 total_time:17961820065 kngu0030 ( 11191) count:147892 ave_time:151416 total_time:22393360967 huiw ( 11363) count:143849 ave_time:164400 total_time:23648810727 jafarl ( 10347) count:143272 ave_time:147580 total_time:21144169452 szhou ( 10465) count:137997 ave_time:123024 total_time:16977057538 sarahsyh ( 11365) count:121744 ave_time:137291 total_time:16714470080 mcfadyen ( 10725) count:120742 ave_time:100926 total_time:12186021144 ttha0011 ( 11105) count:119246 ave_time:133196 total_time:15883199475 zyu ( 12933) count:113759 ave_time:143486 total_time:16322839750 moya0001 ( 12915) count:99689 ave_time:63050 total_time:6285402851 ldill ( 12521) count:76008 ave_time:132145 total_time:10044077828 galisa ( 10716) count:72641 ave_time:156348 total_time:11357326488 rbeare ( 10299) count:58607 ave_time:136209 total_time:7982833426 lindacac ( 11464) count:58196 ave_time:151928 total_time:8841612977 ksabaroe ( 10365) count:56816 ave_time:164156 total_time:9326702165 a1058400 ( 11860) count:55839 ave_time:139722 total_time:7801981104 mfarrell ( 10679) count:50020 ave_time:130276 total_time:6516443014 ctav0001 ( 12340) count:48213 ave_time:162952 total_time:7856434261 pzhao ( 12949) count:47245 ave_time:148631 total_time:7022114217 tnewing ( 10925) count:45732 ave_time:155070 total_time:7091706843 louisaps ( 11398) count:44061 ave_time:92058 total_time:4056168232 yannyinc ( 10505) count:42865 ave_time:149373 total_time:6402875018 aw474 ( 11662) 
count:41634 ave_time:144716 total_time:6025106294 msolovev ( 12837) count:41541 ave_time:146177 total_time:6072372827 mhos0007 ( 10900) count:41360 ave_time:172423 total_time:7131415452 jaszczur ( 11828) count:40177 ave_time:138800 total_time:5576578996 fgag0002 ( 12538) count:38351 ave_time:166825 total_time:6397911747 xzha0043 ( 11476) count:34900 ave_time:155263 total_time:5418679277 zhiqianx ( 11058) count:34607 ave_time:154425 total_time:5344192307 apoz0003 ( 11024) count:33772 ave_time:117764 total_time:3977134722 stuarto ( 10273) count:30187 ave_time:177455 total_time:5356834440 jste0021 ( 10774) count:30128 ave_time:158491 total_time:4775018907 dmpar7 ( 11685) count:29634 ave_time:234738 total_time:6956235225 govindap ( 11549) count:28820 ave_time:175051 total_time:5044973901 qwang1 ( 11135) count:27969 ave_time:97339 total_time:2722490753 jhendrik ( 10515) count:24099 ave_time:182218 total_time:4391274365 pcoo0005 ( 11639) count:23095 ave_time:167659 total_time:3872085939 mkoeda ( 11862) count:22950 ave_time:174783 total_time:4011285397 yaghmaien ( 12664) count:22198 ave_time:161156 total_time:3577361041 megh0001 ( 11997) count:21933 ave_time:47738 total_time:1047050184 nourank ( 11994) count:19776 ave_time:180700 total_time:3573531225 mnak0010 ( 11610) count:19580 ave_time:27441 total_time:537299418 clementeadam ( 10691) count:19172 ave_time:173858 total_time:3333207528 melvinl ( 12907) count:19136 ave_time:108347 total_time:2073334479 tandrill ( 11162) count:18161 ave_time:178073 total_time:3233995756 slurm ( 497) count:16609 ave_time:505 total_time:8396746 hfettke ( 11128) count:15678 ave_time:169441 total_time:2656503506 phoebeimms ( 10690) count:14828 ave_time:157577 total_time:2336565926 bmoffat ( 12008) count:14212 ave_time:133626 total_time:1899101747 earsenau ( 10724) count:13152 ave_time:137014 total_time:1802009458 tdissana ( 10686) count:13007 ave_time:146443 total_time:1904788443 andrewpe ( 10152) count:12822 ave_time:244065 total_time:3129410173 permezelf ( 12929) count:12798 ave_time:151064 total_time:1933327415 julianab ( 12946) count:11901 ave_time:168979 total_time:2011019083 dawhite ( 12881) count:11852 ave_time:125491 total_time:1487323424 rshi0007 ( 10423) count:11698 ave_time:133010 total_time:1555961613 bktan12 ( 11006) count:11396 ave_time:161681 total_time:1842522728 mom008 ( 11465) count:11258 ave_time:72377 total_time:814826789 aburmest1 ( 12975) count:10463 ave_time:141915 total_time:1484861084 kpaw0001 ( 10567) count:9956 ave_time:62541 total_time:622665731 suzanm ( 11205) count:9887 ave_time:163837 total_time:1619862596 jamesr ( 12917) count:9581 ave_time:160528 total_time:1538024920 judominguez ( 10934) count:8887 ave_time:203153 total_time:1805421024 ademarco ( 10256) count:8833 ave_time:177587 total_time:1568630680 kdkan3 ( 12953) count:8647 ave_time:116190 total_time:1004701014 aabu0005 ( 11216) count:8446 ave_time:174502 total_time:1473846180 ymei ( 11123) count:8334 ave_time:130454 total_time:1087204816 nrogasch ( 10517) count:8241 ave_time:177806 total_time:1465303523 gfullers ( 12941) count:7946 ave_time:140812 total_time:1118896543 damienl ( 10005) count:7622 ave_time:164583 total_time:1254456277 tbiczok ( 11180) count:7504 ave_time:131691 total_time:988211100 dspark ( 11831) count:7055 ave_time:181935 total_time:1283555206 adamshephard ( 10901) count:6706 ave_time:32015 total_time:214695781 dapergu ( 11857) count:6370 ave_time:173681 total_time:1106351132 zseeger ( 10027) count:6305 ave_time:28131 total_time:177372258 iharding ( 10381) 
count:5833 ave_time:148085 total_time:863781620 ewilliam ( 12847) count:5709 ave_time:191181 total_time:1091455829 ykha0001 ( 10766) count:5457 ave_time:151603 total_time:827298039 sfwon17 ( 12934) count:5120 ave_time:176767 total_time:905047774 benfulcher ( 10424) count:4486 ave_time:131741 total_time:590991109 simonb ( 10830) count:4435 ave_time:183933 total_time:815743363 sravanfar ( 12509) count:4248 ave_time:169800 total_time:721313291 camillab ( 10501) count:4161 ave_time:159129 total_time:662137685 kmban4 ( 11033) count:4131 ave_time:190332 total_time:786261896 nsingh ( 12834) count:4027 ave_time:184753 total_time:744002623 aurina ( 10385) count:3997 ave_time:126118 total_time:504095710 yoge0001 ( 11113) count:3978 ave_time:27056 total_time:107632094 ljia110 ( 10233) count:3910 ave_time:27307 total_time:106771981 peterl ( 10905) count:3480 ave_time:119417 total_time:415573169 vishalc ( 12719) count:3345 ave_time:32306 total_time:108066322 inas0002 ( 10978) count:3215 ave_time:116009 total_time:372970046 sgwe0001 ( 10173) count:2929 ave_time:47957 total_time:140468963 ctiw0001 ( 12015) count:2895 ave_time:58289 total_time:168747048 bmajor ( 12512) count:2727 ave_time:139026 total_time:379126127 jmton7 ( 12943) count:2196 ave_time:42457 total_time:93237247 xvuthith ( 10699) count:1748 ave_time:76291 total_time:133358371 tsepehri ( 11188) count:1707 ave_time:183171 total_time:312674469 aher0013 ( 11326) count:1642 ave_time:28058 total_time:46071730 tbui ( 12935) count:1587 ave_time:22361 total_time:35487104 scohen1 ( 11404) count:1519 ave_time:123714 total_time:187922032 sbednarek ( 11506) count:1498 ave_time:33230 total_time:49779274 pzaremoo ( 10041) count:1352 ave_time:91206 total_time:123311633 mziemann ( 10442) count:1269 ave_time:83213 total_time:105598546 rkai0001 ( 10719) count:1229 ave_time:45272 total_time:55640022 aceg4000 ( 10727) count:1066 ave_time:123745 total_time:131912834 jzhou ( 11691) count:1035 ave_time:25848 total_time:26753363 taoh ( 11834) count:1030 ave_time:19250 total_time:19827874 amar0047 ( 10989) count:962 ave_time:95813 total_time:92172157 mtang1 ( 12982) count:920 ave_time:90786 total_time:83523830 ganjq ( 10783) count:898 ave_time:54372 total_time:48826402 hbandara ( 10629) count:810 ave_time:133367 total_time:108027292 uqhsun8 ( 12012) count:806 ave_time:113564 total_time:91533045 andrewc ( 10452) count:791 ave_time:153213 total_time:121191798 hhew0002 ( 11689) count:790 ave_time:116917 total_time:92364847 philipc ( 10003) count:738 ave_time:101720 total_time:75069494 tshaw ( 11932) count:735 ave_time:111857 total_time:82215140 creboul ( 10332) count:666 ave_time:240181 total_time:159960962 xuanlih ( 10644) count:658 ave_time:105494 total_time:69415062 esomayeh ( 10379) count:639 ave_time:52879 total_time:33790207 yzha0576 ( 12828) count:628 ave_time:17735 total_time:11137867 lawyl1 ( 10025) count:595 ave_time:69755 total_time:41504246 helenm ( 11051) count:594 ave_time:153806 total_time:91361024 scottk ( 11656) count:544 ave_time:162602 total_time:88455901 pmajka ( 11257) count:535 ave_time:96999 total_time:51894557 ciden1 ( 10709) count:523 ave_time:126679 total_time:66253180 hbor4 ( 12526) count:515 ave_time:39184 total_time:20180243 dsubedi ( 12857) count:483 ave_time:62795 total_time:30330074 wcao0006 ( 10768) count:436 ave_time:153782 total_time:67049126 bangw ( 11947) count:413 ave_time:72297 total_time:29858921 kcorbett ( 12751) count:411 ave_time:238572 total_time:98053421 mchen1 ( 10991) count:408 ave_time:103528 total_time:42239709 sgujjari 
( 12661) count:406 ave_time:147109 total_time:59726647 rfro0003 ( 11455) count:370 ave_time:112172 total_time:41503868 wtsch1 ( 11038) count:363 ave_time:147584 total_time:53573184 mliu153 ( 10667) count:355 ave_time:73361 total_time:26043229 yyer0001 ( 12041) count:353 ave_time:216543 total_time:76439952 yzhang1 ( 11678) count:346 ave_time:68847 total_time:23821374 rjoh0016 ( 12888) count:333 ave_time:192596 total_time:64134786 elam ( 10854) count:324 ave_time:237683 total_time:77009311 gahun3 ( 12820) count:305 ave_time:37434 total_time:11417606 debopris ( 11349) count:304 ave_time:13713 total_time:4168760 vchr0002 ( 11487) count:296 ave_time:80685 total_time:23882886 rwic0002 ( 11371) count:292 ave_time:65437 total_time:19107737 xjia0032 ( 11502) count:285 ave_time:19269 total_time:5491763 hede2 ( 12967) count:274 ave_time:159774 total_time:43778263 mbelouso ( 10091) count:268 ave_time:68789 total_time:18435508 hngu0026 ( 11057) count:257 ave_time:27413 total_time:7045295 rrag0004 ( 12334) count:237 ave_time:33318 total_time:7896424 pubua ( 12833) count:231 ave_time:80494 total_time:18594191 zdyson ( 11463) count:218 ave_time:192859 total_time:42043381 siyuanh ( 10797) count:211 ave_time:54773 total_time:11557109 kvoigt ( 11346) count:206 ave_time:125098 total_time:25770354 zguo61 ( 12968) count:168 ave_time:31954 total_time:5368372 mkha0097 ( 12931) count:165 ave_time:27113 total_time:4473809 dbw ( 10800) count:158 ave_time:181044 total_time:28605027 pleu0005 ( 11794) count:158 ave_time:40353 total_time:6375793 trungn ( 10419) count:150 ave_time:223696 total_time:33554444 michaelr ( 11512) count:118 ave_time:43280 total_time:5107081 amaligg ( 10873) count:111 ave_time:95843 total_time:10638642 kwong ( 12517) count:107 ave_time:158567 total_time:16966704 ssingh ( 11014) count:104 ave_time:153402 total_time:15953858 ec2-user ( 1000) count:90 ave_time:68201 total_time:6138175 bche132 ( 10239) count:86 ave_time:114158 total_time:9817664 klow12 ( 10192) count:85 ave_time:116761 total_time:9924697 pwib0001 ( 12841) count:62 ave_time:37243 total_time:2309079 szhang ( 12694) count:60 ave_time:26926 total_time:1615580 pt5684 ( 11786) count:60 ave_time:219032 total_time:13141977 khodgins ( 11538) count:49 ave_time:209620 total_time:10271421 aazi0007 ( 12038) count:45 ave_time:15090 total_time:679090 sherenan ( 11901) count:22 ave_time:204737 total_time:4504227 xli479 ( 11039) count:21 ave_time:334382 total_time:7022034 nisald ( 12338) count:21 ave_time:2039 total_time:42831 gholamrh ( 10097) count:15 ave_time:5580 total_time:83703 ctan ( 10189) count:11 ave_time:135917 total_time:1495095 swatts ( 11445) count:8 ave_time:173205 total_time:1385641 chines ( 10011) count:8 ave_time:51216 total_time:409728 lper0012 ( 11103) count:7 ave_time:68797 total_time:481585 pab07 ( 10875) count:6 ave_time:950 total_time:5704 rgomi ( 11448) count:3 ave_time:150652 total_time:451957 yliu0291 ( 12964) count:3 ave_time:3896 total_time:11688 Pending RPC statistics No pending RPCs # sacctmgr show stats Rollup statistics Hour count:0 ave_time:0 max_time:0 total_time:0 Day count:0 ave_time:0 max_time:0 total_time:0 Month count:0 ave_time:0 max_time:0 total_time:0 Remote Procedure Call statistics by message type DBD_STEP_COMPLETE ( 1441) count:60 ave_time:29866 total_time:1792012 DBD_JOB_COMPLETE ( 1424) count:30 ave_time:40383 total_time:1211494 DBD_SEND_MULT_MSG ( 1474) count:12 ave_time:145284 total_time:1743411 SLURM_PERSIST_INIT ( 6500) count:3 ave_time:3442 total_time:10328 DBD_FINI ( 1401) count:2 ave_time:226 
total_time:452 DBD_STEP_START ( 1442) count:2 ave_time:77160 total_time:154320 DBD_REGISTER_CTLD ( 1434) count:1 ave_time:23248 total_time:23248 DBD_GET_ASSOCS ( 1410) count:1 ave_time:903980 total_time:903980 DBD_GET_TRES ( 1486) count:1 ave_time:41687 total_time:41687 DBD_GET_QOS ( 1448) count:1 ave_time:736 total_time:736 DBD_JOB_START ( 1425) count:1 ave_time:81920 total_time:81920 DBD_CLUSTER_TRES ( 1407) count:1 ave_time:1376 total_time:1376 Remote Procedure Call statistics by user slurm ( 497) count:108 ave_time:46427 total_time:5014128 root ( 0) count:7 ave_time:135833 total_time:950836
# grep curr slurmdbd.log .... .... .... [2020-01-05T10:12:30.247] 0(as_mysql_rollup.c:1755) curr day is now 1577365200-1577451600 [2020-01-05T10:12:30.900] 0(as_mysql_rollup.c:1755) curr day is now 1577451600-1577538000 [2020-01-05T10:12:31.553] 0(as_mysql_rollup.c:1755) curr day is now 1577538000-1577624400 [2020-01-05T10:12:32.206] 0(as_mysql_rollup.c:1755) curr day is now 1577624400-1577710800 [2020-01-05T10:12:32.860] 0(as_mysql_rollup.c:1755) curr day is now 1577710800-1577797200 [2020-01-05T10:12:33.517] 0(as_mysql_rollup.c:1755) curr day is now 1577797200-1577883600 [2020-01-05T10:12:34.177] 0(as_mysql_rollup.c:1755) curr day is now 1577883600-1577970000 [2020-01-05T10:12:34.835] 0(as_mysql_rollup.c:1755) curr day is now 1577970000-1578056400 [2020-01-05T10:12:35.494] 0(as_mysql_rollup.c:1755) curr day is now 1578056400-1578142800 [2020-01-05T10:12:36.152] 0(as_mysql_rollup.c:1755) curr month is now 1567260000-1569852000 [2020-01-05T10:12:36.323] 0(as_mysql_rollup.c:1755) curr month is now 1569852000-1572526800 [2020-01-05T10:12:36.386] 0(as_mysql_rollup.c:1755) curr month is now 1572526800-1575118800 [2020-01-05T10:12:36.446] 0(as_mysql_rollup.c:1755) curr month is now 1575118800-1577797200 [2020-01-05T11:00:00.536] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578178800-1578182400 [2020-01-05T12:00:00.651] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578182400-1578186000 [2020-01-05T13:00:00.770] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578186000-1578189600 [2020-01-05T14:00:00.879] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578189600-1578193200 [2020-01-05T15:00:00.996] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578193200-1578196800 [2020-01-05T16:00:00.117] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578196800-1578200400 [2020-01-05T17:00:00.241] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578200400-1578204000 [2020-01-05T18:00:00.422] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578204000-1578207600 [2020-01-05T19:00:00.538] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578207600-1578211200 [2020-01-05T20:00:00.687] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578211200-1578214800 [2020-01-05T20:48:59.824] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514725200-1514728800 [2020-01-05T20:49:36.231] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514728800-1514732400 [2020-01-05T20:50:11.603] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514732400-1514736000 [2020-01-05T20:50:49.505] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514736000-1514739600 [2020-01-05T20:51:24.940] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514739600-1514743200 [2020-01-05T20:52:01.426] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514743200-1514746800 [2020-01-05T20:52:35.893] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514746800-1514750400 [2020-01-05T20:53:11.876] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514750400-1514754000 [2020-01-05T20:53:47.179] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514754000-1514757600 [2020-01-05T20:54:23.453] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514757600-1514761200 [2020-01-05T20:54:59.562] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514761200-1514764800 [2020-01-05T20:55:37.539] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514764800-1514768400 [2020-01-05T20:56:13.236] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514768400-1514772000 [2020-01-05T20:56:48.166] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514772000-1514775600 [2020-01-05T20:57:22.783] 0(as_mysql_rollup.c:1072) m3 curr hour is now 
1514775600-1514779200 [2020-01-05T20:57:58.447] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514779200-1514782800
Hi Albert,

Something doesn't look right. See:

# grep curr slurmdbd.log
....
....
....
[2020-01-06T16:18:40.770] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523253600-1523257200
[2020-01-06T16:19:20.166] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523257200-1523260800
[2020-01-06T16:20:04.982] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523260800-1523264400
[2020-01-06T16:20:50.152] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523264400-1523268000
[2020-01-06T16:21:39.481] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523268000-1523271600
[2020-01-06T16:22:20.248] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523271600-1523275200
[2020-01-06T16:22:58.921] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523275200-1523278800
[2020-01-06T16:23:37.047] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523278800-1523282400
[2020-01-06T16:24:16.126] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523282400-1523286000

1523282400 as an epoch time (assuming seconds) is:
GMT: Monday, 9 April 2018 2:00:00 PM
Your time zone: Tuesday, 10 April 2018 12:00:00 AM GMT+10:00
Relative: 2 years ago

But:

# sreport cluster AccountUtilizationByUser -t hours start=2018-03-01 end=2018-03-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-03-09T23:59:59 (777600 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used   Energy
--------- --------------- --------- --------------- ---------- --------

Still no values. Kindly advise.

Thanks
Damien
Created attachment 12671 [details] slurmdbd logs

Current slurmdbd log. Kindly review.
Hi Damien,

> I will re-run steps 0-4 again.

That was the right decision. Thanks for the detailed information, it helps me a lot.

> Something doesn't look right,
>
> # grep curr slurmdbd.log
> [2020-01-06T16:24:16.126] 0(as_mysql_rollup.c:1072) m3 curr hour is now
> 1523282400-1523286000
>
> 1523282400 Epoch time
> Assuming that this timestamp is in seconds:
> GMT: Monday, 9 April 2018 2:00:00 PM
> Your time zone: Tuesday, 10 April 2018 12:00:00 AM GMT+10:00
> Relative: 2 years ago
>
> # sreport cluster AccountUtilizationByUser -t hours start=2018-03-01 end=2018-03-10
> --------------------------------------------------------------------------------
> Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-03-09T23:59:59 (777600 secs)
> Usage reported in CPU Hours
> --------------------------------------------------------------------------------
>   Cluster         Account     Login     Proper Name       Used   Energy
> --------- --------------- --------- --------------- ---------- --------
>
> Still no value.

I understand your point, but that's expected. The rollup is now working properly, so please don't stop the daemon until it fully ends.

The reason why you still cannot see that information is that the rollup first does all the hourly aggregations/tables up to the current time, then the daily ones, and finally the monthly ones. Your sreport command is internally translated into queries against the daily tables, and those are not done yet. You should check with sreport only once the rollup has fully completed.

It is working now. Keep an eye on the logs, just make sure that slurmdbd is not stopped until the rollup is done, and keep me updated! ;-)

Regards,
Albert
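If you want an extra sanity check before the rollup finishes, you could count rows in the daily usage table directly. The table name below follows Slurm's usual <cluster>_assoc_usage_day_table naming, which is an assumption on my side, so adjust it if your schema differs; a non-zero, growing count for 2018 onwards would indicate the daily aggregation is being repopulated:

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_assoc_usage_day_table WHERE time_start >= 1514725200;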
Hi Albert,

The progress is still on the slow side. Details:

# grep curr slurmdbd.log
...
...
[2020-01-09T00:24:46.189] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547226000-1547229600
[2020-01-09T00:24:58.307] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547229600-1547233200
[2020-01-09T00:25:10.480] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547233200-1547236800
[2020-01-09T00:25:23.083] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547236800-1547240400

1547240400 as an epoch time, assuming seconds:
GMT: Friday, 11 January 2019 8:00:00 PM
Your time zone: Saturday, 12 January 2019 7:00:00 AM GMT+11:00 DST
Relative: A year ago

Cheers
Damien
It looks good. Let's see if it finishes by the end of the week.

Thanks for letting me know,
Albert
# grep Error slurmdbd.log
.....
.....
.....
[2020-01-13T09:00:06.350] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578837600-1578841200
[2020-01-13T09:00:06.438] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578841200-1578844800
[2020-01-13T09:00:06.526] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578844800-1578848400
[2020-01-13T09:00:06.613] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578848400-1578852000
[2020-01-13T09:00:06.704] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578852000-1578855600
[2020-01-13T09:00:06.792] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578855600-1578859200
[2020-01-13T09:00:06.881] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578859200-1578862800
[2020-01-13T09:00:06.969] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578862800-1578866400
[2020-01-13T09:00:07.081] 0(as_mysql_rollup.c:1755) curr day is now 1578574800-1578661200
[2020-01-13T09:00:10.728] 0(as_mysql_rollup.c:1755) curr day is now 1578661200-1578747600
[2020-01-13T09:00:13.876] 0(as_mysql_rollup.c:1755) curr day is now 1578747600-1578834000
Hi Albert,

The sreports are working now (2018-2020), but I am observing this:

--
$ sacctmgr show reservation
   Cluster            Name     TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- -------- ------------------- ------------------- ----------
        m3  CryoemFacility   cpu=60 2019-10-28T20:16:55 2020-05-27T10:12:23 9.3522e+06
        m3       cryosparc   cpu=28 2019-10-28T20:16:55 2020-11-08T11:33:37 1.2592e+07
        m3       M3-backup   cpu=48 2019-10-28T20:16:55 2020-03-14T08:43:14 1.2571e+07
        m3  mxbeampostproc   cpu=48 2019-10-28T20:16:55 2025-01-08T11:38:57 1.2592e+07
        m3             nfs    cpu=6 2019-10-28T20:16:55 2020-09-22T18:17:40 1.2592e+07
        m3          simple  cpu=120 2019-10-28T20:16:55 2020-06-29T10:41:27 1.1655e+07
        m3             uow   cpu=52 2019-10-28T20:16:55 2025-01-08T10:00:00 1.2393e+07
        m3                          2019-10-30T12:03:48 2020-01-13T09:45:32 1.2376e+07
        m3                          2019-10-30T12:03:48 2025-06-30T09:00:00 1.2449e+07
        m3                          2019-10-30T12:03:48 2025-06-30T09:00:00 1.2449e+07
        m3         highmem   cpu=36 2019-12-02T19:57:29 2028-11-30T11:34:35 8.9936e+06
        m3                  cpu=188 2020-01-06T11:47:03 2020-01-13T12:03:39 5.3497e+06
        m3            DDN2  cpu=360 2020-01-07T22:48:49 2020-01-17T00:00:00 6.3564e+06
        m3                          2020-01-13T09:45:32 2025-06-30T09:00:00 2.2468e+04
        m3                  cpu=188 2020-01-13T12:03:39 2025-06-30T09:00:00 1.1489e+04

Notice the empty entries...

$ scontrol show reservation

ReservationName=mxbeampostproc StartTime=2017-12-18T12:20:35 EndTime=2025-01-08T11:38:57 Duration=2577-23:18:22 Nodes=m3d[006-007] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=48 Users=(null) Accounts=-training Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryosparc StartTime=2018-11-09T11:33:37 EndTime=2020-11-08T11:33:37 Duration=730-00:00:00 Nodes=m3h015 NodeCnt=1 CoreCnt=28 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=28 Users=jafarl,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=sexton StartTime=2019-02-07T11:36:44 EndTime=2025-06-30T09:00:00 Duration=2334-22:23:16 Nodes=dgx[003-004],m3g[000,006],m3p010 NodeCnt=5 CoreCnt=188 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=188 Users=galisa,lynnliang,mbelouso,xzha0043,ctan,spiper,rjoh0016,trungn,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=M3-backup StartTime=2019-03-15T08:43:14 EndTime=2025-06-30T09:00:00 Duration=2299-01:16:46 Nodes=m3d[002,008] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=48 Users=ctan,trungn,rchiu Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CryoemFacility StartTime=2019-05-28T10:12:23 EndTime=2025-06-30T09:00:00 Duration=2224-22:47:37 Nodes=m3c004,m3g014 NodeCnt=2 CoreCnt=60 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=60 Users=lancew,hven0001,jafarl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=AWX StartTime=2019-06-07T14:48:59 EndTime=2025-06-30T09:00:00 Duration=2214-18:11:01 Nodes=m3a[009-010] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=48 Users=trungn,ctan,damienl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=simple StartTime=2019-06-30T10:41:27 EndTime=2020-06-29T10:41:27 Duration=365-00:00:00 Nodes=m3a[016-020] NodeCnt=5 CoreCnt=120 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=120 Users=hael,creboul,zqi1,sarahle,lancew,marionb,ctan,skie0002,mcaggian Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=uow StartTime=2019-09-13T11:06:48 EndTime=2025-01-08T10:00:00 Duration=1943-21:53:12 Nodes=m3c011,m3h011 NodeCnt=2 CoreCnt=52 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=52 Users=lancew,simonb,jafarl,jbouwer Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=highmem StartTime=2019-12-02T15:26:56 EndTime=2028-11-30T11:34:35 Duration=3285-20:07:39 Nodes=m3m000 NodeCnt=1 CoreCnt=36 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=36 Users=trungn,ctan,damienl,chines,lancew,jafarl,nwong,kerriw,philipc,smichnow,jchang,mmcg0002,angu0022,ska565,ozgeo,rbeare,zseeger,keving,swil0005,lawyl1,nlkar1,stya0001,kholt Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=DDN2 StartTime=2020-01-06T14:40:54 EndTime=2020-01-17T00:00:00 Duration=10-09:19:06 Nodes=m3i[030-039] NodeCnt=10 CoreCnt=360 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=360 Users=(null) Accounts=nq46 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
--

Is this problem related?

Cheers
Damien
Created attachment 12711 [details] Current slurmdbd.log Kindly review
Hi Damien,

> The sreports are working now (2018-2020)

Good! I think we can now consider the initial issue related to sreport fixed.

> but I am observing this:
> $ sacctmgr show reservation
> Notice the empty entries...

I see some issues:

- mxbeampostproc started even before you installed 18.08 (2017-12-18).
- All except "simple" and "DDN2" have a Duration of more than one year.

I'm not sure why or how you ended up with such long reservations, is that intentional?

> Is this problem related?

No, the blank values and the long durations are not related to the rollup we have just done. It is probably a good idea to close this one and open a new bug for it. We will need some output from the DB to work on it, but it will probably be easier if you share the DB with us so we can look at it directly.

Regards,
Albert
Hi Albert,

Thanks for your replies.

What is standard practice for the duration of a reservation? Not more than 2 years?

Yes, the empty entries should be treated as a separate ticket.

You want the output of the DB? How? Which SQL commands, or a mysqldump?

Cheers
Damien
Hi Damien,

> What is standard practice for the duration of a reservation? Not more than 2 years?

Well, I would say that such long reservations are unusual. With more information about the goal of your reservations we could recommend some alternatives. For example, flags like DAILY, WEEKLY, WEEKDAY or WEEKEND make sense in some scenarios, or your goal might be better accomplished with a different partition configuration or a QOS. It depends on what you are trying to achieve. Feel free to open a specific bug to explain your goals and how you are using reservations, and we will discuss it further there.

> Yes, the empty entries should be treated as a separate ticket.
> You want the output of the DB? How? Which SQL commands, or a mysqldump?

Yes, a mysqldump is a good option. I guess you will need some kind of file-sharing tool, so mark your comment as private to keep your DB private.

Regards,
Albert
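As a rough sketch of the dump itself (the username and output path here are placeholders, so adapt them to your environment; run it on the database host or anywhere with network access to it):

# Consistent snapshot of the accounting DB without locking the whole server
mysqldump -u slurm -p --single-transaction --hex-blob slurm_acct_db | gzip > slurm_acct_db.sql.gz

--single-transaction keeps the dump consistent for InnoDB tables while slurmdbd stays up; with tables of this size, expect the dump to take a while and to need a fair amount of disk space.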
Hi Damien,

If this is OK with you, I'm closing this ticket as "info given", because the main issue related to the sreport information is fixed. Please don't hesitate to reopen it if you need further support, or to open a new one so we can keep helping you with the reservation issue.

Regards,
Albert
Hi Albert,

Thanks for everything. I am fine with closing this case, but we have not covered any measures to prevent this from happening again.

Cheers
Damien
You are right, Damien. We are tracking internally the investigation of what could lead to your issue, which has also been reported by other users. I'll let you know if we find something.

Albert
Good Afternoon Slurm Support

Sorry for bringing this up again after a while. Are there any follow-ups or preventive measures for this matter? I want to mitigate the risk because we are planning to move from 19.05.04 to 20.02.03 soon.

Cheers
Damien
Hi Damien!

> Sorry for bringing this up again after a while.

We are glad to hear from you again, especially if everything is working and you are just being proactive! ;-)

> Are there any follow-ups or preventive measures for this matter? I want to mitigate the risk because we are planning to move from 19.05.04 to 20.02.03 soon.

So far, most of the similar cases are related to slurmdbd being killed externally in the middle of a rollup process. As mentioned in comment 17, and especially at the end of comment 23, in your case that was done by logrotate, but you have already fixed it. In other cases it can be related to a very large number of runaway jobs, or to very old ones, as those also lead to a long rollup process. Some minor cases are related to very long reservations, which may also lead to long rollups.

Actually, all similar cases were reported on 18.08 or 19.05; none has been reported on 20.02 so far, and plenty of sites are already using 20.02.

Anyway, the specific preventive actions related to this rollup/sreport issue would be:

- Check and fix runaway jobs with "sacctmgr show runaway". If you are clean, they won't trigger long rollups.
- Check that reservations are correctly in sync between slurmctld (with "scontrol show reservations") and slurmdbd (with "sacctmgr show reservations"). If they are in sync, they won't trigger long rollups either.
- Check the slurmdbd logs to confirm it is not being terminated for any unknown reason, and that includes no manual signaling with SIGINT / SIGTERM / SIGHUP. If slurmdbd is not being killed, rollups will work normally.

With these extra preventive actions (a short shell sketch bundling them follows below), you can go ahead and follow the normal upgrade steps, which also include other, more general, preventive actions. As a reference:

https://slurm.schedmd.com/quickstart_admin.html#upgrade
https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf (slides 5-10)

I hope that helps,
Albert
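A minimal sketch of those three checks, run before the upgrade. It assumes you run it from the slurmdbd log directory (as in the earlier grep commands); the grep pattern is only a starting point, since the exact wording of a termination message can vary between versions.

# 1) Runaway jobs: ideally nothing is listed
sacctmgr show runaway

# 2) Reservations known to slurmctld vs. those stored in slurmdbd; compare by eye
scontrol show reservations
sacctmgr show reservations

# 3) Signs of slurmdbd being stopped or signalled unexpectedly
grep -iE 'terminat|signal' slurmdbd.log | tail -20

Running these right before stopping the daemons for the upgrade should surface any pending long rollup before it can be interrupted.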
Hi Albert,

Thanks for your reply. I will follow your notes and read Tim's rants.

Cheers
Damien
OK Damien. I'm closing this bug again as infogiven. If you face any issue or need any support during the upgrade, please don't hesitate to open a new bug. You can reopen this one too, but if the issue is not related to the original one, opening a new bug and keeping this one focused on the original issue is a much better option.

Thanks,
Albert
Thanks