Bug 2741 - Sacct information not updated
Summary: Sacct information not updated
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 15.08.11
Hardware: Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
Reported: 2016-05-17 06:01 MDT by Davide Vanzo
Modified: 2016-05-18 06:37 MDT

Site: Vanderbilt


Attachments
Config and log files (3.80 MB, application/x-compressed-tar)
2016-05-17 06:01 MDT, Davide Vanzo
slurmctl entries from messages log on primary controller (977.29 KB, application/x-gzip)
2016-05-17 06:51 MDT, Sean Johnson
slurmdbd.log (12.44 KB, text/x-log)
2016-05-18 06:18 MDT, Davide Vanzo

Description Davide Vanzo 2016-05-17 06:01:06 MDT
Created attachment 3101 [details]
Config and log files

Hi,
I am contacting you to ask for advice on an issue we are experiencing with the SLURM database. We noticed that the job states reported by sacct do not match those shown by squeue. It is as if the database stopped getting updates from SLURM and is showing old job states.
Our first thought was a slurmdbd malfunction, so we restarted the daemon, but that did not solve the issue. The same goes for mysqld. The only strange thing we noticed is a series of identical errors in the slurmdbd.log file that say:

[2016-05-17T13:50:19.786] error: There is no reservation by id 35, time_start 1455136005, and cluster 'accre'

However, there was no reservation set to start at that time.
Attached are the configuration files and logs. Please let me know if you need any additional information.
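As a quick illustration of the mismatch, comparing the two views for a single job looks roughly like this (the job ID below is just a placeholder):

# Controller's view of the job state
squeue -j 1234567 -o "%i %T"
# Accounting database's view of the same job
sacct -j 1234567 --format=JobID,State,End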

Davide
Comment 1 Tim Wickberg 2016-05-17 06:09:03 MDT
Can you check if there's a dbd.messages file in your SlurmSpoolDir on the slurmctld host?

It looks like jobs have run under a reservation that slurmdbd wasn't aware of, and all subsequent job info messages are likely being spooled in that file until this is resolved.

I'm looking to see if any recent change may have provoked this, or if there's a way to get those currently-stuck jobs out of the way to restore normal operation.
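A quick check is something like the following (the spool path below is only an example; use whatever StateSaveLocation points to on your controller):

# Find where slurmctld keeps its state/spool files
scontrol show config | grep -i StateSaveLocation
# A growing dbd.messages file there suggests messages are being spooled (example path)
ls -lh /var/spool/slurm/dbd.messages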
Comment 2 Davide Vanzo 2016-05-17 06:24:44 MDT
Tim,
yes, the dbd.messages file exists but is empty.
A couple more pieces of information: last week we activated the elasticsearch plugin, but we do not see how that could be the cause of the current issue.
Two days ago, close to the date/time when the database seems to have stopped receiving updates, we updated the start time of five reservations that we had previously created for maintenance purposes. None of those reservations has ever been active, and none is currently active.

Davide
Comment 3 Tim Wickberg 2016-05-17 06:25:20 MDT
For some reason, there's a reservation update message blocking things. The timestamp reference is for February though, which seems unusual.

Are there any long-running reservations in the system? Can you attach 'scontrol show reservation' for reference?

Is there anything that may have been trying to modify past reservations for some reason? The message first appears at 2016-05-16T10:11:24.824, but the slurmctld log doesn't cover that, and I'm curious whether something odd was happening then. If you have yesterday's slurmctld log, that may shed some light.
Comment 4 Sean Johnson 2016-05-17 06:41:18 MDT
> Are there any long-running reservations in the system? 

There is: it covers our QA systems, of which only one is known to Slurm. Jobs don't get scheduled to that system.

> Can you attach 'scontrol show reservation' for reference?

Yup[1]. We have a set of reservations for next week for a scheduled rolling outage of compute nodes to perform a GPFS upgrade. Other than that, the only standing reservation is the cfengine_test one, which is ostensibly our QA system.

> If you have yesterday's slurmctld log that may shed some light.

It appears to me that we don't have a slurmctld log. I looked in /var/log/slurm on the primary controller, and there was a slurmdbd log, but not a controller log. On the secondary controller the /var/log/slurm directory doesn't exist.

Also, a grep through the configuration[2] leads me to believe that we aren't logging. 



[1]
[johns443@vmps11 ~]$ scontrol show res
ReservationName=cfengine_test StartTime=2015-05-22T14:44:45 EndTime=2016-05-27T00:00:00 Duration=370-09:15:15
   Nodes=vmp588 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=8
   Users=(null) Accounts=-guest_account Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=0

ReservationName=gpfs_upgrade_1 StartTime=2016-05-23T13:00:00 EndTime=2016-05-23T15:00:00 Duration=02:00:00
   Nodes=vmp[201-203,217-221,301,306-343,381,416,502-504,507,509-510,512,514-518,520-528,552-564,591-593,603-610,612,614,616,618,620,623,802,805-807,1040-1044,1081-1091,1243-1244,1258] NodeCnt=123 CoreCnt=1236 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=1236
   Users=vanzod Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=gpfs_upgrade_2 StartTime=2016-05-25T13:00:00 EndTime=2016-05-25T15:00:00 Duration=02:00:00
   Nodes=vmp[204-206,222-226,302,344-380,382-383,421,505,529-540,542-548,565-578,594-596,611,613,615,617,619,622,624,626,628,630,632,634,636,808,813,815,818,902,1001-1002,1045-1048,1092-1102,1207,1245-1247] NodeCnt=125 CoreCnt=1260 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=1260
   Users=vanzod Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=gpfs_upgrade_3 StartTime=2016-05-27T13:00:00 EndTime=2016-05-27T15:00:00 Duration=02:00:00
   Nodes=vmp[207-209,227-231,303,384-415,417-420,422-424,461,506,511,579,581-590,597-598,621,625,627,629,631,640,642,644,646,648,652-653,659,692,720-723,824-826,833,903,1003,1007-1025,1049-1052,1103-1113,1217,1248-1250] NodeCnt=126 CoreCnt=1268 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=1268
   Users=vanzod Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=gpfs_upgrade_4 StartTime=2016-05-29T13:00:00 EndTime=2016-05-29T15:00:00 Duration=02:00:00
   Nodes=vmp[210-212,232-235,425-460,462,464-465,633,637-639,641,643,660-662,664-671,675-676,693-695,724-731,834-836,904,1026-1039,1053-1055,1061-1063,1114-1124,1251-1252] NodeCnt=113 CoreCnt=1144 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=1144
   Users=vanzod Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=gpfs_upgrade_5 StartTime=2016-05-31T13:00:00 EndTime=2016-05-31T15:00:00 Duration=02:00:00
   Nodes=vmp[213-216,236-239,305,463,466-500,645,647,677-686,688-690,696-698,717-719,732-739,837-838,844,905,1056-1059,1064-1080,1125-1135,1253-1256] NodeCnt=114 CoreCnt=1160 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=1160
   Users=vanzod Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

[2]
root@slurmsched1:/var/log/slurm# grep -i log /usr/scheduler/slurm/etc/*.conf
/usr/scheduler/slurm/etc/slurm.conf:#Epilog=
/usr/scheduler/slurm/etc/slurm.conf:#EpilogSlurmctld=
/usr/scheduler/slurm/etc/slurm.conf:#Prolog=
/usr/scheduler/slurm/etc/slurm.conf:#PrologFlags=
/usr/scheduler/slurm/etc/slurm.conf:#PrologSlurmctld=
/usr/scheduler/slurm/etc/slurm.conf:#SrunEpilog=
/usr/scheduler/slurm/etc/slurm.conf:#SrunProlog=
/usr/scheduler/slurm/etc/slurm.conf:#TaskEpilog=
/usr/scheduler/slurm/etc/slurm.conf:#TaskProlog=
/usr/scheduler/slurm/etc/slurm.conf:#TopologyPlugin=topology/tree
/usr/scheduler/slurm/etc/slurm.conf:#EpilogMsgTime=2000
/usr/scheduler/slurm/etc/slurm.conf:SlurmdTimeout=600    # CNJ -- raised per slurmctld log file entry on 20160201
/usr/scheduler/slurm/etc/slurm.conf:# LOGGING AND ACCOUNTING
/usr/scheduler/slurm/etc/slurm.conf:#SlurmctldLogFile=
/usr/scheduler/slurm/etc/slurm.conf:#SlurmdLogFile=
/usr/scheduler/slurm/etc/slurm.conf:#SlurmSchedLogFile=
/usr/scheduler/slurm/etc/slurm.conf:#SlurmSchedLogLevel=
/usr/scheduler/slurm/etc/slurmdbd.conf:LogFile=/var/log/slurm/slurmdbd.log
Comment 5 Sean Johnson 2016-05-17 06:51:28 MDT
Created attachment 3105 [details]
slurmctl entries from messages log on primary controller

Belay that, we do log, but just to /var/log/messages instead of a slurmctld.log.

Attached is a grep for slurmctld entries out of /var/log/messages on the primary controller. It starts at May 15. If you would like messages older than that, I can get them.
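For reference, the grep was along these lines (the output filename here is only illustrative):

grep slurmctld /var/log/messages > slurmctld_from_messages.log
gzip slurmctld_from_messages.log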
Comment 6 Tim Wickberg 2016-05-17 07:04:00 MDT
The timing closely matches up with this:

May 16 10:10:30 slurmsched1 slurmctld[18575]: sched: Updated reservation=cfengine_test accounts=-guest_account nodes=vmp588 cores=8 licenses=(null) tres=1=8 watts=0 start=2016-02-23T17:09:02 end=2016-05-27T00:00:00

Although I'm unclear why the reservation start time indicated in the logs is out of sync here.

Can you run the following in MySQL directly and see what comes back?

"select * from accre_resv_table where id_resv=35;"
Comment 7 Sean Johnson 2016-05-17 07:12:46 MDT
> Can you run the following in MySQL directly and see what comes back?

Yup. Empty set[1]. :(

[1]
MariaDB [(none)]> show databases;
+--------------------------+
| Database                 |
+--------------------------+
| information_schema       |
| mysql                    |
| performance_schema       |
| slurm_acct_db            |
| slurm_acct_db_backups    |
| slurm_jobcomp_db         |
| slurm_jobcomp_db_backups |
+--------------------------+
7 rows in set (0.00 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> show tables;
+-------------------------------+
| Tables_in_slurm_acct_db       |
+-------------------------------+
| accre_assoc_table             |
| accre_assoc_usage_day_table   |
| accre_assoc_usage_hour_table  |
| accre_assoc_usage_month_table |
| accre_event_table             |
| accre_job_table               |
| accre_last_ran_table          |
| accre_resv_table              |
| accre_step_table              |
| accre_suspend_table           |
| accre_usage_day_table         |
| accre_usage_hour_table        |
| accre_usage_month_table       |
| accre_wckey_table             |
| accre_wckey_usage_day_table   |
| accre_wckey_usage_hour_table  |
| accre_wckey_usage_month_table |
| acct_coord_table              |
| acct_table                    |
| clus_res_table                |
| cluster_table                 |
| qos_table                     |
| res_table                     |
| table_defs_table              |
| tres_table                    |
| txn_table                     |
| user_table                    |
+-------------------------------+
27 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> select * from accre_resv_table where id_resv=35;
Empty set (0.00 sec)

MariaDB [slurm_acct_db]>
Comment 8 Davide Vanzo 2016-05-18 01:17:33 MDT
I'm raising the importance since no new jobs are being added to the database.

Davide
Comment 11 Tim Wickberg 2016-05-18 05:15:20 MDT
It looks like you were caught by a bug (2507) with creating reservations while the database was down; this was fixed in 15.08.9, but looking at the logs the reservation was created before you'd upgraded. Going forward this shouldn't recur, unless some other reservations are similarly stuck.

I believe you can at least get things moving again by faking the missing reservation in the database, which would let the update message complete and the outstanding messages be processed.

If you go into mysql, you can manually create the record like so:

insert into accre_resv_table (id_resv, deleted, time_start, resv_name) values (35, 0, 1455136005, 'xxxbugxxx');

Within a minute of that entering the database, I think updates should start flowing again as normal, and sacct should start displaying the delayed job info.

Assuming that works, I'd like to see additional logs from slurmdbd and slurmctld, as well as the output from 'select * from accre_resv_table where id_resv=35;'.

After that, I'd want you to delete that reservation - both with 'scontrol delete reservation cfengine_test' and directly in the database ('delete from accre_resv_table where id_resv=35;') to ensure nothing related to that reservation causes additional issues.
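Putting the whole sequence together from the shell, it would look roughly like this (mysql credentials omitted; adjust to match your slurmdbd.conf):

# 1. Fake the missing reservation so the blocked update message can complete
mysql slurm_acct_db -e "insert into accre_resv_table (id_resv, deleted, time_start, resv_name) values (35, 0, 1455136005, 'xxxbugxxx');"
# 2. Grab the row for reference once updates are flowing again
mysql slurm_acct_db -e "select * from accre_resv_table where id_resv=35;"
# 3. Then remove the reservation and its database row
scontrol delete reservation cfengine_test
mysql slurm_acct_db -e "delete from accre_resv_table where id_resv=35;"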

- Tim
Comment 12 Davide Vanzo 2016-05-18 06:18:40 MDT
Created attachment 3110 [details]
slurmdbd.log

Tim,
after manually creating the reservation entry in accre_resv_table, the data started to flow back again as you predicted.
Here is the database query you requested:
MariaDB [slurm_acct_db]> select * from accre_resv_table where id_resv=35;
+---------+---------+-----------+-------+----------+----------+---------------+------------+------------+------+
| id_resv | deleted | assoclist | flags | nodelist | node_inx | resv_name     | time_start | time_end   | tres |
+---------+---------+-----------+-------+----------+----------+---------------+------------+------------+------+
|      35 |       0 |           | 32832 | vmp588   | 317      | cfengine_test | 1456268942 | 1464325200 | 1=8  |
+---------+---------+-----------+-------+----------+----------+---------------+------------+------------+------+
1 row in set (0.00 sec)
Attached you can find the slurmdbd log tail. The entry was created at 2016-05-18T13:31:33.097.
Both the reservation and its database entry have now been deleted. Everything seems back to normal.
You can close the ticket. I will reopen it if any related issue arises.
Davide
Comment 13 Tim Wickberg 2016-05-18 06:37:29 MDT
Glad to hear that resolved it; thanks for your patience getting this sorted. I wanted to make sure I understood the cause, and that the fix wouldn't cause any additional issues.

Marking resolved now.

- Tim