Ticket 6921 - sacctmgr show runawayjobs fails
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Broderick Gardner
Reported: 2019-04-28 04:46 MDT by Pawel R. Dziekonski
Modified: 2019-04-30 08:01 MDT (History)
Site: KAUST


Description Pawel R. Dziekonski 2019-04-28 04:46:56 MDT
Hi,

We are unable to fix runaway jobs:

# sacctmgr show runawayjobs

NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
2624541      chris_mot+      batch     dragon    PENDING             Unknown             Unknown
2955123        arrayJob      batch     dragon    PENDING             Unknown             Unknown
[...]
3126494           job-a      batch     dragon    RUNNING 2019-04-28T08:51:18             Unknown
3126614          BASMAH      batch     dragon    RUNNING 2019-04-28T09:54:05             Unknown
3126625          Depth2      batch     dragon    RUNNING 2019-04-28T10:17:11             Unknown
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable


The list contains about 10K jobs.

Please assist,
thanks,
Pawel
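
With a listing of about 10K jobs, a per-state summary can be easier to scan than the raw table. The following is a minimal sketch, assuming the listing has already been saved to a file and that job names contain no whitespace, so that State is the fifth field as in the output quoted above. The function name `count_runaway_states` and the file name `runaway.txt` are hypothetical and not part of Slurm.

```shell
# Summarize a saved `sacctmgr show runawayjobs` listing by job state.
# Reads the listing on stdin. Assumes the column order shown in this
# ticket (ID, Name, Partition, Cluster, State, ...) and job names
# without embedded whitespace, so State is whitespace field 5.
count_runaway_states() {
    # Only rows whose first field is a numeric job ID are counted;
    # the header, separator, NOTE and error lines are skipped.
    awk '$1 ~ /^[0-9]+$/ { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

Usage, assuming the listing was saved to the hypothetical file `runaway.txt`: `count_runaway_states < runaway.txt`. It prints one line per state with the number of jobs in that state.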
Comment 1 Broderick Gardner 2019-04-29 11:28:34 MDT
This might be failing because of the other bug submitted from your site, Bug 6922. That bug is more urgent, so we should wait until it is resolved before trying to fix the runaway jobs again.

Thanks
Comment 2 Pawel R. Dziekonski 2019-04-30 08:01:50 MDT
Hi,

OK, I'm waiting.

I noticed that there was progress on Bug 6922 today.

I reran the command and it fixed everything:
# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster dragon

Pawel