Ticket 6921

Summary: sacctmgr show runawayjobs fails
Product: Slurm
Reporter: Pawel R. Dziekonski <pawel.dziekonski>
Component: Accounting
Assignee: Broderick Gardner <broderick>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 18.08.6
Hardware: Linux
OS: Linux
Site: KAUST

Description Pawel R. Dziekonski 2019-04-28 04:46:56 MDT
Hi,

we are unable to fix runaway jobs:

# sacctmgr show runawayjobs

NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
2624541      chris_mot+      batch     dragon    PENDING             Unknown             Unknown
2955123        arrayJob      batch     dragon    PENDING             Unknown             Unknown
[...]
3126494           job-a      batch     dragon    RUNNING 2019-04-28T08:51:18             Unknown
3126614          BASMAH      batch     dragon    RUNNING 2019-04-28T09:54:05             Unknown
3126625          Depth2      batch     dragon    RUNNING 2019-04-28T10:17:11             Unknown
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable


The list contains about 10K jobs.
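
With a list this long, a per-state tally can be easier to read than the raw table. The following is a minimal sketch, not a Slurm tool. It assumes the command output has been saved as text in the column layout shown above, that job names contain no spaces, and that job IDs are plain integers. The function name tally_runaway_states is hypothetical.

```python
from collections import Counter

# States that the NOTE line in the output lists for runaway jobs.
STATES = {"PENDING", "RUNNING", "SUSPENDED"}

def tally_runaway_states(text):
    """Count job rows per state in saved `sacctmgr show runawayjobs` output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the NOTE line,
        # the column header, the dashed separator, and any error lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # Assumed column order: ID, Name, Partition, Cluster, State, ...
        state = fields[4]
        if state in STATES:
            counts[state] += 1
    return counts

# Sample rows copied from the output above.
sample = """\
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
2624541      chris_mot+      batch     dragon    PENDING             Unknown             Unknown
2955123        arrayJob      batch     dragon    PENDING             Unknown             Unknown
3126494           job-a      batch     dragon    RUNNING 2019-04-28T08:51:18             Unknown
"""

print(tally_runaway_states(sample))
```

Rows that do not match these assumptions (for example, job names with spaces) would be skipped or miscounted, so the tally is only a rough overview.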

Please assist,
thanks,
Pawel
Comment 1 Broderick Gardner 2019-04-29 11:28:34 MDT
This may be failing because of the other bug submitted from your site, Bug 6922. That bug is more urgent, so let's wait until it is resolved before trying to fix the runaway jobs again.

Thanks
Comment 2 Pawel R. Dziekonski 2019-04-30 08:01:50 MDT
Hi,

OK, I'm waiting.

I noticed that there was progress on Bug 6922 today.

I reran the command and it fixed everything:
# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster dragon

Pawel