Hi,

we are unable to fix runaway jobs:

# sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
2624541      chris_mot+ batch      dragon     PENDING    Unknown             Unknown
2955123      arrayJob   batch      dragon     PENDING    Unknown             Unknown
[...]
3126494      job-a      batch      dragon     RUNNING    2019-04-28T08:51:18 Unknown
3126614      BASMAH     batch      dragon     RUNNING    2019-04-28T09:54:05 Unknown
3126625      Depth2     batch      dragon     RUNNING    2019-04-28T10:17:11 Unknown
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

The list contains about 10K jobs.

Please assist,
thanks,
Pawel
This might be failing because of the other bug submitted from your site, Bug 6922. That one is also more urgent, so we should wait until that one is resolved before trying to fix runaway jobs again. Thanks
Hi,

OK, I'm waiting.

I noticed that there was progress on Bug 6922 today. I reran the command and it fixed everything:

# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster dragon

Pawel
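With a list of about 10K entries like the one in the original report, a per-state tally can be easier to read than the raw table. The following is only a minimal sketch: the function name `count_runaway_by_state` is hypothetical, and it assumes the column layout shown in the report (ID, Name, Partition, Cluster, State, ...) with no spaces inside job names. It reads saved `sacctmgr show runawayjobs` output on standard input.

```shell
# Hypothetical helper: tally runaway jobs per state from saved
# `sacctmgr show runawayjobs` output read on stdin.
# Assumption: State is the 5th whitespace-separated column, as in the
# report above, and job names contain no spaces.
count_runaway_by_state() {
    # Only rows whose first field is a numeric job ID are counted, so the
    # NOTE line, the header, the dashed separator and error lines are skipped.
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

For example, if the output was saved to a file (here called `runaway.txt`, an assumed name), `count_runaway_by_state < runaway.txt` prints one line per state with its job count.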