Ticket 2489

Summary: Modify how sacct shows a COMPLETED job if any of its steps didn't complete successfully
Product: Slurm Reporter: Alejandro Sanchez <alex>
Component: Accounting Assignee: Alejandro Sanchez <alex>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 5 - Enhancement    
Priority: ---    
Version: 16.05.x   
Hardware: Linux   
OS: Linux   
Site: SchedMD Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Alejandro Sanchez 2016-02-25 19:51:23 MST
This comes from Bug #2429. 

The problem involved a job with one step. The step finished in the CANCELLED state because it exceeded its memory limit, yet the job itself finished in the COMPLETED state.

Querying the job with sacct -j <jobid> reports it as COMPLETED, which gives the user the impression that "all went fine" with the job, even though its step was CANCELLED for exceeding memory.

It's true that sacct -j <jobid>.<stepid> shows CANCELLED, but the output of sacct -j <jobid> should somehow indicate whether all of the job's steps finished COMPLETED or any of them failed.

Proposals:

a) When querying with sacct -j <jobid>, report the state as COMPLETED only if the job completed and all of its steps also finished in the COMPLETED state.

b) Add an extra field to the sacct -j <jobid> output indicating whether any of the job's steps failed.

c) Open for more proposals...
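
To make proposals (a) and (b) concrete, here is a minimal Python sketch of the decision logic only. It is not Slurm code: the Step and Job classes, the function names, and the "StepsFailed" field name are hypothetical, introduced purely for illustration. For (a), the ticket does not say what state to show when a step did not complete; the sketch assumes the state of the first non-COMPLETED step is shown instead.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical data model for illustration; not part of Slurm or sacct.
@dataclass
class Step:
    step_id: str
    state: str


@dataclass
class Job:
    job_id: str
    state: str
    steps: List[Step]


def any_step_failed(job: Job) -> bool:
    """Return True if at least one step did not finish in the COMPLETED state."""
    return any(step.state != "COMPLETED" for step in job.steps)


def displayed_state_proposal_a(job: Job) -> str:
    """Proposal (a): show COMPLETED only if the job and all its steps completed.

    Assumption: otherwise show the state of the first non-COMPLETED step.
    """
    if job.state == "COMPLETED" and any_step_failed(job):
        return next(s.state for s in job.steps if s.state != "COMPLETED")
    return job.state


def job_row_proposal_b(job: Job) -> Dict[str, str]:
    """Proposal (b): keep the job state and add a hypothetical extra field."""
    return {
        "JobID": job.job_id,
        "State": job.state,
        "StepsFailed": "yes" if any_step_failed(job) else "no",
    }
```

With the case from this ticket (a COMPLETED job whose single step was CANCELLED), proposal (a) would display CANCELLED for the job under the stated assumption, while proposal (b) would still display COMPLETED with the extra field set to "yes".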