Ticket 4625 - Broken afterok dependency when state is OUT_OF_MEMORY
Summary: Broken afterok dependency when state is OUT_OF_MEMORY
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.11.0
Hardware: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on: 3820
Blocks:
Reported: 2018-01-12 15:37 MST by Stephane Thiell
Modified: 2018-01-25 04:46 MST
CC List: 4 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Stephane Thiell 2018-01-12 15:37:34 MST
Hello,

Our Sherlock users are reporting broken job dependencies since 17.11: existing job pipelines break when using afterok on a job whose State is OUT_OF_MEMORY. Additionally, our users confirmed that even with the state being OUT_OF_MEMORY, the job completed successfully.

At first glance, the error and status reported in that case are 0, which seems good:

Jan 12 12:27:07 sh-104-21 slurmstepd[75489]: task/cgroup: /slurm/uid_38982/job_6024617: alloc=64000MB mem.limit=64000MB memsw.limit=64000MB
Jan 12 12:27:07 sh-104-21 slurmstepd[75489]: task/cgroup: /slurm/uid_38982/job_6024617/step_batch: alloc=64000MB mem.limit=64000MB memsw.limit=64000MB
Jan 12 12:43:13 sh-104-21 slurmstepd[75489]: error: Exceeded step memory limit at some point.
Jan 12 12:43:18 sh-104-21 slurmstepd[75489]: error: Exceeded job memory limit at some point.
Jan 12 12:43:31 sh-104-21 slurmstepd[75489]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Jan 12 12:43:31 sh-104-21 slurmstepd[75489]: done with job

But still, afterok doesn't work. According to https://bugs.schedmd.com/show_bug.cgi?id=3820#c25 this shouldn't happen. Have you seen similar broken dependencies in that case? Any advice?

Thanks much!
Stephane Thiell
Comment 1 Alejandro Sanchez 2018-01-15 02:13:14 MST
Hi Stephane. We are internally reviewing a patch for bug 3820 so that the job state will only change to OUT_OF_MEMORY if the oom-killer actually killed something. That would avoid situations where pages were reclaimed by the kernel and the process managed to succeed, but the job state still got marked OOM. I am going to mark this bug as dependent on bug 3820 for now; we will then address the issue here.

But taking a quick look at test_job_array_completed(), it doesn't consider the OUT_OF_MEMORY state (which has ExitCode 0:125), and we can see derived problems in the test_job_dependency() logic in src/slurmctld/job_scheduler.c. We will study the situation further and come back to you, but most probably we will need to solve bug 3820 beforehand. Thanks for your understanding.
Comment 4 Stephane Thiell 2018-01-16 19:28:31 MST
Hi Alejandro,

Thanks much! This is impacting several jobs and multiple users have reported the issue. I'll closely follow bug 3820 too. I do hope you'll find a solution and provide a patch soon.

Best regards,

Stephane
Comment 5 Alejandro Sanchez 2018-01-19 07:59:27 MST
Stephane, after doing some more tests today and discussing this internally, we think Slurm is behaving as expected with regard to the 'afterok' dependency type.

Let me elaborate why with a few examples and some notes. The following example satisfies the 'afterok' dependency:

$ sbatch --wrap "sleep 20"
Submitted batch job 20012
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20012        p1     wrap     alex  R       0:01      1 compute1
$ sbatch -d afterok:20012 --wrap "sleep 88888"
Submitted batch job 20013
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20013        p1     wrap     alex PD       0:00      1 (Dependency)
             20012        p1     wrap     alex  R       0:10      1 compute1
$ squeue # (eventually, after 20012 finishes)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20013        p1     wrap     alex  R       0:04      1 compute1
$ sacct -j 20012
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20012              wrap         p1      acct1          2  COMPLETED      0:0 
20012.batch       batch                 acct1          2  COMPLETED      0:0 

As you can see, 20012 finished with ExitCode 0:0 and state COMPLETED.

All of the following states (from slurm.h) should be considered failed with respect to dependencies:
	JOB_CANCELLED,		/* cancelled by user */
	JOB_FAILED,		/* completed execution unsuccessfully */
	JOB_TIMEOUT,		/* terminated on reaching time limit */
	JOB_NODE_FAIL,		/* terminated on node failure */
	JOB_PREEMPTED,		/* terminated due to preemption */
	JOB_BOOT_FAIL,		/* terminated due to node boot failure */
	JOB_DEADLINE,		/* terminated on deadline */
	JOB_OOM,		/* experienced out of memory error */
All of them indicate the job did not run to completion with exit code 0.

So for instance a CANCELLED job looks like this in sacct:

20010              wrap         p1      acct1          2 CANCELLED+      0:0 
20010.batch       batch                 acct1          2  CANCELLED     0:15 

and an OOM one like this:

20014         mem_eater         p1      acct1          2 OUT_OF_ME+    0:125 

and since they don't have the ExitCode 0:0 and state COMPLETED, they will never satisfy an 'afterok' dependency. 

You can view the logic in the test_job_dependency() function in src/slurmctld/job_scheduler.c, around this spot:

                } else if (dep_ptr->depend_type == SLURM_DEPEND_AFTER_OK) {
                        if (!IS_JOB_COMPLETED(djob_ptr))
                                depends = true;
                        else if (IS_JOB_COMPLETE(djob_ptr))
                                clear_dep = true;
                        else {
                                failure = true;
                                break;
                        }

The IS_JOB_COMPLETE macro is defined like this in src/common/slurm_protocol_defs.h:

#define IS_JOB_COMPLETE(_X)             \
        ((_X->job_state & JOB_STATE_BASE) == JOB_COMPLETE)

Thus, in terms of dependencies, a job will only satisfy 'afterok' if it finished with state JOB_COMPLETE; other finished states won't.
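To make that mask check concrete, here is a minimal, self-contained sketch. It uses simplified constants mirroring the slurm.h job_states enum quoted above and the IS_JOB_COMPLETE macro as shown; it is an illustration, not the actual Slurm source:

/* Illustrative sketch only -- simplified, self-contained constants that
 * mirror the slurm.h job_states enum and the quoted macro; the real
 * headers contain more members and flag bits. */
#include <stdio.h>

enum job_states {
	JOB_PENDING,	/* 0 */
	JOB_RUNNING,
	JOB_SUSPENDED,
	JOB_COMPLETE,	/* the only state that satisfies 'afterok' */
	JOB_CANCELLED,
	JOB_FAILED,
	JOB_TIMEOUT,
	JOB_NODE_FAIL,
	JOB_PREEMPTED,
	JOB_BOOT_FAIL,
	JOB_DEADLINE,
	JOB_OOM		/* experienced out of memory error */
};

#define JOB_STATE_BASE 0x000000ff	/* base state; flag bits live above */

struct job_record { unsigned int job_state; };

#define IS_JOB_COMPLETE(_X) \
	(((_X)->job_state & JOB_STATE_BASE) == JOB_COMPLETE)

int main(void)
{
	struct job_record completed = { .job_state = JOB_COMPLETE };
	struct job_record oom       = { .job_state = JOB_OOM };

	/* Prints 1 for the COMPLETED record and 0 for the OUT_OF_MEMORY
	 * one: only the former clears an 'afterok' dependency. */
	printf("COMPLETED     -> %d\n", IS_JOB_COMPLETE(&completed));
	printf("OUT_OF_MEMORY -> %d\n", IS_JOB_COMPLETE(&oom));
	return 0;
}

Compiled and run, this prints 1 for the COMPLETED record and 0 for the OUT_OF_MEMORY one, which is exactly the distinction test_job_dependency() acts on.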

Now, regarding those jobs whose spawned step tasks' memory usage hit the limit but aren't oom-killed and instead manage to finish successfully: they won't be marked as JOB_OOM anymore after the patch prepared for bug 3820. Just FYI, it was a quite involved patch, but things seem to be working as expected now, and we are just waiting for another team member to decide in which version(s) we check this in.

I think after this explanation we can proceed and mark this one as resolved/infogiven, unless you have any more questions.
Comment 6 Stephane Thiell 2018-01-24 21:08:23 MST
Hi Alejandro,

Thank you for the thorough explanation! Indeed, this behavior makes sense to me if bug 3820 is finally fixed.

Thanks!
Stephane
Comment 7 Alejandro Sanchez 2018-01-25 04:46:48 MST
(In reply to Stephane Thiell from comment #6)
> Hi Alejandro,
> 
> Thank you for the thorough explanation! Indeed, this behavior makes sense to
> me if bug 3820 is finally fixed.
> 
> Thanks!
> Stephane

All right, closing this bug then since I finally fixed bug 3820 too.