Bug 3214 - slurmstepd: error: Exceeded step memory limit at some point.
Summary: slurmstepd: error: Exceeded step memory limit at some point.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 16.05.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Duplicates: 2489
Depends on:
Blocks:
 
Reported: 2016-10-27 07:12 MDT by Hjalti Sveinsson
Modified: 2020-03-13 11:15 MDT
CC List: 3 users

See Also:
Site: deCODE
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Hjalti Sveinsson 2016-10-27 07:12:11 MDT
Hello,

one of our users got a few errors like these in her err output files.

„slurmstepd: error: Exceeded step memory limit at some point.“

But the jobs completed successfully. Can we be sure that the data created from these jobs is okay since the job ran out of memory at some point? 

Also, we see that the MaxRSS size is lower than the ReqMem size, so SLURM accounting is not seeing the exceeded memory.

[root@ru-lhpc-head ~]# sacct -o JobID,ReqMem,MaxVMSize,MaxRSS,MaxRSSTask,State,NodeList -j 61474580,61474581,61474582,61474583 
       JobID     ReqMem  MaxVMSize     MaxRSS MaxRSSTask      State        NodeList 
------------ ---------- ---------- ---------- ---------- ---------- --------------- 
61474580           13Gn                                   COMPLETED        lhpc-624 
61474580.ba+       13Gn  18422080K   5874364K          0  COMPLETED        lhpc-624 
61474581           13Gn                                   COMPLETED        lhpc-504 
61474581.ba+       13Gn  18749468K   6402976K          0  COMPLETED        lhpc-504 
61474582           13Gn                                   COMPLETED        lhpc-451 
61474582.ba+       13Gn  18421788K   5786416K          0  COMPLETED        lhpc-451 
61474583           13Gn                                   COMPLETED        lhpc-451 
61474583.ba+       13Gn  19142556K   5926744K          0  COMPLETED        lhpc-451 

Would be great to get a better understanding of this.

regards,
Hjalti Sveinsson
Comment 1 Alejandro Sanchez 2016-10-28 05:02:33 MDT
Hjalti - this is a known issue and there's an enhancement bug 2489 open to address it. If you query with sacct -j <jobid>.<stepid>, the step shows CANCELLED because of the exceeded memory, but the job itself is shown as COMPLETED, which we understand might be confusing to users.
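
As an illustration (the job and step IDs here are placeholders, not values from your site), such a step-level query would look something like:

$ sacct -j <jobid>.<stepid> -o JobID,JobName,State,ExitCode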

I had these proposals, but no work has been done yet:

a) When querying with sacct -j <jobid>, only flag the state as COMPLETED if the job completed and all of its steps also finished in a COMPLETED state.

b) Add an extra field to sacct -j <jobid> indicating whether any of its steps failed or not.

c) Open to different proposals.

Let me discuss this with the team.

Regarding the memory question, I think that MaxRSS being lower than ReqMem doesn't mean that the memory was not exceeded. In fact, MaxVMSize is around 17G. The virtual memory includes memory that has been swapped out, shared libraries, and memory shared among tasks.
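
As a side note, for a step that is still running, memory usage can also be sampled live with sstat; a minimal sketch, with placeholder IDs:

$ sstat -j <jobid>.<stepid> -o JobID,MaxRSS,MaxVMSize,AveRSS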
Comment 3 Hjalti Sveinsson 2016-11-02 08:04:18 MDT
Hi Alejandro,

Okay, so you say this is a known issue. Then I have a few questions I need to get answers to.

1. How do we know if the job data is correct? If the ExitCode is 0:0 but we still got the "slurmstepd error: Exceeded step memory limit at some point" in the .err file can we be sure that the data is OKAY?

2. How do we get the slurmstepd id number? Where do we find that id?

3. You say this is a known issue. Has it been fixed in Slurm 16.05.6?

Please let us know as soon as possible so we can inform our users because we need to know if the data from the jobs is okay or not. 

regards,
Hjalti Sveinsson
Comment 4 Alejandro Sanchez 2016-11-02 09:01:16 MDT
(In reply to Hjalti Sveinsson from comment #3)
> Hi Alejandro,
> 
> Okay, so you say this is a known issue. Then I have a few questions I need
> to get answers to.
> 
> 1. How do we know if the job data is correct? If the ExitCode is 0:0 but we
> still got the "slurmstepd error: Exceeded step memory limit at some point"
> in the .err file can we be sure that the data is OKAY?

You can't be sure that the data is correct if any of your steps failed. Let me explain with an example:

Let's launch a batch script containing two srun steps + the batch step:

$ cat test.batch 
#!/bin/bash
#SBATCH -N1
echo "step0:"
srun --mem=1M -N1 -n1 ./mem_eater
echo "step1:"
srun -N1 -n1 hostname
exit 0
$ sbatch test.batch 
Submitted batch job 2006

The first srun step will be cancelled because it exceeds the memory limit, and the second step will finish successfully:

$ cat slurm-2006.out 
step0:
srun: error: compute1: task 0: Killed
srun: Force Terminated job step 2006.0
step1:
testbed

Now let's look at the output of sacct to understand if the job "ran well":

$ sacct -j 2006 -o jobid,jobname,state,exitcode,derivedexitcode
       JobID    JobName      State ExitCode DerivedExitCode 
------------ ---------- ---------- -------- --------------- 
2006         test.batch  COMPLETED      0:0             0:9 
2006.batch        batch  COMPLETED      0:0                 
2006.0        mem_eater  CANCELLED      0:9                 
2006.1         hostname  COMPLETED      0:0                 

So if you look at the ExitCode field, the number to the left of the colon (:) indicates the exit code of the bash script, which is 0. When a signal was responsible for a job or step's termination, the signal number is displayed after the exit code, delineated by a colon (:). So for this example, step 0 of job 2006 finished in a CANCELLED state due to signal 9, as a consequence of exceeding the memory limit.

After reading the above description of a job's exit code, one can imagine a scenario where a central task of a batch job fails but the script returns an exit code of zero, indicating success. In many cases, a user may not be able to ascertain the success or failure of a job until after they have examined the job's output files.

The job includes a "derived exit code" field. It is initially set to the value of the highest exit code returned by all of the job's steps (srun invocations). The job's derived exit code is determined by the Slurm control daemon and sent to the database when the accounting_storage plugin is enabled.
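
If the sjobexitmod contrib tool is installed at your site (it ships in Slurm's contribs, so availability depends on your build), the derived exit code and its comment can also be listed, or amended after the fact; a sketch with a placeholder job ID:

$ sjobexitmod -l <jobid>                            # list the derived exit code and comment
$ sjobexitmod -e 49 -r "out of memory" <jobid>      # example: record a non-zero derived exit code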
 
> 2. How do we get the slurmstepd id number? Where do we find that id?

When you run sacct -j <jobid>, the output is a disaggregated list containing the job, the batch step, and the successive steps run by the job. Another way of finding the step ID is to add the 'Steps' value to DebugFlags; then you can see messages like this in slurmctld.log:

slurmctld: sched: _slurm_rpc_job_step_create: StepId=2006.0 compute1 usec=423
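
As a sketch, that flag can be enabled either at runtime with scontrol or persistently in slurm.conf (the runtime route needs no restart):

$ scontrol setdebugflags +Steps

# or, in slurm.conf:
DebugFlags=Steps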
 
> 3. You say this is a known issue. Has it been fixed in Slurm 16.05.6?

More than a known issue, this is something we had to discuss. We talked about it today and concluded that this is working by design. If we made the batch script's exit code be the highest of the step exit codes, then we could not discern whether the batch script itself failed or one of its steps failed. So if the job script exits correctly, the exit code is 0 and the state is COMPLETED, even if some of the steps failed. The good thing is that you can see everything clearly disaggregated with sacct, as I explained above.
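
For instance, a minimal scripted check along those lines (the job ID is a placeholder; it simply prints any step whose state is not COMPLETED):

$ sacct -j <jobid> --noheader --parsable2 -o JobID,JobName,State,ExitCode | grep -v COMPLETED

If that prints nothing, every step of the job finished in a COMPLETED state.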

> Please let us know as soon as possible so we can inform our users because we
> need to know if the data from the jobs is okay or not. 
> 
> regards,
> Hjalti Sveinsson

Please, let me know if this makes sense to you. Thanks.
Comment 5 Alejandro Sanchez 2016-11-02 09:05:03 MDT
*** Bug 2489 has been marked as a duplicate of this bug. ***
Comment 7 Hjalti Sveinsson 2016-11-02 09:47:35 MDT
Thank you for the quick response and the answers.

When I run the command that you sent on the jobs that had the „slurmstepd: error: Exceeded step memory limit at some point.“ message in the .err files, I see ExitCode 0:0 and DerivedExitCode 0:0. That should mean that the jobs ran well even though we got the error message.

Can I then safely say that if we get ExitCode 0:0 and DerivedExitCode 0:0 on jobs, the data from those jobs is okay and we can ignore the message in the .err file? Is the message maybe only saying that it had to swap out part of the job, but the job still finished, since the exit codes are 0:0?

[root@ru-lhpc-head ~]# sacct -j 61474580,61474581,61474582,61474583 -o jobid,jobname,state,exitcode,derivedexitcode
       JobID    JobName      State ExitCode DerivedExitCode 
------------ ---------- ---------- -------- --------------- 
61474580          bphen  COMPLETED      0:0             0:0 
61474580.ba+      batch  COMPLETED      0:0                 
61474581          bphen  COMPLETED      0:0             0:0 
61474581.ba+      batch  COMPLETED      0:0                 
61474582          bphen  COMPLETED      0:0             0:0 
61474582.ba+      batch  COMPLETED      0:0                 
61474583          bphen  COMPLETED      0:0             0:0 
61474583.ba+      batch  COMPLETED      0:0                 

regards,
Hjalti Sveinsson
Comment 8 Alejandro Sanchez 2016-11-02 10:02:47 MDT
(In reply to Hjalti Sveinsson from comment #7)
> When I do run the command that you sent on the jobs that had the
> „slurmstepd: error: Exceeded step memory limit at some point.“ in the .err
> files I see ExitCode 0:0 and DerivedExitCode 0:0. That should mean that the
> jobs did run well even though we did get the error message. 

Exactly.

> Can I safely say then that if we do get ExitCode 0:0 and DerivedExitCode 0:0
> on jobs that the data from the jobs is okay and we can ignore the message in
> the .err file? Is the message maybe only saying that it had to swap out part
> of the job but it still finished since we have the exit codes at 0:0?

That's it. The "Exceeded step memory limit at some point" message indicates that the step exceeded the memory limit imposed by either the MaxMemPer{CPU,Node} parameter or the --mem option. But if you don't set ConstrainSwapSpace in your cgroup.conf, or you do but the step does not exceed the AllowedSwapSpace, then the step will use swap space and continue running, and thus it is not signaled and not CANCELLED. So yes, I'd say you can safely tell your users that the data is ok.
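
For reference, a minimal cgroup.conf sketch with the parameters mentioned above (the values are purely illustrative, not a recommendation for your site):

ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100      # RAM limit as a percentage of the allocated memory
AllowedSwapSpace=10      # extra swap allowed, as a percentage of the allocated memory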

> [root@ru-lhpc-head ~]# sacct -j 61474580,61474581,61474582,61474583 -o
> jobid,jobname,state,exitcode,derivedexitcode
>        JobID    JobName      State ExitCode DerivedExitCode 
> ------------ ---------- ---------- -------- --------------- 
> 61474580          bphen  COMPLETED      0:0             0:0 
> 61474580.ba+      batch  COMPLETED      0:0                 
> 61474581          bphen  COMPLETED      0:0             0:0 
> 61474581.ba+      batch  COMPLETED      0:0                 
> 61474582          bphen  COMPLETED      0:0             0:0 
> 61474582.ba+      batch  COMPLETED      0:0                 
> 61474583          bphen  COMPLETED      0:0             0:0 
> 61474583.ba+      batch  COMPLETED      0:0                 
> 
> regards,
> Hjalti Sveinsson
Comment 9 Hjalti Sveinsson 2016-11-03 04:25:54 MDT
Great! Thank you for these answers!
Comment 10 Alejandro Sanchez 2016-11-03 04:35:10 MDT
(In reply to Hjalti Sveinsson from comment #9)
> Great! Thank you for these answers!

You're welcome, closing the bug. Please, re-open if you encounter further issues.
Comment 11 harig105 2017-03-13 18:10:01 MDT
Hello,

I think I am seeing a variant of this bug.

I ran an array job using sbatch --wait and one of the array jobs was cancelled due to exceeding memory. I was expecting sbatch to return a non-zero exit code, since the man page says "In the case of a job array, the exit code recorded will be the highest value for any task in the job array." But instead zero is returned, since the derived exit code appears to be 0 even though one of the steps is non-zero.

This makes sbatch --wait's return code unreliable for scripting, since there is no way to determine whether the jobs ran successfully or were partially cancelled.
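
A possible workaround, sketched here with a placeholder script name (array_job.sh) and assuming sbatch --parsable and sacct are available, is to check the step states explicitly after sbatch --wait returns:

# --parsable makes sbatch print just "jobid[;cluster]"
jobid=$(sbatch --wait --parsable array_job.sh | cut -d';' -f1)
if sacct -j "$jobid" --noheader --parsable2 -o JobID,State | grep -qv COMPLETED; then
    echo "job $jobid had at least one non-COMPLETED step" >&2
fi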


#slurm_164842_932_jobs.err:
slurmstepd: error: Job 165775 exceeded memory limit (67670480 > 67108864), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 165775 ON server1 CANCELLED AT 2017-03-13T06:16:41 ***
slurmstepd: error: Exceeded step memory limit at some point.


#sacct output:
> sacct -j 164842_932 -o jobid%16,AllocTres%21,MaxRSS,MaxVMSize,state%14,exitcode%8,derivedexitcode
           JobID             AllocTRES     MaxRSS  MaxVMSize          State ExitCode DerivedExitCode 
---------------- --------------------- ---------- ---------- -------------- -------- --------------- 
      164842_932 cpu=16,mem=64G,node=1                       CANCELLED by 0      0:0             0:0 
164842_932.batch cpu=16,mem=64G,node=1  67670480K  73585804K      CANCELLED     0:15      

#slurm version:
> slurmd -V
slurm 16.05.8
Comment 12 Danny Auble 2017-03-13 18:14:36 MDT
Please don't reopen bugs from other submitters. If, after reviewing the info given in this bug, you still feel there is a problem, please open a new bug.

Thanks!
Comment 13 Aditi Gaur 2020-03-13 11:15:03 MDT
Hi Felip,

Thanks for the patch! I think it works for my testing for now. My unit test passes, and I also tried running the Slurm daemon, which parses cleanly.

I did try regcomp earlier, and at that point I noticed a race condition in the Slurm daemon before switching to memcpy, but I don't see any race condition now, so maybe this fixes it completely!

Appreciate your help!
Comment 14 Aditi Gaur 2020-03-13 11:15:53 MDT
Sorry, I replied to the wrong bug.