Hello, one of our users got a few errors like these in her .err output files:

"slurmstepd: error: Exceeded step memory limit at some point."

But the jobs completed successfully. Can we be sure that the data created by these jobs is okay, given that the jobs ran out of memory at some point? Also, we see that the MaxRSS size is lower than the ReqMem size, so Slurm accounting is not seeing the exceeded memory.

[root@ru-lhpc-head ~]# sacct -o JobID,ReqMem,MaxVMSize,MaxRSS,MaxRSSTask,State,NodeList -j 61474580,61474581,61474582,61474583
       JobID     ReqMem  MaxVMSize     MaxRSS MaxRSSTask      State        NodeList
------------ ---------- ---------- ---------- ---------- ---------- ---------------
61474580           13Gn                                   COMPLETED        lhpc-624
61474580.ba+       13Gn  18422080K   5874364K          0  COMPLETED        lhpc-624
61474581           13Gn                                   COMPLETED        lhpc-504
61474581.ba+       13Gn  18749468K   6402976K          0  COMPLETED        lhpc-504
61474582           13Gn                                   COMPLETED        lhpc-451
61474582.ba+       13Gn  18421788K   5786416K          0  COMPLETED        lhpc-451
61474583           13Gn                                   COMPLETED        lhpc-451
61474583.ba+       13Gn  19142556K   5926744K          0  COMPLETED        lhpc-451

It would be great to get a better understanding of this.

regards,
Hjalti Sveinsson
Hjalti - this is a known issue, and there's an enhancement request (bug 2489) open to address it. If you query sacct -j <jobid>.<stepid>, it shows CANCELLED because of the exceeded memory, but the job itself is shown as COMPLETED, which we understand can be confusing to users. I have these proposals, but no work has been done yet:

a) When querying sacct -j <jobid>, only flag the state COMPLETED if the job completed and all of its steps also finished in a COMPLETED state.
b) Add an extra field to sacct -j <jobid> indicating whether any of its steps failed.
c) Open to different proposals.

Let me discuss this with the team.

Regarding the memory question: MaxRSS being lower than ReqMem doesn't mean that the memory was not exceeded. In fact, MaxVMSize is around 17G. The virtual memory includes memory that has been swapped out, shared libraries, and memory shared among tasks.
Hi Alejandro,

Okay, so you say this is a known issue. Then I have a few questions I need answered:

1. How do we know if the job data is correct? If the ExitCode is 0:0 but we still got the "slurmstepd: error: Exceeded step memory limit at some point." message in the .err file, can we be sure that the data is okay?
2. How do we get the slurmstepd id number? Where do we find that id?
3. You say this is a known issue. Has it been fixed in Slurm 16.05.6?

Please let us know as soon as possible so we can inform our users, because we need to know whether the data from the jobs is okay or not.

regards,
Hjalti Sveinsson
(In reply to Hjalti Sveinsson from comment #3)
> Hi Alejandro,
>
> Okay, so you say this is a known issue. Then I have a few questions I need
> to get answers to.
>
> 1. How do we know if the job data is correct? If the ExitCode is 0:0 but we
> still got the "slurmstepd error: Exceeded step memory limit at some point"
> in the .err file can we be sure that the data is OKAY?

You can't be sure that the data is correct if any of your steps failed. Let me explain with an example.

Let's launch a batch script containing two srun steps plus the batch step:

$ cat test.batch
#!/bin/bash
#SBATCH -N1
echo "step0:"
srun --mem=1M -N1 -n1 ./mem_eater
echo "step1:"
srun -N1 -n1 hostname
exit 0

$ sbatch test.batch
Submitted batch job 2006

The first srun step will be cancelled because it is going to exceed the memory limit, and the second step will finish successfully:

$ cat slurm-2006.out
step0:
srun: error: compute1: task 0: Killed
srun: Force Terminated job step 2006.0
step1:
testbed

Now let's look at the output of sacct to understand whether the job "ran well":

$ sacct -j 2006 -o jobid,jobname,state,exitcode,derivedexitcode
       JobID    JobName      State ExitCode DerivedExitCode
------------ ---------- ---------- -------- ---------------
2006         test.batch  COMPLETED      0:0             0:9
2006.batch        batch  COMPLETED      0:0
2006.0        mem_eater  CANCELLED      0:9
2006.1         hostname  COMPLETED      0:0

If you look at the ExitCode field, the number to the left of the colon (:) indicates the exit code of the bash script, which is 0. When a signal was responsible for a job or step's termination, the signal number is displayed after the exit code, delineated by a colon (:). So in this example, step 0 of job 2006 finished in a CANCELLED state due to signal 9, as a consequence of the exceeded memory limit.

Given this description of a job's exit code, one can imagine a scenario where a central task of a batch job fails but the script returns an exit code of zero, indicating success.
In many cases, a user may not be able to ascertain the success or failure of a job until after examining the job's output files. For this reason the job includes a "derived exit code" field. It is initially set to the highest exit code returned by all of the job's steps (srun invocations). The job's derived exit code is determined by the Slurm control daemon and sent to the database when the accounting_storage plugin is enabled.

> 2. How do we get the slurmstepd id number? Where do we find that id?

When you run sacct -j <jobid>, the output lists a disaggregated view containing the job, the batch step, and the successive steps run by the job. Another way of learning the step id is to add the 'Steps' value to DebugFlags; then you can see messages like this in slurmctld.log:

slurmctld: sched: _slurm_rpc_job_step_create: StepId=2006.0 compute1 usec=423

> 3. You say this is a known issue. Has it been fixed in Slurm 16.05.6?

More than a known issue, this is something we had to discuss. Today we talked about it and realized this is working by design. If we made the batch script's exit code the highest of the step exit codes, we could not discern whether the batch script itself failed or whether one of its steps failed. So if the job script exits correctly, the exit code is 0 and the state is COMPLETED even if any of the steps failed. The good thing is that you can see everything clearly disaggregated with sacct, as I explained above.

> Please let us know as soon as possible so we can inform our users because we
> need to know if the data from the jobs is okay or not.
>
> regards,
> Hjalti Sveinsson

Please let me know if this makes sense to you. Thanks.
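The "check each step, not just the job" rule above can be automated. The sketch below is an illustration, not an official Slurm tool: it parses sacct's parsable output (sacct -j <jobid> -p -o JobID,State,ExitCode) and fails if any step did not finish COMPLETED. The helper name check_steps is made up, and a sample of the output from the example above is embedded so the script runs stand-alone:

```shell
#!/bin/bash
# Sketch: decide whether any step of a job failed, based on sacct's
# parsable ("-p") output. In real use you would feed it:
#   sacct -j <jobid> -p -o JobID,State,ExitCode | check_steps
# check_steps is a hypothetical helper name, not part of Slurm.
check_steps() {
    # Skip the header row, then flag any row whose State is not COMPLETED.
    # Exit status: 0 if everything completed, 1 otherwise.
    awk -F'|' 'NR > 1 && $2 != "" && $2 != "COMPLETED" {
        printf "step %s failed: %s (%s)\n", $1, $2, $3; bad = 1
    } END { exit bad }'
}

# Embedded sample matching the sacct example above.
sample='JobID|State|ExitCode|
2006|COMPLETED|0:0|
2006.batch|COMPLETED|0:0|
2006.0|CANCELLED|0:9|
2006.1|COMPLETED|0:0|'

if printf '%s\n' "$sample" | check_steps; then
    echo "all steps completed"
else
    echo "at least one step did not complete"
fi
# prints:
#   step 2006.0 failed: CANCELLED (0:9)
#   at least one step did not complete
```

This catches the exact situation in the example: the job row says COMPLETED 0:0, but the per-step rows reveal the cancelled step.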
*** Bug 2489 has been marked as a duplicate of this bug. ***
Thank you for the quick response and the answers.

When I run the command that you sent on the jobs that had "slurmstepd: error: Exceeded step memory limit at some point." in the .err files, I see ExitCode 0:0 and DerivedExitCode 0:0. That should mean that the jobs ran well even though we got the error message.

Can I safely say, then, that if we get ExitCode 0:0 and DerivedExitCode 0:0 on jobs, the data from the jobs is okay and we can ignore the message in the .err file? Is the message perhaps only saying that it had to swap out part of the job, but the job still finished, since the exit codes are 0:0?

[root@ru-lhpc-head ~]# sacct -j 61474580,61474581,61474582,61474583 -o jobid,jobname,state,exitcode,derivedexitcode
       JobID    JobName      State ExitCode DerivedExitCode
------------ ---------- ---------- -------- ---------------
61474580          bphen  COMPLETED      0:0             0:0
61474580.ba+      batch  COMPLETED      0:0
61474581          bphen  COMPLETED      0:0             0:0
61474581.ba+      batch  COMPLETED      0:0
61474582          bphen  COMPLETED      0:0             0:0
61474582.ba+      batch  COMPLETED      0:0
61474583          bphen  COMPLETED      0:0             0:0
61474583.ba+      batch  COMPLETED      0:0

regards,
Hjalti Sveinsson
(In reply to Hjalti Sveinsson from comment #7)
> When I do run the command that you sent on the jobs that had the
> "slurmstepd: error: Exceeded step memory limit at some point." in the .err
> files I see ExitCode 0:0 and DerivedExitCode 0:0. That should mean that the
> jobs did run well even though we did get the error message.

Exactly.

> Can I safely say then that if we do get ExitCode 0:0 and DerivedExitCode 0:0
> on jobs that the data from the jobs is okay and we can ignore the message in
> the .err file? Is the message maybe only saying that it had to swap out part
> of the job but it still finished since we have the exit codes at 0:0?

That's it. The "Exceeded step memory limit at some point" message indicates that the step exceeded the memory limit imposed by either the MaxMemPer{CPU,Node} parameter or the --mem option. But if you don't set ConstrainSwapSpace in your cgroup.conf, or you do but the step does not exceed the AllowedSwapSpace, then the step will use the swap space and continue running, and thus is neither signaled nor CANCELLED.

So yes, I'd say that you can safely say that the data is okay.

> [root@ru-lhpc-head ~]# sacct -j 61474580,61474581,61474582,61474583 -o
> jobid,jobname,state,exitcode,derivedexitcode
>        JobID    JobName      State ExitCode DerivedExitCode
> ------------ ---------- ---------- -------- ---------------
> 61474580          bphen  COMPLETED      0:0             0:0
> 61474580.ba+      batch  COMPLETED      0:0
> 61474581          bphen  COMPLETED      0:0             0:0
> 61474581.ba+      batch  COMPLETED      0:0
> 61474582          bphen  COMPLETED      0:0             0:0
> 61474582.ba+      batch  COMPLETED      0:0
> 61474583          bphen  COMPLETED      0:0             0:0
> 61474583.ba+      batch  COMPLETED      0:0
>
> regards,
> Hjalti Sveinsson
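For reference, the swap-related settings mentioned above live in cgroup.conf. A minimal sketch of the relevant section follows; the values here are illustrative defaults-style examples, not a recommendation for this site:

```
# cgroup.conf (illustrative values, not a recommendation)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes       # enforce the RAM limit on jobs/steps
ConstrainSwapSpace=yes      # also constrain swap usage
AllowedSwapSpace=10         # percent of allocated RAM additionally allowed as swap
```

With ConstrainSwapSpace=no (or AllowedSwapSpace large enough), a step that exceeds its RAM limit spills into swap and keeps running, which is exactly the "error message but COMPLETED 0:0" situation described above.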
Great! Thank you for these answers!
(In reply to Hjalti Sveinsson from comment #9) > Great! Thank you for these answers! You're welcome, closing the bug. Please, re-open if you encounter further issues.
Hello, I think I am seeing a variant of this bug. I ran an array job using sbatch --wait, and one of the array tasks was cancelled for exceeding memory. I was expecting sbatch to return a non-zero exit code, since the man page says "In the case of a job array, the exit code recorded will be the highest value for any task in the job array." But instead zero is returned, since the derived exit code appears to be 0 even though one of the steps' exit codes is non-zero. This makes sbatch --wait's return code unreliable for scripting, since there is no way to determine whether the jobs ran successfully or were partially cancelled.

# slurm_164842_932_jobs.err:
slurmstepd: error: Job 165775 exceeded memory limit (67670480 > 67108864), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 165775 ON server1 CANCELLED AT 2017-03-13T06:16:41 ***
slurmstepd: error: Exceeded step memory limit at some point.

# sacct output:
> sacct -j 164842_932 -o jobid%16,AllocTres%21,MaxRSS,MaxVMSize,state%14,exitcode%8,derivedexitcode
           JobID             AllocTRES     MaxRSS  MaxVMSize          State ExitCode DerivedExitCode
---------------- --------------------- ---------- ---------- -------------- -------- ---------------
      164842_932 cpu=16,mem=64G,node=1                        CANCELLED by 0      0:0             0:0
164842_932.batch cpu=16,mem=64G,node=1  67670480K  73585804K       CANCELLED     0:15

# slurm version:
> slurmd -V
slurm 16.05.8
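One workaround for scripting around this, assuming the behavior described above, is to ignore sbatch --wait's return code and query sacct afterwards for any non-COMPLETED row. This is a hedged sketch, not an official recipe: job_really_ok is a made-up helper, and the sample data embedded below reproduces the sacct output above so the snippet runs without a Slurm installation:

```shell
#!/bin/bash
# Sketch of a post-check for "sbatch --wait": since the return code can be 0
# even when the job was cancelled for exceeding memory, re-check via sacct
# before declaring success. job_really_ok is a hypothetical helper name.
job_really_ok() {
    # Expects sacct parsable output: JobID|State|ExitCode|
    # Fails (exit 1) if any non-header row's State does not start with
    # COMPLETED; "CANCELLED by 0" therefore counts as a failure.
    awk -F'|' 'NR > 1 && $2 != "" && $2 !~ /^COMPLETED/ { bad = 1 }
               END { exit bad }'
}

# In real use:
#   sbatch --wait job.sh
#   sacct -j "$jobid" -p -o JobID,State,ExitCode | job_really_ok
# Embedded sample matching the report above:
sample='JobID|State|ExitCode|
164842_932|CANCELLED by 0|0:0|
164842_932.batch|CANCELLED|0:15|'

if printf '%s\n' "$sample" | job_really_ok; then
    echo "job and all steps completed"
else
    echo "job was cancelled or a step failed; do not trust the data"
fi
# prints: job was cancelled or a step failed; do not trust the data
```

Matching on the State prefix (rather than exact equality) is deliberate: sacct can append qualifiers such as "by 0" to a state, and only COMPLETED variants should be treated as success.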
Please don't reopen bugs from other submitters. After reviewing the info given in this bug, if you still feel there is a problem, please open a new bug. Thanks!
Hi Felip,

Thanks for the patch! It works in my testing so far: my unit test passes, and I also tried running the Slurm daemon, which parses cleanly. I had tried regcomp earlier, and at that point I noticed a race condition in the Slurm daemon before switching to memcpy, but I don't see any race condition now, so maybe this fixes it completely. Appreciate your help!
Sorry, I replied to the wrong bug.