Nate suggested opening a separate ticket for this case: https://bugs.schedmd.com/show_bug.cgi?id=8656#c101 "The read values look valid considering memory usage overhead. OverMemoryKill appears to only enforce the step memory limit (instead of the job total) currently, which I believe is worthy of a new ticket."
Tommi, I'm looking into what Slurm should be doing. --Nate
(In reply to Nate Rini from comment #1)
> Looking into what Slurm should be doing.

Hi,

The only reliable solution that comes to mind is to combine the PSS of the extern step and the running job step and verify that the total is under the limit. Or do you mean the case where --mem-per-cpu is set and an extern step is also consuming memory?
(In reply to Tommi Tervo from comment #2)
> (In reply to Nate Rini from comment #1)
> > Looking into what Slurm should be doing.

After consulting internally about how OverMemoryKill works, we decided that this is a documentation issue (updated here: https://github.com/SchedMD/slurm/commit/b82d7c29f4fabea702dba3b08e9581e450c4f064). We no longer recommend OverMemoryKill because of its inherent limitations; instead we suggest using cgroups with 'ConstrainRAMSpace=yes', which limits memory on a per-job/step basis.

> The only reliable solution that comes to mind is to combine the PSS of the
> extern step and the running job step and verify that the total is under the limit.

Each step/task (process tree) in a job forks a new slurmstepd instance that would have to communicate with the lead slurmd instance in order to actually implement a limit for the whole job. None of the required RPCs or functionality currently exist to implement this with OverMemoryKill. Extern steps and MPI jobs actually fork secondary task instances, which also only enforce limits against the single process tree and slurmstepd instance, further complicating matters.

> Or do you mean the case where --mem-per-cpu is set and an extern step is also consuming memory?

Memory limits are set per job and can also be set for steps/tasks when using cgroups with 'ConstrainRAMSpace=yes', thanks to the built-in hierarchy of cgroups in the Linux kernel. There is currently no plan to implement this for OverMemoryKill, as we no longer suggest that sites use it.

I'm closing this ticket; please reply to it if you have any questions and we can continue from here.

Thanks,
--Nate
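
For illustration only, here is a minimal Python sketch of the gap discussed in this ticket: a check that looks at each step in isolation can pass even when the steps together exceed the job's limit. This is not Slurm code. The `Step` class, the function names, and the memory figures are all hypothetical, and the per-step check is only an assumed simplification of what a single slurmstepd can see on its own.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for one job step and its sampled memory usage."""
    name: str
    pss_mb: int  # assumed PSS sample for the step's process tree, in MB


def per_step_violations(steps, limit_mb):
    """Return the names of steps whose own usage exceeds the limit.

    Models a check made independently for each step, with no
    knowledge of what the other steps are using.
    """
    return [s.name for s in steps if s.pss_mb > limit_mb]


def job_total_exceeds(steps, limit_mb):
    """Return True if the combined usage of all steps exceeds the limit.

    Models the job-wide check, which needs usage data from every step.
    """
    return sum(s.pss_mb for s in steps) > limit_mb


# Made-up numbers: an extern step and one job step, each under the limit.
steps = [Step("extern", 600), Step("0", 700)]
limit_mb = 1000

print(per_step_violations(steps, limit_mb))  # no single step is over the limit
print(job_total_exceeds(steps, limit_mb))    # but the job as a whole is
```

With these made-up numbers the per-step check reports no violations while the combined total is 1300 MB against a 1000 MB limit. A job-wide check needs usage data from every step in one place, which is the cross-step communication the comment above says is not available to OverMemoryKill.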