Hello, I just noticed that SLURM is no longer giving users information when their jobs are killed for exceeding memory. I'm not sure when this changed (it may have changed in 14.11.5 and I just never noticed). Before, I would get a message like this at the end of my stdout/stderr file:

/usr/spool/slurm/job32105/slurm_script: line 10: 22492 Killed ./eat_while
slurmstepd: error: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.

Now all I get is:

/usr/spool/slurm/job1950924/slurm_script: line 9: 5830 Killed ./eat_while

Was this change intentional? It's very useful to receive information from SLURM about why a job was killed, especially for memory errors, since MaxRSS is often not very accurate and running sacct after the fact does not always reveal the cause of job failure very clearly.

Best,
Will
Hi,
the messages were logged at the error level and some users complained, so we moved them to the info level. Can you try setting your SlurmdDebug to info and check whether you see the message?

David
(In reply to David Bigagli from comment #1)
> Hi,
> the messages were logged at the error level and some users complained so
> we moved them to the info level. Can you try to set your SlurmdDebug to info
> and test if you can see the message?
>
> David

Looks like it's already set to info:

[frenchwr@vmps11 ~]$ scontrol show config | grep -i debug
DebugFlags              = (null)
SlurmctldDebug          = info
SlurmdDebug             = info
Could you attach your program and the slurm.conf so I can try to reproduce it?
Created attachment 1890 [details] slurm.conf
Created attachment 1891 [details] eat.c
Created attachment 1892 [details] Makefile
Created attachment 1893 [details] SLURM batch script
This is actually an example that someone (maybe you?) at SchedMD shared with me when we initially set up SLURM and were testing cgroups.
I don't think it was me, but I will use it now :-). Thanks for all the data; I will get back to you later.

David
Hi,
messages logged at the error level are propagated to the user's output; messages logged at other levels are not, which is why you don't see the info messages. The info messages are written only to the slurmd log file. The reason we changed it is that the error message was written even when the process was not killed but merely tried to exceed the RSS limit, as reported by the memory.failcnt file. That's why info seemed more appropriate. Is this a serious problem for you? You could always keep a local patch.

David
> Is this a serious problem for you? You could eventually keep a local patch.
>
> David

I wouldn't call it a serious problem, but it certainly makes troubleshooting more difficult when users receive no information about why their job was killed. Like I mentioned before, looking at MaxRSS from sacct does not always help, so it's basically a guessing game for users (and administrators): they resubmit with an increased memory request and hope the job doesn't die this time around. Getting at least some information to the user about a job issue, assuming it's available, seems like good practice and promotes user friendliness, even if it's only a warning that the RSS limit was exceeded. I'm not sure I understand why you'd elect to withhold that information from the user when it's available.

Will
Because in some cases it is not an error but just information about the job's behavior, which users may not care about, and seeing the string "error" causes false alarms. It goes both ways, unfortunately.

David
In fact, the message was removed by user request because they found it misleading.
(In reply to David Bigagli from comment #12)
> Because in some cases it is not an error but just an information about the
> job behavior which users may not care about and seeing the string "error"
> cause false alarms. It goes both ways unfortunately.
>
> David

Could it be made a warning message instead? Something like:

Warning: job step may have been partially swapped out to disk.

Or what about propagating the information back to the user as a specific exit code? Why is it that in the past this message was only shown when MaxRSS was exceeded? Why can't SLURM output an error message at the moment it decides to kill a job for exceeding MaxRSS+Swap? It seems like this would avoid the false alarm.

Will
It could be a warning or an info message, but the problem is that, since this message is logged by the slurmstepd, all warning/info messages the daemon writes would be propagated to the user's output. An exit code also may not work, because the user program is killed by the OOM killer, which sends it the KILL signal, so the exit code is undefined; it is not Slurm that is killing the job. The message appears in the MaxRSS+Swap context in your installation. If you disable ConstrainSwapSpace, the OOM killer won't be invoked and the cgroup will constrain the application to the amount of memory requested; however, when the application exits, the user will still see the message. That was the original complaint that prompted the change. We could perhaps add a configurable flag to log this message, or create a new logging function that sends a specific warning to the user while the message appears as an informational one.

David
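For concreteness, the ConstrainSwapSpace setting discussed here lives in cgroup.conf. A minimal illustrative fragment follows; the values are an assumption for the sake of the example, not a recommendation, so check your site's actual file:

```
# cgroup.conf (illustrative fragment)
ConstrainRAMSpace=yes    # cap the job's RSS at its requested memory
ConstrainSwapSpace=no    # don't enforce a memory+swap limit, so the
                         # OOM killer is not invoked on overrun; the
                         # cgroup RAM cap constrains the job instead
```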
> It appears in the MaxRSS+Swap context in your installation. If you disable
> ConstrainSwapSpace then the OOM killer won't be invoked and the cgroup will
> constrain the application to the amount of memory requested; however, when
> the application exits the user will still see the message. That was the
> original complaint that prompted the change.

So if we toggle ConstrainSwapSpace to "no", then a user would get an error from SLURM about exceeding memory? Am I understanding correctly?
With the original code, yes: it would check the content of memory.failcnt, and if the number in it is greater than zero (it counts how many times memory usage hit the limit set in memory.limit_in_bytes), it would write the message at the error level to the slurmd log file and the user's output. With the modified code it writes at the info level, so the message only goes to the slurmd log file. Would it be hard for you to maintain a local patch?

David
> We could perhaps have a configurable flag to log this message or create a
> new logging function that will send a specific warning to a user, the
> message will appear as an information one.

This would get my vote. We'd prefer not to maintain a local patch if we can help it. I also imagine I am not alone in preferring the prior behavior, so I suspect other sites would find a configurable option useful as well.

FWIW, I understand why others would find this error message misleading and why you made the change. However, rather than silencing the message completely, the better solution seems to be an additional logging function, since the root of the issue is that the message does not fit nicely into any of SLURM's current logging mechanisms.
Well, vox populi, vox dei. We have reworked the fix to print a less scary message and put it back at log error. Hopefully this is a good compromise.

$ srun --mem=1 ./eatmem 1024
Jun 4 13:30:53.111396 32372 0x7f012a97f700 slurmstepd-regor01: Exceeded step memory limit at some point.
srun: error: regor01: task 0: Killed
srun: Force Terminated job step 25748.0

Commit 707268a521ea9be.

David
(In reply to David Bigagli from comment #19)
> Well vox populi vox dei. We have reworked the fix to print a less scary
> message and put it back at log error. Hopefully this is a good compromise.
>
> Commit 707268a521ea9be.
>
> David

Thank you, David. I look forward to seeing this in 14.11.8.

Will
+1 to the folks not happy with this change. It has caused numerous extra tickets for us where, in the past, the user could have known the reason the job did not succeed. Was this change announced? I don't believe I saw it if so. Thanks for fixing this in 14.11.8.

Chris