While working this ticket https://bugs.schedmd.com/show_bug.cgi?id=10613 I found that the logs going to syslog are incomplete (they do not include slurmstepd info). If a log file is not specified, all log traffic should go to syslog; essentially, anything that would have ended up in the log file should be there. Thanks. -Paul Edmon-
Hello Paul,

My local testing shows that the stepd logs reach /var/log/syslog, albeit with different labels prepended to the log line. In your case, I believe that Slurm actually is logging everything to /var/log/syslog, including stepd logs, but that journalctl is somehow filtering this. Can you attach your syslog covering the time period when you were running the jobs and getting the output from journalctl, so we can verify whether this is correct?

Thanks,
-Michael
Created attachment 17479 [details]
syslog-ng.conf

If I am reading this correctly, everything should end up in /var/log/messages. This is attached just for some context.
As an experiment I added:

### Logging
SlurmdSyslogDebug=verbose
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log

for holyitc03. So it should be logging to three locations: journalctl, /var/log/messages (per the syslog conf I sent), and /var/log/slurmd.log. This is the result for a test job:

[root@holyitc03 slurm]# tail -f /var/log/messages
Jan 14 14:46:54 holyitc03 slurmd[9039]: Slurmd shutdown completing
Jan 14 14:46:55 holyitc03 slurmd[13572]: Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data
Jan 14 14:46:55 holyitc03 slurmd[13572]: topology/tree: init: topology tree plugin loaded
Jan 14 14:46:55 holyitc03 slurmd[13572]: route/topology: init: route topology plugin loaded
Jan 14 14:46:55 holyitc03 slurmd[13572]: Resource spec: system cgroup memory limit set to 4096 MB
Jan 14 14:46:55 holyitc03 slurmd[13572]: task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffffff
Jan 14 14:46:55 holyitc03 slurmd[13572]: cred/munge: init: Munge credential signature plugin loaded
Jan 14 14:46:55 holyitc03 slurmd[13572]: slurmd version 20.11.2 started
Jan 14 14:46:55 holyitc03 slurmd[13572]: slurmd started on Thu, 14 Jan 2021 14:46:55 -0500
Jan 14 14:46:57 holyitc03 slurmd[13572]: CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=257563 TmpDisk=280307 Uptime=225811 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Jan 14 14:47:13 holyitc03 dbus[2713]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jan 14 14:47:13 holyitc03 dbus[2713]: [system] Successfully activated service 'org.freedesktop.problems'
Jan 14 14:48:06 holyitc03 slurmd[13572]: launch task StepId=67112303.0 request from UID:0 GID:0 HOST:10.31.160.220 PORT:33050
Jan 14 14:48:06 holyitc03 slurmd[13572]: task/affinity: lllp_distribution: JobId=67112303 implicit auto binding: cores,one_thread, dist 8192
Jan 14 14:48:06 holyitc03 slurmd[13572]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 14 14:48:06 holyitc03 slurmd[13572]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [67112303]: mask_cpu,one_thread, 0x000000001
Jan 14 14:48:12 holyitc03 slurm_epilog[13900]: Starting epilog for job 67112303 by root...
Jan 14 14:48:12 holyitc03 slurm_epilog[13904]: Epilog for job 67112303 by root complete.

[root@holyitc03 ~]# journalctl -f
-- Logs begin at Mon 2021-01-11 19:03:31 EST. --
Jan 14 14:46:55 holyitc03.rc.fas.harvard.edu slurmd[13572]: cred/munge: init: Munge credential signature plugin loaded
Jan 14 14:46:55 holyitc03.rc.fas.harvard.edu slurmd[13572]: slurmd version 20.11.2 started
Jan 14 14:46:55 holyitc03.rc.fas.harvard.edu slurmd[13572]: slurmd started on Thu, 14 Jan 2021 14:46:55 -0500
Jan 14 14:46:57 holyitc03.rc.fas.harvard.edu slurmd[13572]: CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=257563 TmpDisk=280307 Uptime=225811 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Jan 14 14:47:12 holyitc03.rc.fas.harvard.edu sshd[13669]: Accepted publickey for root from 10.255.12.7 port 61748 ssh2: RSA SHA256:WFY1caOp1a7bZ2YT/gRyBrNwtuv6uvqwg05bUK8QZ7M
Jan 14 14:47:12 holyitc03.rc.fas.harvard.edu sshd[13669]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 14 14:47:13 holyitc03.rc.fas.harvard.edu dbus[2713]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jan 14 14:47:13 holyitc03.rc.fas.harvard.edu dbus[2713]: [system] Successfully activated service 'org.freedesktop.problems'
Jan 14 14:47:31 holyitc03.rc.fas.harvard.edu sshd[13748]: Accepted publickey for root from 10.255.12.7 port 61754 ssh2: RSA SHA256:WFY1caOp1a7bZ2YT/gRyBrNwtuv6uvqwg05bUK8QZ7M
Jan 14 14:47:31 holyitc03.rc.fas.harvard.edu sshd[13748]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 14 14:48:06 holyitc03.rc.fas.harvard.edu slurmd[13572]: launch task StepId=67112303.0 request from UID:0 GID:0 HOST:10.31.160.220 PORT:33050
Jan 14 14:48:06 holyitc03.rc.fas.harvard.edu slurmd[13572]: task/affinity: lllp_distribution: JobId=67112303 implicit auto binding: cores,one_thread, dist 8192
Jan 14 14:48:06 holyitc03.rc.fas.harvard.edu slurmd[13572]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 14 14:48:06 holyitc03.rc.fas.harvard.edu slurmd[13572]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [67112303]: mask_cpu,one_thread, 0x000000001
Jan 14 14:48:06 holyitc03.rc.fas.harvard.edu [13847]: [67112303.0]: pam_unix(slurm:session): session opened for user root by (uid=0)
Jan 14 14:48:12 holyitc03.rc.fas.harvard.edu [13847]: [67112303.0]: pam_unix(slurm:session): session closed for user root
Jan 14 14:48:12 holyitc03.rc.fas.harvard.edu slurm_epilog[13900]: Starting epilog for job 67112303 by root...
Jan 14 14:48:12 holyitc03.rc.fas.harvard.edu slurm_epilog[13904]: Epilog for job 67112303 by root complete.

[root@holyitc03 ~]# tail -f /var/log/slurmd.log
[2021-01-14T14:46:55.007] Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data
[2021-01-14T14:46:55.008] topology/tree: init: topology tree plugin loaded
[2021-01-14T14:46:55.010] route/topology: init: route topology plugin loaded
[2021-01-14T14:46:55.014] Resource spec: system cgroup memory limit set to 4096 MB
[2021-01-14T14:46:55.014] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffffff
[2021-01-14T14:46:55.015] cred/munge: init: Munge credential signature plugin loaded
[2021-01-14T14:46:55.015] slurmd version 20.11.2 started
[2021-01-14T14:46:55.016] slurmd started on Thu, 14 Jan 2021 14:46:55 -0500
[2021-01-14T14:46:57.290] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=257563 TmpDisk=280307 Uptime=225811 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-01-14T14:48:06.665] [67112303.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffffff
[2021-01-14T14:48:06.673] [67112303.extern] cred/munge: init: Munge credential signature plugin loaded
[2021-01-14T14:48:06.698] [67112303.extern] task/cgroup: _memcg_initialize: /slurm/uid_0/job_67112303: alloc=100MB mem.limit=100MB memsw.limit=200MB
[2021-01-14T14:48:06.698] [67112303.extern] task/cgroup: _memcg_initialize: /slurm/uid_0/job_67112303/step_extern: alloc=100MB mem.limit=100MB memsw.limit=200MB
[2021-01-14T14:48:06.743] launch task StepId=67112303.0 request from UID:0 GID:0 HOST:10.31.160.220 PORT:33050
[2021-01-14T14:48:06.744] task/affinity: lllp_distribution: JobId=67112303 implicit auto binding: cores,one_thread, dist 8192
[2021-01-14T14:48:06.744] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-01-14T14:48:06.745] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [67112303]: mask_cpu,one_thread, 0x000000001
[2021-01-14T14:48:06.764] [67112303.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffffff
[2021-01-14T14:48:06.765] [67112303.0] cred/munge: init: Munge credential signature plugin loaded
[2021-01-14T14:48:06.777] [67112303.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_67112303: alloc=100MB mem.limit=100MB memsw.limit=200MB
[2021-01-14T14:48:06.777] [67112303.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_67112303/step_0: alloc=100MB mem.limit=100MB memsw.limit=200MB
[2021-01-14T14:48:06.820] [67112303.0] in _window_manager
[2021-01-14T14:48:06.820] [67112303.0] debug level is 'error'
[2021-01-14T14:48:06.821] [67112303.0] starting 1 tasks
[2021-01-14T14:48:06.821] [67112303.0] task 0 (13854) started 2021-01-14T14:48:06
[2021-01-14T14:48:12.522] [67112303.0] task 0 (13854) exited with exit code 0.
[2021-01-14T14:48:12.555] [67112303.0] done with job
[2021-01-14T14:48:12.558] [67112303.extern] Sent signal 18 to StepId=67112303.extern
[2021-01-14T14:48:12.560] [67112303.extern] Sent signal 15 to StepId=67112303.extern
[2021-01-14T14:48:12.606] [67112303.extern] done with job

As you can see, only the slurmd log contains the full data.
journalctl and /var/log/messages are not complete. -Paul Edmon-
Paul,

(In reply to Paul Edmon from comment #5)
> As an experiment I added:
>
> ### Logging
> SlurmdSyslogDebug=verbose
> SlurmdDebug=verbose
> SlurmdLogFile=/var/log/slurmd.log
>
> for holyitc03. So it should be logging to three locations journalctl,
> /var/log/messages (per the syslog conf I sent) and /var/log/slurmd.log

There is a known issue we have decided not to fix where the stepd does not log to syslog while SlurmdLogFile is specified. So to truly test this, you will need to comment out SlurmdLogFile, reconfigure/restart Slurm, then check /var/log/messages to see if it has output similar to slurmd.log.

Can you provide more information about the Linux distribution you are using? Also, it appears you are using syslog-ng and journald. Can you explain how you have set up journald to pull Slurm logs from syslog-ng?

-Michael
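To make the suggested experiment concrete, this is roughly what the logging section of slurm.conf would look like with SlurmdLogFile commented out (a sketch based on the directives from Paul's earlier comment, not a complete config):

### Logging
SlurmdSyslogDebug=verbose
SlurmdDebug=verbose
#SlurmdLogFile=/var/log/slurmd.log

After changing this, reconfigure or restart slurmd and watch /var/log/messages for slurmstepd entries (the lines tagged with a StepId or [jobid.step] prefix).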
The slurmstepd information hasn't been showing up in our logs at all with SlurmdLogFile disabled, as that's the normal state we run in. I wasn't even aware that we were missing anything until the recent bug we hit, where I was asked to pull the slurmstepd info and was confused that I didn't see it.

We are running syslog-ng-3.5.6-3 on CentOS 7.8. We don't do any specific filtering for Slurm; the syslog-ng conf is as I sent. Unless syslog-ng does something different by default with the text string produced by slurmstepd (which does not conform to what slurmd or slurmctld produce), it should end up in journalctl and thus in /var/log/messages.

-Paul Edmon-
Can you attach your slurmd.service systemd service file? I'm wondering if the way you are running slurmd is complicating the issue (foreground vs. daemonized).
[root@holy7c22501 system]# cat slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target
Paul, I am finally able to reproduce this.

It looks like a recent change to the slurmd.service file template in 20.11, from Type=forking to Type=simple and from running slurmd in the background to running it in the foreground (-D), is causing the stepd logs to be omitted from syslog.

Here's the commit where this changed:
https://github.com/SchedMD/slurm/commit/9249173482a8fe23d52b4f2fbf41482941fead77

We'll look into this. For now, though, my advice is to change your slurmd.service file back to what it would have been in 20.02 (i.e. before the commit above), as in my testing I was able to see stepd logs with that configuration. So just change `Type` back to `forking`, remove `-D` from the ExecStart command, and add `PIDFile=/var/run/slurmd.pid` back in.
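Putting Michael's three suggested changes together, the [Service] section of the workaround unit file would look roughly like this (a sketch only; the remaining lines stay as in the 20.11 unit file quoted earlier in this thread):

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
PIDFile=/var/run/slurmd.pid
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process

With Type=forking, systemd expects slurmd to daemonize itself (hence no -D) and uses PIDFile to track the main process.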
Thanks for figuring this out. I will look into changing the slurmd.service on our end and look forward to this getting patched. -Paul Edmon-
Paul, (In reply to Paul Edmon from comment #11) > Thanks for figuring this out. I will look into changing the > slurmd.service on our end and look forward to this getting patched. I believe this may have been fixed in 21.08.2 with commit https://github.com/SchedMD/slurm/commit/59b4866fa7.
*** Ticket 12249 has been marked as a duplicate of this ticket. ***
So I'm looking at the docs, and this is what they say:

SlurmdSyslogDebug
    The slurmd daemon will log events to the syslog file at the specified level of detail. If not set, the slurmd daemon will log to syslog at level fatal, unless there is no SlurmdLogFile and it is running in the background, in which case it will log to syslog at the level specified by SlurmdDebug (at fatal in the case that SlurmdDebug is set to quiet) or it is run in the foreground, when it will be set to quiet.

However, this is what the stepd actually does:

- If slurmd/stepd is in the foreground (-D), set stepd logging to syslog as quiet.
- If slurmd/stepd is daemonized, set stepd logging to syslog as quiet, but only if there is a regular logfile. Also set stepd logging to stderr as quiet.

Here's the stepd syslog logging code in question, as of 21.08.2:
https://github.com/SchedMD/slurm/blob/slurm-21-08-2-1/src/slurmd/slurmstepd/slurmstepd.c#L313-L325

This is in contrast to the slurmd syslog logging code, here:
https://github.com/SchedMD/slurm/blob/slurm-21-08-2-1/src/slurmd/slurmd/slurmd.c#L2135-L2147

The behavior of the stepd matches neither what the slurmd does nor what the SlurmdSyslogDebug docs say. Notably, if SlurmdSyslogDebug is set, I would expect the stepd to log to syslog regardless of whether there is a regular logfile, as the docs above indicate.
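As a purely illustrative model of the documented rule (a hypothetical Python helper with invented parameter names, not Slurm source; the real decision is C code in slurmstepd.c), this is how I read the SlurmdSyslogDebug paragraph above:

```python
def documented_stepd_syslog_level(syslog_debug, slurmd_debug,
                                  has_logfile, foreground):
    """Syslog level the stepd should use, per the SlurmdSyslogDebug docs.

    Illustrative sketch only: syslog_debug is None when SlurmdSyslogDebug
    is unset; slurmd_debug is the SlurmdDebug level; has_logfile means
    SlurmdLogFile is set; foreground means slurmd runs with -D.
    """
    if syslog_debug is not None:
        # Docs: if SlurmdSyslogDebug is set, log to syslog at that
        # level, regardless of whether SlurmdLogFile is set.
        return syslog_debug
    if not has_logfile and not foreground:
        # Unset, no SlurmdLogFile, daemonized: fall back to SlurmdDebug,
        # but never below fatal.
        return "fatal" if slurmd_debug == "quiet" else slurmd_debug
    if foreground:
        # Unset and running in the foreground: quiet on syslog.
        return "quiet"
    # Unset, daemonized, with a logfile: fatal only.
    return "fatal"

# Per the docs, SlurmdSyslogDebug=verbose should reach syslog even with
# -D and a logfile; the buggy stepd instead stays quiet in that case.
print(documented_stepd_syslog_level("verbose", "verbose", True, True))
```

Tracing the cases through this model makes the mismatch concrete: the documented rule returns "verbose" whenever SlurmdSyslogDebug=verbose, while the stepd code linked above forces quiet in the foreground and ties syslog output to the presence of a logfile.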
Created attachment 21900 [details]
v1 20.11

Paul (and David from bug 12249),

Here is a v1 for 20.11 that should fix the problem. Can you please try it out and see if it works as expected? Make sure to revert your changes to your slurmd.service file so that it uses -D again, to truly test whether this works.

Thanks,
-Michael
Actually we are on the cusp of upgrading to 21.08.2 on November 1st. So I'd rather have a patch for that version if you wouldn't mind.
Created attachment 21905 [details]
v1 21.08.2

Ok, here is v1 but rebased onto 21.08.2.
Thanks, I will try it out and let you know if I see any issues. -Paul Edmon-
Just tested this patch, and I'm now seeing slurmstepd processes in far more detail than I did before. So this is working. Thanks for the patch! -Paul Edmon-
Hey Paul, This has been fixed in 21.08.5 with the following commits: https://github.com/SchedMD/slurm/compare/0bc4e8dc7030...add21ce516. Thanks! -Michael