We are seeing intermittent failures from logrotate for Slurm across our cluster.

The message from anacron looks like this:

/etc/cron.daily/logrotate:

slurm_reconfigure error: Operation now in progress
error: error running shared postrotate script for '/var/log/slurmd /var/log/slurmctld '

It appears to be caused when the logrotate script runs the following command:

/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null

Is there something we can do to avoid this? We are running the following rpm, which came from Bright Cluster Manager:

slurm-client-16.05.2-406_cm7.3.x86_64
(In reply to George Garrett from comment #0)
> We are seeing intermittent failures from logrotate on the slurm across our
> cluster.
>
> The message from anacron looks like this:
> /etc/cron.daily/logrotate:
>
> slurm_reconfigure error: Operation now in progress
> error: error running shared postrotate script for '/var/log/slurmd
> /var/log/slurmctld '
>
> It appears to be caused when the logrotate script runs the following command:
> /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
>
> Is there something we can do to avoid this? We are running the following rpm:
> slurm-client-16.05.2-406_cm7.3.x86_64
> which came from Bright Cluster Manager.

Can you attach the full logrotate script?

I wouldn't suggest running 'scontrol reconfig' during logrotate - this reconfigures the _entire cluster_, and having that run almost simultaneously on all the nodes is presumably what caused that message.

All that the postrotate command should need to do is send a SIGHUP to the appropriate daemon (slurmctld / slurmd / slurmdbd) for the log file being rotated.

If that's the configuration that Bright ships with their installation, I'd suggest filing a bug report with them.
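For illustration, a minimal sketch of the SIGHUP-based postrotate Tim describes. The PID file path here is an assumption (the default set by SlurmctldPidFile in slurm.conf); a Bright installation may place it elsewhere, so verify before using:

```shell
# Hypothetical postrotate body: HUP only the local slurmctld so it reopens
# its log file, rather than reconfiguring the whole cluster.
# /var/run/slurmctld.pid is an assumed path - check SlurmctldPidFile
# (and SlurmdPidFile for slurmd) in your slurm.conf.
postrotate
    [ -f /var/run/slurmctld.pid ] && kill -HUP "$(cat /var/run/slurmctld.pid)"
endscript
```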
George -

Anything else I can help address on this? I'm hoping that based on my last response you've been able to change this in your environment; the messages you're seeing are what I'd expect if you're reconfiguring the entire cluster from every node simultaneously.

If there's nothing else, I'll go ahead and mark this as resolved.

- Tim
Tim,

Below is the logrotate script which you requested. We have contacted Bright regarding this configuration and will see what their thoughts are.

$ cat /etc/logrotate.d/slurm
/var/log/slurmd /var/log/slurmctld {
    compress
    missingok
    nocopytruncate
    delaycompress
    nomail
    notifempty
    noolddir
    dateext
    rotate 5
    sharedscripts
    size=10M
    create 640 slurm root
    postrotate
        /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
    endscript
}
(In reply to George Garrett from comment #3)
> Tim,
>
> Below is the logrotate script which you requested. We have contacted Bright
> regarding this configuration and will see what their thoughts are.

Thank you. That is certainly not how we'd recommend rotating the logs.

As previously mentioned, 'scontrol reconfigure' causes the entire cluster to reconfigure. Having this in the logrotate script on each individual node causes the entire cluster to reconfigure once per node, all at roughly the same time. While this shouldn't cause problems, it's certainly not recommended.

The postrotate script should either send SIGHUP directly to the daemon (the PID files can be used for this), or use the init scripts or systemd service files to accomplish that.

Feel free to reference this bug in your discussion with Bright if you haven't already. I will go ahead and mark this as resolved/infogiven; feel free to reopen this if there are any further issues.

- Tim
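As a sketch of the systemd variant mentioned above, the postrotate block could ask systemd to signal whichever Slurm daemons run locally. The unit names slurmctld and slurmd are assumptions based on common Slurm packaging; Bright's units may be named differently:

```shell
# Hypothetical replacement for the postrotate block: HUP the locally running
# Slurm daemons via systemd instead of calling 'scontrol reconfig'.
# Unit names are assumed - verify with 'systemctl list-units | grep slurm'.
postrotate
    for unit in slurmctld slurmd; do
        systemctl is-active --quiet "$unit" && \
            systemctl kill --signal=SIGHUP "$unit"
    done
endscript
```

Because the script uses sharedscripts, this runs once per rotation pass per node, only signaling that node's own daemons.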