Bug 3402

Summary: Intermittent logrotate failures with slurm
Product: Slurm
Reporter: George Garrett <gsg8>
Component: slurmd
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 16.05.2   
Hardware: Linux   
OS: Linux   
Site: Columbia University

Description George Garrett 2017-01-13 08:32:43 MST
We are seeing intermittent failures from logrotate for Slurm across our cluster.

The message from anacron looks like this:
/etc/cron.daily/logrotate:

slurm_reconfigure error: Operation now in progress
error: error running shared postrotate script for '/var/log/slurmd /var/log/slurmctld '

It appears to be caused when the logrotate script runs the following command:
  /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null

Is there something we can do to avoid this? We are running the following rpm:
slurm-client-16.05.2-406_cm7.3.x86_64
which came from Bright Cluster Manager.
Comment 1 Tim Wickberg 2017-01-13 09:01:22 MST
(In reply to George Garrett from comment #0)
> We are seeing intermittent failures from logrotate for Slurm across our
> cluster.
> 
> The message from anacron looks like this:
> /etc/cron.daily/logrotate:
> 
> slurm_reconfigure error: Operation now in progress
> error: error running shared postrotate script for '/var/log/slurmd
> /var/log/slurmctld '
> 
> It appears to be caused when the logrotate script runs the following command:
>   /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
> 
> Is there something we can do to avoid this? We are running the following rpm:
> slurm-client-16.05.2-406_cm7.3.x86_64
> which came from Bright Cluster Manager.

Can you attach the full logrotate script?

I wouldn't suggest running 'scontrol reconfig' during logrotate - this reconfigures the _entire cluster_, and having that run almost simultaneously on all the nodes is what presumably caused that message.

All that the postrotate command should need to do is send a SIGHUP to the appropriate daemon (slurmctld / slurmd / slurmdbd) for the log file being rotated.
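
As a rough sketch of that idea, the logrotate configuration could use one stanza per log file, so that each postrotate script signals only the daemon that owns that log. The PID file paths below are assumptions (they match the documented defaults for SlurmdPidFile and SlurmctldPidFile, but the values in your slurm.conf may differ), and the rotation directives are illustrative only, not a vendor-supplied configuration.

```
# Sketch only: one stanza per daemon, each signalling just its own process.
/var/log/slurmd {
    missingok
    notifempty
    compress
    delaycompress
    rotate 5
    postrotate
        # Assumed PID file path; check SlurmdPidFile in slurm.conf.
        [ -r /var/run/slurmd.pid ] && kill -HUP "$(cat /var/run/slurmd.pid)" || true
    endscript
}

/var/log/slurmctld {
    missingok
    notifempty
    compress
    delaycompress
    rotate 5
    postrotate
        # Assumed PID file path; check SlurmctldPidFile in slurm.conf.
        [ -r /var/run/slurmctld.pid ] && kill -HUP "$(cat /var/run/slurmctld.pid)" || true
    endscript
}
```

With this layout, a node that runs only slurmd never touches slurmctld, and no node asks the controller to reconfigure the whole cluster.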

If that's the configuration that Bright ships with their installation I'd suggest filing a bug report with them.
Comment 2 Tim Wickberg 2017-01-19 19:35:32 MST
George -

Anything else I can help address on this? I'm hoping that based on my last response you've been able to change this in your environment; the messages you're seeing are what I'd expect if you're reconfiguring the entire cluster from every node simultaneously.

If there's nothing else, I'll go ahead and mark this as resolved.

- Tim
Comment 3 George Garrett 2017-02-01 10:48:53 MST
Tim,

Below is the logrotate script which you requested. We have contacted Bright regarding this configuration and will see what their thoughts are.

$ cat /etc/logrotate.d/slurm
/var/log/slurmd /var/log/slurmctld {
 compress
 missingok
 nocopytruncate
 delaycompress
 nomail
 notifempty
 noolddir
 dateext
 rotate 5
 sharedscripts
 size=10M
 create 640 slurm root
 postrotate
 /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
 endscript
}
Comment 4 Tim Wickberg 2017-02-01 15:39:28 MST
(In reply to George Garrett from comment #3)
> Tim,
> 
> Below is the logrotate script which you requested. We have contacted Bright
> regarding this configuration and will see what their thoughts are.

Thank you. That is certainly not how we'd recommend rotating the logs. As previously mentioned, 'scontrol reconfigure' causes the entire cluster to reconfigure. Having this in the logrotate script on each individual node causes the entire cluster to be reconfigured once per node, all at roughly the same time. While this shouldn't cause problems, it's certainly not recommended.

The postrotate script should either send SIGHUP directly to the daemon (the PID files can be used for this), or use the init scripts or systemd service files to accomplish that.
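
The PID-file approach could be sketched as a small shell helper like the one below. The function name hup_from_pidfile is hypothetical, and treating a missing PID file or a stale PID as "nothing to do" is a design assumption made here so that logrotate does not report an error on nodes where a given daemon is not running.

```shell
#!/bin/sh
# hup_from_pidfile (hypothetical helper): send SIGHUP to the process whose
# PID is stored in the given PID file.
# Returns 0 if the signal was sent or there was nothing to signal,
# and 1 if the PID file exists but does not contain a numeric PID.
hup_from_pidfile() {
    pidfile="$1"

    # No readable PID file: assume the daemon is not running on this node.
    [ -r "$pidfile" ] || return 0

    pid=$(cat "$pidfile")

    # Refuse to signal anything that is not a plain number.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # A stale PID (process already gone) is treated as nothing to do.
    kill -HUP "$pid" 2>/dev/null || return 0
    return 0
}
```

In a postrotate script this might be called as `hup_from_pidfile /var/run/slurmd.pid`, where the path is again an assumption that should be checked against SlurmdPidFile in slurm.conf.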

Feel free to reference this bug in your discussion with Bright if you haven't already.

I will go ahead and mark this as resolved/infogiven; feel free to reopen this if there are any further issues.

- Tim