We are seeing intermittent failures from logrotate for Slurm across our cluster.

The message from anacron looks like this:

/etc/cron.daily/logrotate:

slurm_reconfigure error: Operation now in progress
error: error running shared postrotate script for '/var/log/slurmd /var/log/slurmctld '

It appears to be caused when the logrotate script runs the following command:

/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null

Is there something we can do to avoid this? We are running the following rpm, which came from Bright Cluster Manager:

slurm-client-16.05.2-406_cm7.3.x86_64
(In reply to George Garrett from comment #0)
> We are seeing intermittent failures from logrotate on the slurm across our
> cluster.
>
> The message from anacron looks like this:
> /etc/cron.daily/logrotate:
>
> slurm_reconfigure error: Operation now in progress
> error: error running shared postrotate script for '/var/log/slurmd
> /var/log/slurmctld '
>
> It appears to be caused when the logrotate script runs the following command:
> /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
>
> Is there something we can do to avoid this? We are running the following rpm:
> slurm-client-16.05.2-406_cm7.3.x86_64
> which came from Bright Cluster Manager.

Can you attach the full logrotate script?

I wouldn't suggest running 'scontrol reconfig' during logrotate - this reconfigures the _entire cluster_, and having that run almost simultaneously on all the nodes is presumably what caused that message.

All that the postrotate command should need to do is send a SIGHUP to the appropriate daemon (slurmctld / slurmd / slurmdbd) for the log file being rotated.

If that's the configuration that Bright ships with their installation, I'd suggest filing a bug report with them.
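For illustration, a minimal sketch of the SIGHUP-based postrotate Tim describes. The PID file path here is an assumption (the default set by SlurmctldPidFile in slurm.conf); a Bright installation may place it elsewhere, so verify before using:

```shell
# Hypothetical postrotate body: HUP only the local slurmctld so it reopens
# its log file, rather than reconfiguring the whole cluster.
# /var/run/slurmctld.pid is an assumed path - check SlurmctldPidFile
# (and SlurmdPidFile for slurmd) in your slurm.conf.
postrotate
    [ -f /var/run/slurmctld.pid ] && kill -HUP "$(cat /var/run/slurmctld.pid)"
endscript
```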
George -

Anything else I can help address on this? I'm hoping that based on my last response you've been able to change this in your environment; the messages you're seeing are what I'd expect if you're reconfiguring the entire cluster from every node simultaneously.

If there's nothing else, I'll go ahead and mark this as resolved.

- Tim
Tim,

Below is the logrotate script which you requested. We have contacted Bright regarding this configuration and will see what their thoughts are.

$ cat /etc/logrotate.d/slurm
/var/log/slurmd /var/log/slurmctld {
    compress
    missingok
    nocopytruncate
    delaycompress
    nomail
    notifempty
    noolddir
    dateext
    rotate 5
    sharedscripts
    size=10M
    create 640 slurm root
    postrotate
        /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
    endscript
}
(In reply to George Garrett from comment #3)
> Tim,
>
> Below is the logrotate script which you requested. We have contacted Bright
> regarding this configuration and will see what their thoughts are.

Thank you. That is certainly not how we'd recommend rotating the logs.

As previously mentioned, 'scontrol reconfigure' causes the entire cluster to reconfigure. Having this in the logrotate script on each individual node causes the entire cluster to reconfigure once per node, all at roughly the same time. While this shouldn't cause problems, it's certainly not recommended.

The postrotate script should either send SIGHUP directly to the daemon (the PID files can be used for this), or use the init scripts or systemd service files to accomplish that.

Feel free to reference this bug in your discussion with Bright if you haven't already. I will go ahead and mark this as resolved/infogiven; feel free to reopen this if there are any further issues.

- Tim
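As a sketch of the systemd variant mentioned above, the postrotate block could ask systemd to signal whichever Slurm daemons run locally. The unit names slurmctld and slurmd are assumptions based on common Slurm packaging; Bright's units may be named differently:

```shell
# Hypothetical replacement for the postrotate block: HUP the locally running
# Slurm daemons via systemd instead of calling 'scontrol reconfig'.
# Unit names are assumed - verify with 'systemctl list-units | grep slurm'.
postrotate
    for unit in slurmctld slurmd; do
        systemctl is-active --quiet "$unit" && \
            systemctl kill --signal=SIGHUP "$unit"
    done
endscript
```

Because the script uses sharedscripts, this runs once per rotation pass per node, only signaling that node's own daemons.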