Bug 9264 - SIGUSR2 signal has no impact on logging to job completion log
Summary: SIGUSR2 signal has no impact on logging to job completion log
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.3
Hardware: Linux (OS: Linux)
Importance: --- 4 - Minor Issue
Assignee: Oriol Vilarrubi
QA Contact:
URL:
Duplicates: 10528
Depends on:
Blocks:
 
Reported: 2020-06-22 10:33 MDT by Jim Long
Modified: 2022-03-31 15:57 MDT
CC: 4 users

See Also:
Site: NCSA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Jim Long 2020-06-22 10:33:40 MDT
This is similar to bug #4041, which has been open since 2017.

slurmctld man page says -

SIGUSR2
              Reread the log level from the configs, and then reopen  the  log
              file.  This should be used when setting up logrotate(8).


In addition, the Slurm Training Manual includes this paragraph -

Storing the accounting information into text files is very simple. Just configure the appropriate plugin (e.g. AccountingStorageType=accounting_storage/filetxt and/or JobCompType=jobcomp/filetxt) and then specify the pathname of the file (e.g. AccountingStorageLoc=/var/log/slurm/accounting and/or JobCompLoc=/var/log/slurm/job_completions). Use the logrotate or similar tool to prevent the log files from getting too large. Send a SIGUSR2 signal to the slurmctld daemon after moving the files, but before compressing them so that new log files will be created.


However, slurmctld continues to write job completion records to the original
file after logrotate runs and the SIGUSR2 signal has been received.

My slurm.conf includes this line -

[root@iforge148 log]# grep -i comp /usr/slurm/etc/slurm.conf |grep -v "^#"

JobCompLoc=/usr/slurm/log/job_completions
JobCompType=jobcomp/filetxt


My slurm logrotate configuration looks like this -

[root@iforgehn2 logrotate.d]# cat /etc/logrotate.d/slurm
##
# Slurm Logrotate Configuration
##
/usr/slurm/log/job_completions {
     compress
     missingok
     nocopytruncate
     delaycompress
     nomail
     ifempty
     noolddir
     rotate 99
     sharedscripts
     size=0M
     create 640 svcifslm svcifslm
     postrotate
          pkill -x --signal SIGUSR2 slurmctld
          pkill -x --signal SIGUSR2 slurmd
          pkill -x --signal SIGUSR2 slurmdbd
          exit 0
     endscript
}

I have a simple job script that looks like this -

[jlong@iforgehn2 slurm]$ cat /ui/ncsa/jlong/slurm/p

#!/bin/bash
date
echo $SLURM_JOB_ID

I run the job via srun -

[jlong@iforgehn2 slurm]$  srun -N 1 -n 1  /ui/ncsa/jlong/slurm/p
Mon Jun 22 11:15:40 CDT 2020
132


And the completed job record shows up in job_completions -

[root@iforgehn2 log]# ls -l /usr/slurm/log/job_completions*
-rw-r--r-- 1 svcifslm svcifslm 397 Jun 22 11:15 /usr/slurm/log/job_completions

[root@iforgehn2 log]# cat /usr/slurm/log/job_completions
JobId=132 UserId=jlong(8788) GroupId=aaa(1010) Name=p JobState=COMPLETED Partition=normal TimeLimit=10080 StartTime=2020-06-22T11:15:40 EndTime=2020-06-22T11:15:40 NodeList=iforge146 NodeCnt=1 ProcCnt=20 WorkDir=/ui/ncsa/jlong/slurm ReservationName= Gres= Account= QOS=normal WcKey= Cluster=unknown SubmitTime=2020-06-22T11:15:40 EligibleTime=2020-06-22T11:15:40 DerivedExitCode=0:0 ExitCode=0:0


Now I run logrotate -

[root@iforgehn2 log]# /usr/sbin/logrotate -s /var/lib/logrotate/logrotate.status /etc/logrotate.conf

And I can see that the SIGUSR2 signal is received by slurmctld -

Jun 22 11:19:01 iforgehn2 slurmctld[30639]: Logrotate signal (SIGUSR2) received

And I can see that the job completion log was rotated -

[root@iforgehn2 log]# ls -l /usr/slurm/log/job_completions*
-rw-r----- 1 svcifslm svcifslm   0 Jun 22 11:19 /usr/slurm/log/job_completions
-rw-r--r-- 1 svcifslm svcifslm 397 Jun 22 11:15 /usr/slurm/log/job_completions-20200622

But if I run another job, it continues to get logged to the original file
(job_completions-20200622 in this case).


[jlong@iforgehn2 slurm]$  srun -N 1 -n 1  /ui/ncsa/jlong/slurm/p
Mon Jun 22 11:23:27 CDT 2020
133

[root@iforgehn2 log]# ls -l /usr/slurm/log/job_completions*
-rw-r----- 1 svcifslm svcifslm   0 Jun 22 11:19 /usr/slurm/log/job_completions
-rw-r--r-- 1 svcifslm svcifslm 794 Jun 22 11:23 /usr/slurm/log/job_completions-20200622

[root@iforgehn2 log]# cat /usr/slurm/log/job_completions-20200622
JobId=132 UserId=jlong(8788) GroupId=aaa(1010) Name=p JobState=COMPLETED Partition=normal TimeLimit=10080 StartTime=2020-06-22T11:15:40 EndTime=2020-06-22T11:15:40 NodeList=iforge146 NodeCnt=1 ProcCnt=20 WorkDir=/ui/ncsa/jlong/slurm ReservationName= Gres= Account= QOS=normal WcKey= Cluster=unknown SubmitTime=2020-06-22T11:15:40 EligibleTime=2020-06-22T11:15:40 DerivedExitCode=0:0 ExitCode=0:0
JobId=133 UserId=jlong(8788) GroupId=aaa(1010) Name=p JobState=COMPLETED Partition=normal TimeLimit=10080 StartTime=2020-06-22T11:23:27 EndTime=2020-06-22T11:23:27 NodeList=iforge146 NodeCnt=1 ProcCnt=20 WorkDir=/ui/ncsa/jlong/slurm ReservationName= Gres= Account= QOS=normal WcKey= Cluster=unknown SubmitTime=2020-06-22T11:23:27 EligibleTime=2020-06-22T11:23:27 DerivedExitCode=0:0 ExitCode=0:0


The only way to get slurmctld to use the new file seems to be a hard reset.
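This behavior is consistent with standard POSIX file semantics: renaming a file does not change where an already-open file descriptor points, so the daemon keeps appending to the rotated file until it reopens the path. A minimal Python illustration of the mechanism (illustrative only, not Slurm code; the file names mirror the ones in this report):

```python
import os
import tempfile

# A rename does not affect an already-open file descriptor: the writer
# keeps appending to the same inode under its new name until it reopens
# the path by name.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "job_completions")

fh = open(path, "w")
fh.write("JobId=132 ...\n")
fh.flush()

os.rename(path, path + "-20200622")  # what logrotate's move does
open(path, "w").close()              # logrotate's "create" makes a new empty file

fh.write("JobId=133 ...\n")          # still lands in the renamed file
fh.flush()

print(os.path.getsize(path))         # 0: the new file stays empty
```

Until the daemon explicitly reopens the configured path, the newly created file never receives a record, which matches the zero-byte job_completions file shown above.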





I also notice that the slurm.conf man page recommends using "nodelaycompress"
in the logrotate configuration file, which could create a timing issue
between slurmctld and logrotate. Even if slurmctld actually switched to a
new log file as documented, a job that completed while the compression was
being performed, but before the SIGUSR2 was received, could cause problems.
"delaycompress" would be a safer option.
Comment 3 Jason Booth 2020-06-22 16:04:32 MDT
Thank you for reporting this. I have looked over the code and it seems this has never worked. I have a patch proposal to review with our team. I will let you know the outcome once we look over these code changes.
Comment 9 Chad Vizino 2020-09-11 10:53:03 MDT
Per another site's request to avoid confusion, we'd like to note that the issue in this ticket relates to storing accounting information into text files when that has been configured and slurmctld receives a SIGUSR2 (log rotation signal). There is not a problem with slurm daemons rotating their logs (slurmctld.log, slurmdbd.log and slurmd.log) when they receive a SIGUSR2.
Comment 10 Tim Wickberg 2020-12-31 00:51:20 MST
*** Bug 10528 has been marked as a duplicate of this bug. ***
Comment 28 Oriol Vilarrubi 2022-02-24 09:51:00 MST
Hi Jim,

We've fixed this in the master branch, so this will ship with Slurm 22.05.

I've also measured how long it takes to handle the SIGUSR2 signal sent by logrotate. For the controller, which is the daemon with potentially the most log files to handle, the worst case in my tests was 130 microseconds. I also ran a test that submitted jobs continuously from a script while performing logrotates at the same time, and no job information was lost.
I agree with you that delaycompress would be the safest route, but it also causes the disk space used by the logs to grow considerably. As noted in the documentation, this is just a sample logrotate configuration, and we encourage sites to modify it as needed.

Greetings.
Comment 29 Jason Booth 2022-03-31 15:57:56 MDT
Resolving as fixed.

commit 6355f4cebea36a1285bf59e30abd556521970282
Author: Alejandro Sanchez <alex@schedmd.com>
Date:   Wed Feb 23 17:05:33 2022 +0100

    jobcomp/elasticsearch - protect log_url access with location_mutex.
    
    Bug 9264

commit e2793bc14404ac5642c4e45665fb038b09db4849
Author: Alejandro Sanchez <alex@schedmd.com>
Date:   Wed Feb 23 16:58:48 2022 +0100

    jobcomp/elasticsearch - fix log_url potential memory leak.
    
    If jobcomp_p_set_location() is called more than once, log_url needs to
    be freed before xstrdup()'ing again to it.
    
    Bug 9264

commit a512ffc99a427630f02c65cbb3bcde9069aeb2fa
Author: Oriol Vilarrubi <jvilarru@schedmd.com>
Date:   Fri Feb 18 16:26:27 2022 +0100

    Make slurmctld call the JobCompPlugin set location operation on SIGUSR2.
    
    A relevant consequence is that the filetxt plugin reopens the file.
    
    Bug 9264

commit 9b92cb3f9b94d41723951dc41fee624a929d646e
Author: Oriol Vilarrubi <jvilarru@schedmd.com>
Date:   Tue Nov 16 11:57:22 2021 +0100

    common/slurm_jobcomp - add jobcomp_g_set_location() function.
    
    This will be useful to be able to perform the jobcomp_p_set_location()
    plugin operation independently of jobcomp_g_init().
    
    Bug 9264
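The essence of commit a512ffc above is that slurmctld now reopens the job completion file when SIGUSR2 arrives, so records written after a rotation land in the freshly created file. A minimal Python sketch of that reopen-on-signal pattern (illustrative only, not Slurm's actual code; the class and method names here are invented for the example):

```python
import os
import signal
import tempfile

class JobCompLogger:
    """Sketch of a job completion logger that reopens its file on SIGUSR2.

    Hypothetical example, not Slurm code: the handler reopens the
    configured path, so after logrotate renames the file and signals the
    process, subsequent records go to the newly created file.
    """

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")
        signal.signal(signal.SIGUSR2, self._on_sigusr2)

    def _on_sigusr2(self, signum, frame):
        # Reopen the configured path; after a rotation this opens (and
        # creates) the new file rather than the renamed one.
        self.fh.close()
        self.fh = open(self.path, "a")

    def write_record(self, record):
        self.fh.write(record + "\n")
        self.fh.flush()

# Demonstration: write, rotate, signal, write again.
tmp = tempfile.mkdtemp()
log_path = os.path.join(tmp, "job_completions")
logger = JobCompLogger(log_path)
logger.write_record("JobId=1")

os.rename(log_path, log_path + "-rotated")  # what logrotate does
os.kill(os.getpid(), signal.SIGUSR2)        # what postrotate sends
logger.write_record("JobId=2")              # now lands in the new file
```

Without the signal handler, the second record would keep going to the renamed file, which is exactly the behavior this ticket reported.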