Bug 9330

Summary: Some scripts using Perl API (e.g. qstat) hang reading config in configless environments
Product: Slurm
Component: Configuration
Reporter: Troy Baer <troy>
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: Ole.H.Nielsen, tdockendorf
Version: 20.02.2
Hardware: Linux
OS: Linux
Site: Ohio State OSC
Linux Distro: RHEL
Version Fixed: 20.02.6 20.11pre1
Attachments: v2

Description Troy Baer 2020-07-02 12:05:52 MDT
In the course of testing the TORQUE compatibility scripts, I've run into what appears to be a bug in the Perl API in configless environments:

troy@pitzer-login04:~$ qstat -f 232
perl: error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
perl: error: ClusterName needs to be specified
perl: error: Unable to establish controller machine
Problem loading jobs.

This host doesn't have /etc/slurm/slurm.conf because it is a configless host; its slurm.conf is in /var/run/slurm/conf.  We can work around this with a symlink, of course, but this really seems like a bug to me.
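
To make the failure mode concrete, here is a minimal sketch of a lookup that would succeed on a configless host. It is not Slurm's actual code: the function name `find_slurm_conf` and the search order are assumptions made for illustration. Only the two paths and the SLURM_CONF environment variable come from this report and from Slurm itself.

```python
import os
from typing import Iterable, Mapping, Optional

# Both paths are taken from this report. The order in which they are
# tried is an assumption of this sketch, not documented Slurm behavior.
DEFAULT_CONF = "/etc/slurm/slurm.conf"
CONFIGLESS_CACHE = "/var/run/slurm/conf/slurm.conf"


def find_slurm_conf(
    env: Mapping[str, str] = os.environ,
    candidates: Iterable[str] = (DEFAULT_CONF, CONFIGLESS_CACHE),
) -> Optional[str]:
    """Return the first usable slurm.conf path, or None if none exists.

    Hypothetical helper: an explicit SLURM_CONF setting wins, then each
    candidate path is tried in order.
    """
    override = env.get("SLURM_CONF")
    if override:
        # An explicit override is returned as-is, even if the file is missing.
        return override
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_slurm_conf() or "no slurm.conf found")
```

The error output above is consistent with a lookup that stops at /etc/slurm/slurm.conf and never tries the configless location, which is why a symlink works as a workaround.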
Comment 4 Marcin Stolarek 2020-07-06 03:04:16 MDT
Troy,

I reproduced the issue and prepared a patch that I'm passing to our QA now.

The patch changes code under contribs: the perlapi and the torque qstat and qalter Perl scripts. I can share it with you before the review if you want to give it a try, with the understanding that it is not yet scheduled for release.

cheers,
Marcin
Comment 5 Troy Baer 2020-07-06 07:23:16 MDT
Thanks.  We are not yet in production with Slurm and in any case this isn't a critical bug, so I'm OK with waiting on QA.
Comment 8 Troy Baer 2020-07-23 12:15:10 MDT
Any updates on this?
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2020-09-09 04:08:31 MDT
We changed to Configless Slurm 20.02.4 yesterday, and now a number of users are complaining that the qstat command (from the slurm-torque RPM for CentOS 7) has stopped working, as was described by Troy.

We would appreciate it if Marcin's patch could be included in the upcoming 20.02.5 release!

Thanks,
Ole
Comment 10 Marcin Stolarek 2020-09-11 10:49:45 MDT
*** Bug 9804 has been marked as a duplicate of this bug. ***
Comment 11 Marcin Stolarek 2020-09-11 10:51:36 MDT
Ole,

Do you want to apply the patch locally - before QA completion?

cheers,
Marcin
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2020-09-11 11:41:20 MDT
(In reply to Marcin Stolarek from comment #11)
> Ole,
> 
> Do you want to apply the patch locally - before QA completion?

The easy workaround is to restore the /etc/slurm/ directory and thus avoid the Configless configuration for the time being.

To test the mentioned patch, how does one go about it?  

Thanks,
Ole
Comment 13 Trey Dockendorf 2020-09-11 11:44:42 MDT
OSC would be able to test any patches as we have test systems available we can utilize.
Comment 14 Marcin Stolarek 2020-09-11 13:10:30 MDT
Comment on attachment 14990 [details]
v2

Making the patch public. If you can run it and verify it in your environment, feedback is always appreciated.
As mentioned before, the patch didn't pass SchedMD QA and is not yet scheduled for release.

cheers,
Marcin
Comment 15 Troy Baer 2020-09-11 13:57:16 MDT
> As mentioned before, the patch didn't pass SchedMD QA and is not yet scheduled for release.

Can you elaborate on why the patch didn't pass QA?
Comment 16 Trey Dockendorf 2020-09-11 13:58:36 MDT
I applied the patch to our test environment and verified that the earlier workarounds that let Torque wrappers such as qstat run on a configless host are no longer needed.  The first time I ran qstat there was a slight delay of about 3-4 seconds with only 1 job in the queues, but subsequent executions seemed fine.
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2020-09-16 12:26:14 MDT
I tried to make our login nodes configless (bug 9832) and removed the /etc/slurm directory, but immediately I got user complaints that the qstat command is broken as reported above.  For some reason the users must use qstat in their automated scripts.

It would be really great if priority could be given to getting the patch in this bug report included in the next Slurm release.

Thanks a lot,
Ole
Comment 19 Marcin Stolarek 2020-10-20 07:28:31 MDT
Trey,
Ole,

The issue should be fixed in Slurm 20.02.6 by the following commits:
>commit c888ee827d179f9e54c09c4be8f282cf886e2c11
>Author:     Marcin Stolarek <cinek@schedmd.com>
>AuthorDate: Mon Jul 13 11:51:38 2020 +0000 
> 
>    Perl API - call slurm_conf_init(NULL) before any API calls
>     
>    Add slurm_conf_init() in BOOT: section.
>     
>    Bug 9330.
> 
>commit 34163061104cc2dec4d5e4371359d9af2fb38afb
>Author:     Marcin Stolarek <cinek@schedmd.com>
>AuthorDate: Fri Jul 3 14:08:53 2020 +0000
>
>    Perl API - use slurm_conf_init() not slurm_conf_reinit in Slurm::new()
>     
>    slurm_conf_reinit() ends with call to _init_slurm_conf(), which should only
>    be used internaly. External tools should call slurm_conf_init to
>    correctly establish configuration source.
>     
>    Bug 9330.

that were merged into our public repository.

cheers,
Marcin
Comment 20 Ole.H.Nielsen@fysik.dtu.dk 2020-10-20 23:56:20 MDT
Hi Marcin,

Thanks very much for the patch!  I'm looking forward to 20.02.6!

Best regards,
Ole

