Ticket 861 - SLURM JobAcctGatherType=jobacct_gather/linux reflection procedure of sstat command.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.2
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Danny Auble
Reported: 2014-06-04 19:35 MDT by toru matsuoka
Modified: 2014-06-13 04:31 MDT
Site: CRAY


Description toru matsuoka 2014-06-04 19:35:49 MDT
Hello, SLURM Support Team!

This is a continuation of SLURM Bug #853.

A customer has asked us to confirm the following.

To add JobAcctGatherType=jobacct_gather/linux to the slurm.conf file
and have it take effect, a restart of the slurm daemons and of
slurmctld is required.

Can this be done without stopping running jobs?

For example, would the following procedure leave running jobs
unaffected?

■ Planned SLURM configuration procedure
  ========================================================

1. Add JobAcctGatherType=jobacct_gather/linux to the slurm.conf file.

2. Restart slurmctld (service slurmctld restart) on the management node.
   (Running jobs should not be affected at this point.)

3. Restart slurmd (service slurm restart) on idle nodes.
   (slurmd is restarted on the compute nodes one at a time, for all nodes.)

◎ Is it unnecessary to stop slurmdbd on the management node at this time?

◎ Is it necessary to stop slurmctld and slurmd at the same time?

If there are other points to consider or a better procedure, please let me know.

Best regards,
Toru Matsuoka
Comment 1 toru matsuoka 2014-06-06 11:38:20 MDT
Hello, SLURM Support!

I am waiting for a reply to this inquiry.

Best regards,
Toru Matsuoka
Comment 2 Moe Jette 2014-06-08 13:53:42 MDT
The JobAcctGatherType determines the plugin to be used to collect resource usage information about user jobs, which are run by the slurmd daemon (on the compute nodes) and the slurmstepd daemon (which starts the application, one is started for each srun command).

While the JobAcctGatherType configuration parameter is not used by the slurmctld daemon, we strongly recommend the same slurm.conf file be used on every node. If the slurmctld and slurmd daemons are running with different slurm.conf files, the slurmctld will report the error 
error: Node <name> appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
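
One way to check ahead of time that two copies of slurm.conf are identical is a byte-level checksum comparison. The following is a minimal sketch only: `conf_matches` is a hypothetical helper, not a Slurm command, and slurmctld performs its own internal check, so this is an approximation of it rather than the same test.

```shell
#!/bin/sh
# Sketch: report whether two copies of slurm.conf are byte-for-byte identical.
# conf_matches is a hypothetical helper name, not part of Slurm.
conf_matches() {
    # cksum (POSIX) prints "<checksum> <byte count>" when reading stdin,
    # so comparing the two outputs compares both content and size.
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

For example, after copying a compute node's file to a local path (both paths here are assumptions), `conf_matches /etc/slurm/slurm.conf /tmp/node01.slurm.conf && echo same` prints "same" only when the files match.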

The slurmdbd does not use the slurm.conf file and will not need to be restarted.

You can change the slurm.conf file and restart the daemons in any order without losing any running or pending jobs. However, the sstat program will fail until both the slurmd and slurmstepd are running with the desired JobAcctGatherType.
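
The configuration change itself can be sketched as a small shell function that sets the parameter idempotently. This is an illustration under stated assumptions: `set_jobacct_gather` is a hypothetical name, the match on the key is case-sensitive, and the restart commands are not part of the function because they depend on the site's init scripts.

```shell
#!/bin/sh
# Sketch: set (or add) JobAcctGatherType in a slurm.conf file.
# set_jobacct_gather is a hypothetical helper name, not part of Slurm.
set_jobacct_gather() {
    conf="$1"    # path to the slurm.conf file to edit
    value="$2"   # plugin name, e.g. jobacct_gather/linux
    if grep -q '^[[:space:]]*JobAcctGatherType=' "$conf"; then
        # Replace the existing setting in place; sed keeps a .bak copy.
        # "|" is used as the delimiter because the value contains "/".
        sed -i.bak "s|^[[:space:]]*JobAcctGatherType=.*|JobAcctGatherType=$value|" "$conf"
    else
        # No setting present yet, so append one.
        printf 'JobAcctGatherType=%s\n' "$value" >> "$conf"
    fi
}
```

A call such as `set_jobacct_gather /etc/slurm/slurm.conf jobacct_gather/linux` (the path is an assumption) would be followed by distributing the same file to every node and restarting the daemons.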

Slurm version 14.03 works better with JobAcctGatherType=jobacct_gather/none (the sstat command just returns zeros).
Comment 3 toru matsuoka 2014-06-08 21:30:39 MDT
Hello, Slurm Support Team!

> The JobAcctGatherType determines the plugin to be used to collect resource usage information about user jobs, which are run by the slurmd daemon (on the compute nodes) and the slurmstepd daemon (which starts the application, one is started for each srun command).

→ Thank you. I understood it.

> While the JobAcctGatherType configuration parameter is not used by the slurmctld daemon, we strongly recommend the same slurm.conf file be used on every node. If the slurmctld and slurmd daemons are running with different slurm.conf files, the slurmctld will report the error
> error: Node <name> appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

→ The customer can use the same slurm.conf file on every node, so this does not look like a problem.

> The slurmdbd does not use the slurm.conf file and will not need to be restarted.

→ Thank you. I understood it.

> You can change the slurm.conf file and restart the daemons in any order without losing any running or pending jobs, however the sstat program will fail until both the slurmd and slurmstepd are running with the desired JobAcctGatherType.

→ In that case, can the JobAcctGatherType parameter be set to JobAcctGatherType=jobacct_gather/linux?
  And are both the slurmd and slurmstepd always running?

> Slurm version 14.03 works better with JobAcctGatherType=jobacct_gather/none (the sstat command just returns zeros).

→ We cannot upgrade Slurm from 2.6.2 to 14.03.
  With Slurm version 2.6.2, are there any problems to be concerned about
  with the JobAcctGatherType parameter?


Best regards,
Toru Matsuoka
Comment 4 Moe Jette 2014-06-09 02:31:10 MDT
(In reply to toru matsuoka from comment #3)
> You can change the slurm.conf file and restart the daemons in any order
> without losing any running or pending jobs, however the sstat program will
> fail until both the slurmd and slurmstepd are running with the desired
> JobAcctGatherType.
> 
> → In that case, can the JobAcctGatherType parameter be set to
> JobAcctGatherType=jobacct_gather/linux?

If you want to collect accounting information about jobs on a Linux cluster, then JobAcctGatherType=jobacct_gather/linux must be set.

> And are both the slurmd and slurmstepd always running?

The slurmd should always be running on every compute node.
A slurmstepd is running whenever a job step is running on the compute node. There is one slurmstepd for each job step.
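
A rough way to observe this on a compute node is to count the daemon processes by name. The sketch below assumes the processes appear with the command names `slurmd` and `slurmstepd` in `ps` output; `count_daemon` is a hypothetical helper that reads one command name per line on stdin.

```shell
#!/bin/sh
# Sketch: count processes by exact command name.
# count_daemon is a hypothetical helper name, not part of Slurm.
# Input: one command name per line on stdin (e.g. from `ps -e -o comm=`).
count_daemon() {
    # -c prints the number of matching lines, -x requires a whole-line
    # match, -F treats the name as a fixed string. grep exits non-zero
    # when the count is 0, so "|| true" keeps the function's status clean.
    grep -c -x -F -- "$1" || true
}
```

Under those assumptions, `ps -e -o comm= | count_daemon slurmd` would be expected to print 1 on a healthy compute node, while the `slurmstepd` count would follow the number of job steps currently running there.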

> >Slurm version 14.03 works better with JobAcctGatherType=jobacct_gather/none (the sstat command just returns zeros).
> 
> → We cannot upgrade Slurm from 2.6.2 to 14.03.
>   With Slurm version 2.6.2, are there any problems to be concerned about
>   with the JobAcctGatherType parameter?

The only problem with Slurm version 2.6 is the sstat errors when the JobAcctGatherType value for sstat is different than what the slurmd is running with.

You should at least consider upgrading from version 2.6.2 to 2.6.8. Version 2.6.2 is known to contain about 100 bugs that were fixed in later releases of version 2.6. There will be no loss of jobs or command changes when upgrading, only bug fixes.

> Best Regards..
> Toru Matsuoka
Comment 5 toru matsuoka 2014-06-09 12:19:23 MDT
Hello, Slurm Support Team!

I understand the explanation.

I would like to plan an upgrade from Slurm 2.6.2 to 2.6.8.

If possible, please tell me a simple procedure for upgrading Slurm from 2.6.2 to 2.6.8.

Best regards,
Toru Matsuoka
Comment 6 Moe Jette 2014-06-09 15:28:38 MDT
(In reply to toru matsuoka from comment #5)
> Hello, Slurm Support Team!
> 
> I understand the explanation.
> 
> I would like to plan an upgrade from Slurm 2.6.2 to 2.6.8.
> 
> If possible, please tell me a simple procedure for upgrading Slurm from
> 2.6.2 to 2.6.8.

Slurm is upgraded the same way as any other Linux package. Just install the new RPMs and restart the daemons. There is some more information here:
http://slurm.schedmd.com/quickstart_admin.html#upgrade
Comment 7 toru matsuoka 2014-06-09 16:55:49 MDT
Hello, Slurm Support Team!

Thank you for your support.

I checked the URL.

I understand the Slurm upgrade procedure.

However, in the customer's situation, it will be difficult to
upgrade from 2.6.2 to 2.6.8 for the time being.

I would like to ask for your support for the case where
2.6.2 remains in use and sstat does not work.

Best regards,
Toru Matsuoka
Comment 8 Moe Jette 2014-06-09 17:23:57 MDT
(In reply to toru matsuoka from comment #7)
> Hello, Slurm Support Team!
> 
> Thank you for your support.
> 
> I checked the URL.
> 
> I understand the Slurm upgrade procedure.
> 
> However, in the customer's situation, it will be difficult to
> upgrade from 2.6.2 to 2.6.8 for the time being.
> 
> I would like to ask for your support for the case where
> 2.6.2 remains in use and sstat does not work.
> 
> Best regards,
> Toru Matsuoka

Then just change the configuration to collect accounting information. Set:
JobAcctGatherType=jobacct_gather/linux
Comment 9 Moe Jette 2014-06-13 04:29:26 MDT
Please open a new ticket if you need more information.
Comment 10 toru matsuoka 2014-06-13 04:31:34 MDT
Understood.
Please close this case.