Ticket 1662 - Configuration problems
Summary: Configuration problems
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 14.11.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-05-11 05:34 MDT by Moe Jette
Modified: 2015-08-28 11:44 MDT

See Also:
Site: University of Auckland

Description Moe Jette 2015-05-11 05:34:37 MDT
This ticket is based upon configuration issues found in the investigation of bug 1660:

I see several odd things right away in the log file to investigate:
=================================
May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Registered job 13770315.0 on wrong node compute-c1-003 

Perhaps you could search for "13770315" in an earlier slurmctld log file. I'm wondering where Slurm expects to find the job. It would probably also be worthwhile to save the slurmd log(s) from the relevant nodes.
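
A minimal sketch of that search, assuming the log lines have already been read into a list of strings. The function name `find_job_lines` and the sample data are hypothetical and only illustrate the idea; the match is anchored so that a longer job ID containing the same digits is not picked up by accident.

```python
import re

def find_job_lines(lines, job_id):
    """Return the log lines that mention job_id, with or without a .step suffix."""
    # Lookarounds keep 13770315 from matching inside e.g. 137703150.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?:\.\d+)?(?!\d)")
    return [line for line in lines if pattern.search(line)]

# Sample lines: the first is taken from this ticket, the second is made up.
sample = [
    "May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Registered job 13770315.0 on wrong node compute-c1-003",
    "May  8 16:39:31 slurm-001-p slurmctld[6595]: some unrelated message about job 137703150",
]
matches = find_job_lines(sample, 13770315)
for line in matches:
    print(line)
```

For real log files, the same function can be fed `open(path).readlines()` for each slurmctld and slurmd log of interest.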
==============================================
The other thing of note is a bunch of messages of this sort for various nodes:
May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Node compute-a1-017 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Are you expecting different slurm.conf files on the different nodes? That is almost always a bad idea.
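
One way to check this outside of Slurm is to compare a digest of each node's slurm.conf. The sketch below assumes the file contents have already been collected into a dictionary keyed by node name (how they are gathered is left open), and the function name `nodes_differing_from` is hypothetical. It uses SHA-256 over the raw text, which is not necessarily the same check slurmctld performs, so whitespace or comment differences will also be flagged here.

```python
import hashlib

def digest(text):
    """SHA-256 hex digest of a configuration file's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def nodes_differing_from(configs, reference_node):
    """Return the sorted node names whose config text differs from the reference node's."""
    reference = digest(configs[reference_node])
    return sorted(
        node for node, text in configs.items()
        if node != reference_node and digest(text) != reference
    )

# Made-up file contents; only the host names come from this ticket.
configs = {
    "slurm-001-p": "ClusterName=example\nSlurmctldPort=6817\n",
    "compute-a1-017": "ClusterName=example\nSlurmctldPort=6818\n",
    "compute-c1-003": "ClusterName=example\nSlurmctldPort=6817\n",
}
different = nodes_differing_from(configs, "slurm-001-p")
print(different)
```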
========================================================
Then it appears there is some problem in your topology.conf file:

May  8 16:45:44 slurm-001-p slurmctld[9983]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology

If it's not clear to you what the issue is, you can send me your topology.conf and slurm.conf files so that I can advise.
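
The condition named in that warning can be checked offline with a small script. This is only a sketch and not Slurm's own implementation: it assumes topology.conf lines of the form `SwitchName=... Nodes=...` or `SwitchName=... Switches=...` with exactly that capitalization, and its hostlist expansion handles only simple expressions such as `tux[0-3,7]`. All function names and the sample topology are hypothetical.

```python
import re

def expand_hostlist(expr):
    """Expand a simple hostlist such as 'tux[08-10],login1' (one bracket group per name)."""
    names = []
    for part in re.split(r",(?![^\[]*\])", expr):  # split on commas outside brackets
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not m:
            if part:
                names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            hi = hi or lo
            width = len(lo)  # preserve zero padding
            for i in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{i:0{width}d}{suffix}")
    return names

def parse_topology(text):
    """Map each switch name to its directly attached nodes and child switches."""
    switches = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "SwitchName" in fields:
            switches[fields["SwitchName"]] = {
                "nodes": set(expand_hostlist(fields.get("Nodes", ""))),
                "children": expand_hostlist(fields.get("Switches", "")),
            }
    return switches

def reachable_nodes(switches, name, seen=None):
    """Collect every node reachable from a switch through its descendants."""
    seen = set() if seen is None else seen
    if name in seen or name not in switches:
        return set()
    seen.add(name)
    nodes = set(switches[name]["nodes"])
    for child in switches[name]["children"]:
        nodes |= reachable_nodes(switches, child, seen)
    return nodes

def switches_reaching_all(switches):
    """Return the switches whose descendants cover every node in the file."""
    all_nodes = set().union(*(s["nodes"] for s in switches.values()))
    return sorted(n for n in switches if reachable_nodes(switches, n) == all_nodes)

leaves = "SwitchName=s0 Nodes=tux[0-3]\nSwitchName=s1 Nodes=tux[4-7]\n"
with_top = switches_reaching_all(parse_topology(leaves + "SwitchName=top Switches=s[0-1]\n"))
without_top = switches_reaching_all(parse_topology(leaves))
print(with_top, without_top)
```

An empty result, as in the second case, corresponds to the situation the warning describes: no single switch sits above all of the nodes.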
==================================
This might indicate a really old version of Slurm on node compute-c1-058:
error: _slurm_rpc_node_registration node=compute-c1-058: Invalid argument
Comment 1 Moe Jette 2015-08-28 05:49:38 MDT
We need additional information to pursue this, and we don't have it. I suspect the customer made the specified configuration changes and that fixed the problems. Please reopen this ticket with the requested information if desired.
Comment 2 Gene Soudlenkov 2015-08-28 11:44:47 MDT
Hi, Moe

It seems to be OK now. We won't have definite confirmation until sometime later, but for now I would keep the status as resolved. Thanks for your work, guys!

Cheers,
Gene Soudlenkov