Ticket 6553

Summary: Changes to MaxArraySize require slurmctld restart
Product: Slurm Reporter: Jeff Frey <frey>
Component: Scheduling Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 17.11.8   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Jeff Frey 2019-02-21 08:03:00 MST
MaxArraySize was increased from 1000 to 10000 in slurm.conf, the new config file was distributed to all nodes, and "scontrol reconfigure" returned with no errors or warnings.  Thereafter, "scontrol show config" reported "MaxArraySize=10000".  However, users submitting array jobs with indices higher than 1000 (e.g. "0-2449%108") were still rejected with "Invalid job array specification" errors.

The problem is the _valid_array_inx() function.  The first time that function is called, it caches the value of MaxArraySize AT THAT TIME in the static global "max_array_size" (declared in slurmctld/job_mgr.c).  In this case, that was the original value of 1000.  Later in _valid_array_inx(), some code checks the slurmctld config timestamp and, if the config is newer, pulls the new value of MaxArraySize into the static local variable "max_task_cnt".  The function nevertheless continues to use "max_array_size" to allocate and index-check the array specification, effectively ignoring the new value of MaxArraySize.  The error cited by our users stems from


		valid = _parse_array_tok(tok, job_desc->array_bitmap,
					 max_array_size);


for which the third argument continues to be the originally-configured MaxArraySize (1000) despite the successful "scontrol reconfigure" etc.
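
To make the pattern concrete, here is a minimal stand-alone model of what I believe is happening.  It is not the slurmctld source: "struct demo_conf", "DEMO_UNSET" and both "demo_valid_index_*" functions are hypothetical names invented for illustration, and only the global "max_array_size" mirrors a name from job_mgr.c.  The first function latches the limit on its first call; the second shows one possible way to refresh it when the config timestamp changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sentinel meaning "limit not cached yet". */
#define DEMO_UNSET UINT32_MAX

/* Hypothetical stand-in for the controller's in-memory configuration. */
struct demo_conf {
	uint32_t max_array_sz;	/* models MaxArraySize */
	time_t last_update;	/* changes on every (re)configure */
};

/* File-scope cache: filled on first use and never refreshed. */
static uint32_t max_array_size = DEMO_UNSET;

/* Stale variant: the limit seen on the first call is used forever. */
static bool demo_valid_index_stale(const struct demo_conf *conf,
				   uint32_t max_index)
{
	assert(conf != NULL);
	if (max_array_size == DEMO_UNSET)
		max_array_size = conf->max_array_sz;
	/* Highest legal index is one less than the limit. */
	return max_index < max_array_size;
}

/* Refreshing variant: re-read the limit when the timestamp changes. */
static bool demo_valid_index_fresh(const struct demo_conf *conf,
				   uint32_t max_index)
{
	static time_t seen_update = 0;
	static uint32_t cached_limit = DEMO_UNSET;

	assert(conf != NULL);
	if (cached_limit == DEMO_UNSET || seen_update != conf->last_update) {
		cached_limit = conf->max_array_sz;
		seen_update = conf->last_update;
	}
	return max_index < cached_limit;
}
```

With the modeled limit raised from 1000 to 10000 and the timestamp changed, the stale variant still rejects index 2449 while the refreshing variant accepts it, which matches the behavior our users saw after the reconfigure.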


Other code inside slurmctld seems to be doing the same thing -- caching the initial value of MaxArraySize and ignoring updates.  E.g. _xlate_array_dep() in slurmctld/job_scheduler.c.  This implies that, unless all such functions are rewritten, "scontrol reconfigure" should at least warn that the configuration change requires a restart of slurmctld.
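
As a rough sketch of the kind of warning I have in mind: the helper below compares the old and new limits and formats a message when they differ.  "demo_reconfig_warning" is a hypothetical function, not an existing slurmctld routine, and how the old and new values would actually be obtained during a reconfigure is left as an assumption.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a warning into buf if the limit changed.
 * Returns 0 (and an empty string) when nothing changed, otherwise the
 * snprintf() result for the formatted warning.
 */
static int demo_reconfig_warning(uint32_t old_max, uint32_t new_max,
				 char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (old_max == new_max) {
		buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len,
			"MaxArraySize changed from %" PRIu32 " to %" PRIu32
			"; restart slurmctld for the change to take effect",
			old_max, new_max);
}
```

Something along these lines, emitted where the reconfigure request is handled, would have told us right away that the new value was not yet in effect.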
Comment 1 Jeff Frey 2019-02-21 08:03:44 MST
I should also mention that the code has not changed at the head of the source tree in git, so this issue is still present.