Bug 6553 - Changes to MaxArraySize require slurmctld restart
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.8
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2019-02-21 08:03 MST by Jeff Frey
Modified: 2019-02-21 08:03 MST (History)
Site: -Other-

Description Jeff Frey 2019-02-21 08:03:00 MST
MaxArraySize was increased from 1000 to 10000 in slurm.conf; the new config file was distributed to all nodes; an "scontrol reconfigure" returned with no errors or warnings.  Thereafter, "scontrol show config" indeed showed "MaxArraySize=10000" when queried.  However, users submitting array jobs with indices higher than 1000 (e.g. "0-2449%108") were denied with "Invalid job array specification" errors.

The problem is the _valid_array_inx() function.  The first time that function is called, it caches the value of MaxArraySize AT THAT TIME in the static global "max_array_size" (declared in slurmctld/job_mgr.c).  In this case, that was the original value of 1000.  Later in _valid_array_inx(), some code checks the slurmctld config timestamp and, if the config is newer, pulls the new value of MaxArraySize into the static local variable "max_task_cnt".  However, the function subsequently uses "max_array_size", not "max_task_cnt", to allocate and index-check the array specification, effectively ignoring the new value of MaxArraySize.  The error cited by our users stems from


		valid = _parse_array_tok(tok, job_desc->array_bitmap,
					 max_array_size);


for which the third argument continues to be the originally-configured MaxArraySize (1000) despite the successful "scontrol reconfigure" etc.
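
For illustration, the caching pattern described above can be modeled in a few lines of standalone C.  This is a simplified sketch, not the Slurm source: "demo_conf", "demo_reconfigure()", "demo_valid_array_inx()" and "DEMO_UNSET" are hypothetical stand-ins, and only the names "max_array_size" and "max_task_cnt" are taken from the report.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define DEMO_UNSET UINT32_MAX   /* hypothetical "not yet cached" marker */

/* Hypothetical stand-in for the controller's configuration record. */
struct demo_conf {
    uint32_t max_array_size;    /* models MaxArraySize */
    time_t   last_update;       /* models the config timestamp */
};

static struct demo_conf demo_conf = { 1000, 1 };

/* File-scope cache: filled on first use and never refreshed afterwards. */
static uint32_t max_array_size = DEMO_UNSET;

/* Models "scontrol reconfigure": new value, newer timestamp. */
static void demo_reconfigure(uint32_t new_max)
{
    demo_conf.max_array_size = new_max;
    demo_conf.last_update++;
}

/* Returns true if an array spec whose highest index is last_index
 * would pass the bounds check. */
static bool demo_valid_array_inx(uint32_t last_index)
{
    static time_t   seen_update  = 0;
    static uint32_t max_task_cnt = DEMO_UNSET;

    /* Cached on the first call only. */
    if (max_array_size == DEMO_UNSET)
        max_array_size = demo_conf.max_array_size;

    /* The timestamp check does pick up the new value ... */
    if (seen_update != demo_conf.last_update) {
        max_task_cnt = demo_conf.max_array_size;
        seen_update  = demo_conf.last_update;
    }
    (void) max_task_cnt;

    /* ... but the bounds check still uses the stale file-scope cache. */
    return last_index < max_array_size;
}
```

In this model, calling demo_valid_array_inx(2449) before and after demo_reconfigure(10000) rejects the request both times, even though demo_conf.max_array_size already reads 10000.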


Other code inside slurmctld seems to be doing the same thing -- caching the initial value of MaxArraySize and ignoring updates -- e.g. _xlate_array_dep() in slurmctld/job_scheduler.c.  This implies that, short of rewriting all such functions, "scontrol reconfigure" should at least warn that the configuration change requires a restart of slurmctld.
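
As a rough idea of what rewriting such a function could involve, one option is to route every read of the limit through a small accessor that re-reads the configuration whenever its timestamp changes.  The sketch below is an assumption about one possible approach, not a patch against Slurm: "sketch_conf" and "sketch_max_array_size()" are hypothetical names, and locking around the configuration record is omitted.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the controller's configuration record. */
struct sketch_conf {
    uint32_t max_array_size;    /* models MaxArraySize */
    time_t   last_update;       /* bumped on every reconfigure */
};

static struct sketch_conf sketch_conf = { 1000, 1 };

/* Returns the current limit, refreshing the cached copy whenever the
 * configuration timestamp differs from the one seen at the last refresh. */
static uint32_t sketch_max_array_size(void)
{
    static time_t   cached_at = 0;
    static uint32_t cached    = 0;

    if (cached_at != sketch_conf.last_update) {
        cached    = sketch_conf.max_array_size;
        cached_at = sketch_conf.last_update;
    }
    return cached;
}
```

Callers would then pass sketch_max_array_size() to the bounds check instead of a file-scope variable, so a reconfigure that raises the limit takes effect on the next submission.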
Comment 1 Jeff Frey 2019-02-21 08:03:44 MST
I should also mention that the code has not changed at the head of the source tree in git, so this issue is still present.