6553 – Changes to MaxArraySize require slurmctld restart


Status:	RESOLVED INVALID

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Scheduling
Version:	17.11.8
Hardware:	Linux Linux

Importance:	--- 6 - No support contract
Assignee:	Jacob Jenson

Reported:	2019-02-21 08:03 MST by Jeff Frey
Modified:	2019-02-21 08:03 MST
CC List:	0 users



Description Jeff Frey 2019-02-21 08:03:00 MST

MaxArraySize was increased from 1000 to 10000 in slurm.conf; the new config file was distributed to all nodes; an "scontrol reconfigure" returned with no errors or warnings.  Thereafter, "scontrol show config" indeed showed "MaxArraySize=10000" when queried.  However, users submitting array jobs with indices higher than 1000 (e.g. "0-2449%108") were denied with "Invalid job array specification" errors.

The problem is the _valid_array_inx() function.  The first time that function is called, it caches the value of MaxArraySize AT THAT TIME in the static global "max_array_size" (declared in slurmctld/job_mgr.c).  In this case, that was the original value of 1000.  Later in _valid_array_inx(), some code checks the slurmctld config timestamp and, if the config is newer, pulls the new value of MaxArraySize into the static local variable "max_task_cnt".  The function subsequently uses "max_array_size", not "max_task_cnt", to allocate and index-check the array specification, effectively ignoring the new value of MaxArraySize.  The error cited by our users stems from


		valid = _parse_array_tok(tok, job_desc->array_bitmap,
					 max_array_size);


for which the third argument continues to be the originally-configured MaxArraySize (1000) despite the successful "scontrol reconfigure" etc.
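
To make the failure mode concrete, here is a minimal standalone sketch of the caching pattern described above.  It is not the slurmctld source: "demo_conf", "demo_reconfigure()", "demo_valid_index_stale()" and the use of 0 as the "not yet cached" sentinel are all hypothetical names and choices made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the controller's live configuration. */
struct demo_conf {
	uint32_t max_array_size;	/* mirrors MaxArraySize */
	time_t last_update;		/* bumped on every reconfigure */
};

static struct demo_conf demo_conf = { 1000, 1 };

/* Simulates the effect of "scontrol reconfigure" on the live config. */
static void demo_reconfigure(uint32_t new_size)
{
	assert(new_size > 0);
	demo_conf.max_array_size = new_size;
	demo_conf.last_update++;
}

/* Filled in on first use and never refreshed afterwards (assumed sentinel: 0). */
static uint32_t cached_max_array_size = 0;

/* Index check that keeps using the value cached on the first call. */
static bool demo_valid_index_stale(uint32_t idx)
{
	if (cached_max_array_size == 0)
		cached_max_array_size = demo_conf.max_array_size;
	return idx < cached_max_array_size;
}
```

In this sketch, calling demo_valid_index_stale(2449) once, then demo_reconfigure(10000), then demo_valid_index_stale(2449) again still rejects the index, even though demo_conf.max_array_size now reads 10000.  That matches the symptom we saw: "scontrol show config" reports the new value while submissions are checked against the old one.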


Other code inside slurmctld seems to be doing the same thing -- caching the initial value of MaxArraySize and ignoring updates.  E.g. _xlate_array_dep() in slurmctld/job_scheduler.c.  This implies that, unless all such functions are rewritten, "scontrol reconfigure" should at least warn that the configuration change will require a restart of slurmctld.
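
For comparison, one way such a function could be rewritten is to key the cached limit to the config timestamp and refresh it whenever the timestamp changes.  The following is only a sketch under assumptions, not a proposed patch: "sketch_conf", "sketch_reconfigure()", "sketch_current_max_array_size()" and "sketch_valid_index()" are hypothetical names, and the real code would read the controller's own config structure instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the controller's live configuration. */
struct sketch_conf {
	uint32_t max_array_size;	/* mirrors MaxArraySize */
	time_t last_update;		/* bumped on every reconfigure */
};

static struct sketch_conf sketch_conf = { 1000, 1 };

/* Simulates the effect of "scontrol reconfigure" on the live config. */
static void sketch_reconfigure(uint32_t new_size)
{
	assert(new_size > 0);
	sketch_conf.max_array_size = new_size;
	sketch_conf.last_update++;
}

/* Returns the limit, re-reading it whenever the config timestamp changed. */
static uint32_t sketch_current_max_array_size(void)
{
	static uint32_t cached_size = 0;
	static time_t cached_at = 0;	/* timestamp the cache was taken at */

	if (cached_at != sketch_conf.last_update) {
		cached_size = sketch_conf.max_array_size;
		cached_at = sketch_conf.last_update;
	}
	return cached_size;
}

/* Index check that always goes through the refreshing accessor. */
static bool sketch_valid_index(uint32_t idx)
{
	return idx < sketch_current_max_array_size();
}
```

With this shape, an index of 2449 is rejected while the limit is 1000 and accepted after sketch_reconfigure(10000), with no restart involved.  Routing every caller through a single accessor would also avoid each function keeping its own private copy of the limit.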

Comment 1 Jeff Frey 2019-02-21 08:03:44 MST

I should also mention that the code has not changed at the head of the source tree in git, so this issue is still present.