Bug 3560 - Apparently Not Possible to Expand Partition Without Loss Of Running Jobs
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 16.05.0
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2017-03-09 05:07 MST by Sam Agnew
Modified: 2017-11-09 13:54 MST
Site: -Other-


Description Sam Agnew 2017-03-09 05:07:07 MST
I have a job running on a partition with n nodes. The job is urgent, so I add o new nodes to the Slurm cluster, giving n+o nodes in total. The existing nodes are not interfered with.

Slurm requires a "scontrol reconfigure" to see the new nodes. When slurmctld restarts I see:

slurmctld: Purged files for defunct batch job 15

There is one such line for each running job. This defeats the purpose of being able to add nodes, because I now have to restart the jobs that were running.
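
To list which jobs were affected, the job IDs can be pulled out of the slurmctld output. This is only a minimal sketch: it assumes the message wording is exactly as in the log line quoted above, and the function name purged_job_ids is hypothetical, not part of Slurm.

```python
import re

# Pattern for the slurmctld message quoted in this report. The exact
# wording is an assumption taken from that single log line.
PURGED_RE = re.compile(r"Purged files for defunct batch job (\d+)")


def purged_job_ids(log_text):
    """Return the job IDs named in 'Purged files' lines, in log order."""
    return [int(m.group(1)) for m in PURGED_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = "slurmctld: Purged files for defunct batch job 15\n"
    print(purged_job_ids(sample))  # prints [15]
```

Feeding it the slurmctld log text captured around the restart gives the list of batch jobs that would need to be resubmitted.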

We are a genomics research department and often have very long-running jobs. We also often have equipment repurposed in and out of the Slurm cluster.

How can I add nodes without losing work that is currently running or queued? As far as I can see, it does not appear to be possible.

Sam
Comment 1 Jacob Jenson 2017-11-09 13:54:50 MST
If this is an issue in 17.02 or 17.11, please let me know and we can discuss support contract options, which will enable you to work with the support team to resolve this issue.

Jacob