I have a job running on a partition with n nodes. The job is urgent, so I add o new nodes to the Slurm cluster, giving n+o nodes in total. The existing nodes are not interfered with.
Slurm requires a "scontrol reconfigure" to see the new nodes. When slurmctld restarts I see:
slurmctld: Purged files for defunct batch job 15
There is one such line for each running job. This defeats the purpose of being able to add nodes, because I now have to restart the jobs that were running.
We are a genomics research department and often have very long running jobs. We also often have equipment repurposed in and out of the Slurm cluster.
How can I add nodes without losing work that is currently running or queued? As far as I can see, it is not possible.
If this is an issue in 17.02 or 17.11, please let me know and we can discuss support contract options that would let you work with the support team to resolve it.