So we found a "feature" in Slurm today which is a bit concerning. If a job at the top of the queue requests a specific node, it blocks the primary scheduler entirely, even though jobs behind it may have space available and would not use the node the top job requested. Those other jobs instead have to wait for the backfill loop, which is quite a bit slower than the primary loop. Needless to say, this is not optimal.

It would be better for the primary loop to note that the job in question is requesting a specific node, block that node off, and then move on to schedule the jobs behind it, knowing it cannot place them on the node the first job is waiting for. That way the primary loop can reach deeper into the queue when specific nodes are requested.

This has blocked up our scheduler a couple of times now. First, we had a series of jobs sent out to rebuild specific nodes; because the nodes to be reimaged were not yet available, those jobs blocked the primary scheduler from scheduling anything lower down the priority tree. We also had a case where a user requested a specific node that would not become available for three days. As you can imagine, having the primary scheduling loop go only one job deep for several days, while plenty of job slots sit open on other nodes with jobs ready to fill them, is suboptimal.

Anyways, if we could get a fix for this put in for the next release, that would be great. All the main loop needs to do when it sees a job requesting a specific node is note that and move on, not just fail. If it gets to a job it can't schedule that isn't requesting specific nodes, then it is fine for it to stop. But we can't have it blocking on requests for specific nodes.
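To illustrate the requested behavior, here is a minimal sketch (in Python, not Slurm's actual C code; job and node names are made up) of a priority-ordered scheduling loop that, on hitting a job whose required nodes aren't free, reserves only those nodes and keeps going, stopping only at a generic job that doesn't fit:

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    ncpus: int                                        # CPUs the job needs
    required_nodes: set = field(default_factory=set)  # empty = any node is fine

def schedule(jobs, free_cpus):
    """jobs: priority-ordered list; free_cpus: {node: idle CPU count}.
    Returns a list of (job name, node) placements."""
    blocked = set()        # nodes reserved for higher-priority waiting jobs
    placements = []
    for job in jobs:
        # Candidate nodes: the required ones if given, else anything not blocked.
        candidates = job.required_nodes or (set(free_cpus) - blocked)
        node = next((n for n in sorted(candidates)
                     if n not in blocked and free_cpus.get(n, 0) >= job.ncpus),
                    None)
        if node is not None:
            free_cpus[node] -= job.ncpus
            placements.append((job.name, node))
        elif job.required_nodes:
            # Patched behavior: block only the requested nodes and move on,
            # instead of stopping the whole loop here.
            blocked |= job.required_nodes
        else:
            # A generic job that doesn't fit anywhere: stop, as before.
            break
    return placements

free = {"node1": 0, "node2": 4, "node3": 4}
jobs = [Job("reimage", 1, {"node1"}),  # node1 busy: used to stall the queue
        Job("batch_a", 4),
        Job("batch_b", 4)]
print(schedule(jobs, free))  # batch jobs still land on node2/node3
```

With the old behavior, the `reimage` job at the head of the queue would have stopped the loop and both batch jobs would have waited for backfill; here they are placed immediately on the unaffected nodes.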
Created attachment 636 [details]
block job's required nodes rather than entire queue

Commit is here: https://github.com/SchedMD/slurm/commit/eafc0a4fd99a2c7d5edfe6df118a96ca038af4c2
Closing bug; patch provided 2/20/14.