Ticket 10697 - Submitting a job with a wrong core-GPU binding leaves it pending with reason "Resources", which impacts the main scheduler.
Summary: Submitting a job with a wrong core-GPU binding leaves it pending with reason "Resources", which impacts the main scheduler.
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-01-25 22:50 MST by wenxiaoll
Modified: 2021-01-27 19:54 MST
CC List: 1 user

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description wenxiaoll 2021-01-25 22:50:10 MST
Hi,

I am learning Slurm and have run into an issue with version 19.05.
When I submit a job requesting 16 cores and 1 GPU, the job stays in the PD state with reason "Resources", which prevents the main scheduler from handling lower-priority jobs (PD reason "Priority") in the same partition.

My gres.conf looks like this:
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=8-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=16-23
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=24-31
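
For reference, a minimal example of the kind of request that triggers this (illustrative only, not my exact command) asks a single node for more cores than are bound to any one GPU:

        # illustrative: 16 cores plus 1 GPU, but gres.conf above binds only 8 cores to each GPU
        sbatch -N 1 -n 1 --cpus-per-task=16 --gres=gpu:1 --wrap="sleep 600"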

After investigating the problem, I found the following reasons:
1. Why can this job be submitted at all?
In my environment several nodes are in a powered-down state, so they do not send a registration message to the controller and slurmctld has no core-GPU binding information for them. The job is therefore accepted by Slurm and would normally pend with reason "ReqNotAvail", which does not impact the main scheduler.
2. Why is this job pending with reason "Resources"?
Because there is a running job submitted with --exclusive, the bits for that job's nodes have been cleared from the global variable share_node_bitmap. When the problem job is evaluated, _pick_best_nodes() sets nodes_busy to true because of the exclusive job. After select_g_job_test() is executed against the list of nodes that exist in any state and succeeds, the code reaches logic 1 below, but since nodes_busy is true that condition fails and it falls through to logic 2, returning ESLURM_NODES_BUSY. That is the cause of this problem.
        
        //logic 1 in _pick_best_nodes: the job is only rejected outright when
        //no node could ever satisfy it AND no node is held busy
        else if (!runable_avail && !nodes_busy) {
                error_code = ESLURM_NODE_NOT_AVAIL;
        }

        //logic 2 in _pick_best_nodes: any remaining SLURM_SUCCESS is turned
        //into ESLURM_NODES_BUSY, which leaves the job pending with reason
        //"Resources"
        if (error_code == SLURM_SUCCESS) {
                error_code = ESLURM_NODES_BUSY;
                *select_bitmap = possible_bitmap;
        } else {
                FREE_NULL_BITMAP(possible_bitmap);
        }
        return error_code;
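
To make the bitmap reasoning concrete, here is a small standalone illustration using plain integer masks instead of Slurm's bitstr_t API (simplified, only to show the idea behind the super-set check proposed below):

        #include <stdio.h>
        #include <stdint.h>

        /* Return 1 if every bit set in sub is also set in super,
         * the same idea as bit_super_set(). */
        static int superset(uint64_t sub, uint64_t super)
        {
                return (sub & super) == sub;
        }

        int main(void)
        {
                /* One bit per node; a set bit in share_mask means the node is
                 * NOT held by an exclusive job (like share_node_bitmap). */
                uint64_t share_mask = 0x0F;      /* 4 nodes, all sharable at first */
                share_mask &= ~(uint64_t)0x03;   /* --exclusive job holds nodes 0-1 */

                uint64_t only_idle = 0x0C;       /* candidate nodes 2-3 */
                uint64_t with_excl = 0x06;       /* candidate nodes 1-2 */

                /* 1 -> busyness is not the blocker, the job could be rejected;
                 * 0 -> an exclusive job really is in the way, keep it pending. */
                printf("candidates 2-3 all sharable: %d\n",
                       superset(only_idle, share_mask));
                printf("candidates 1-2 all sharable: %d\n",
                       superset(with_excl, share_mask));
                return 0;
        }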

To avoid this problem, I would like to propose the following solutions; please advise.
(1) I think the "nodes_busy" variable is not a good checking condition. What about changing "!nodes_busy" to "bit_super_set(possible_bitmap, share_node_bitmap)" in logic 1? I verified this in my local environment and it resolves the problem (see the sketch after this list).
(2) What about removing gres.conf so that the core-GPU resource binding is removed?
(3) What about using cli_filter to filter out jobs with incorrect bindings?
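
Here is a sketch of change (1) as I tested it locally; the exact surrounding code in _pick_best_nodes may differ in other versions:

        //logic 1 in _pick_best_nodes, with the proposed condition
        else if (!runable_avail &&
                 bit_super_set(possible_bitmap, share_node_bitmap)) {
                /* The request can never be satisfied and none of the candidate
                 * nodes are held by an exclusive job, so report
                 * ESLURM_NODE_NOT_AVAIL instead of leaving the job pending
                 * with reason "Resources". */
                error_code = ESLURM_NODE_NOT_AVAIL;
        }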


Your response would be much appreciated.