Ticket 13165 - srun: job xxx queued and waiting for resources
Summary: srun: job xxx queued and waiting for resources
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.7
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Oriol Vilarrubi
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-11 07:30 MST by Yann
Modified: 2022-02-24 03:07 MST

See Also:
Site: Université de Genève
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Yann 2022-01-11 07:30:12 MST
Hi,

Some users aren't able to run srun or salloc jobs on our cluster from the login node.


For example, this one is pending forever:

[root@login1.yggdrasil ~]# srun --uid=raine  --export=NONE /bin/hostname
srun: job 8049086 queued and waiting for resources


Related to this job on slurm1.yggdrasil slurmctld.log:

[2022-01-11T15:14:54.367] sched: _slurm_rpc_allocate_resources JobId=8049086 NodeList=(null) usec=334
[2022-01-11T15:15:07.485] sched/backfill: _start_job: Started JobId=8049086 in debug-cpu on cpu002
[2022-01-11T15:15:18.711] Killing interactive JobId=8049086: Communication connection failure
[2022-01-11T15:15:18.711] _job_complete: JobId=8049086 WEXITSTATUS 1
[2022-01-11T15:15:18.711] _job_complete: JobId=8049086 done


If I do the same as another user, the result is immediate; please note that the line "queued and waiting for resources" isn't even there:

[root@login1.yggdrasil ~]# srun --uid=sagon  --export=NONE /bin/hostname
cpu002.yggdrasil


If we do the same from the admin node (users don't have access to this server) it works, but still, with some usernames it is immediate and with others the line "queued and waiting for resources" appears. We aren't able to see why.

Thanks for your help.
Comment 1 Jason Booth 2022-01-11 10:08:07 MST
Do you have a firewall between the login node and the controller/compute nodes?

This issue looks similar to others we have seen in the past where SrunPortRange was needed as well as an opening in the firewall for that port range.

https://slurm.schedmd.com/slurm.conf.html#OPT_SrunPortRange
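For reference, the change usually means pinning srun to a fixed port range in slurm.conf and opening that range on the nodes that need to reach each other. The range below is only the example value from the documentation, and the firewall commands assume firewalld; adapt both to your site:

# slurm.conf on all nodes, then reconfigure/restart the daemons
SrunPortRange=60001-63000

# open the same range on the login node
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload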
Comment 3 Yann 2022-01-12 07:27:27 MST
Hi, thanks for the answer.

Problem solved! Thanks for the hint: it wasn't a firewall issue, but a routing one.

It was impossible to reach login1 from slurm1, as the TCP reply was coming from another server due to an error in the routing table of login1. It was, however, still possible to reach slurm1 from login1.
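For reference, a quick way to spot this kind of asymmetric routing is to look up the route in both directions (the addresses below are placeholders):

# from login1: which interface/gateway is used to reach slurm1?
ip route get <slurm1-ip>

# from slurm1: which path is used for the reply back to login1?
ip route get <login1-ip>

If the two outputs don't mirror each other, the reply can come back from an unexpected interface or host, which is exactly what happened here.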

Why the srun response is immediate for some users while for others the line "queued and waiting for resources" appears is still a mystery to us, but at least it is not blocking.
Comment 4 Oriol Vilarrubi 2022-01-12 08:54:29 MST
Hi Yann,

I'm glad that the main blocking issue is solved. Regarding the other topic (that some users' jobs do not start immediately), I've thought about it and came up with the following possibilities:

a) The users that get "blocked" have some default QoS that limits the number of jobs/resources they can use.

b) The node is not immediately ready to run the job.

c) There is a jobsubmit plugin or a cli_filter doing things under the hood.

In order to check these possibilities please do the following:

First, submit a job as you did in Comment 0 (the description):

srun --uid=raine  --export=NONE /bin/hostname

Using another shell, get the jobid of this srun (normally it is the highest jobid, because it is the newest):

squeue --user=raine

Let's suppose that the jobid is 1234 in this example.

With the jobid you can then use "scontrol show job <jobid>" to get more information about why the job is queued:

scontrol show job 1234

That will also show which QoS the job is submitted with. In case the job is stuck because of the QoS, you can check it with "sacctmgr show qos <qosname>", which will show the limits imposed by the QoS.
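For example, to print only the per-user limit fields of that QoS (the "normal" name below is just a placeholder for whatever QoS scontrol show job reports, and the format field names are the ones from the sacctmgr man page):

sacctmgr show qos normal format=name,maxjobspu,maxsubmitpu,maxtrespu

A non-empty MaxJobsPU, MaxSubmitPU or MaxTRESPU there would explain jobs of that user being held in the queue for a moment.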

And to check the last one (c), you can look at your slurm.conf, or directly at the loaded config, with this command:

scontrol show conf | egrep "JobSubmitPlugins|CliFilterPlugins"


I've also lowered the severity of the bug to 3 because this only affects some users and does not crash the system.

Greetings.
Comment 5 Yann 2022-01-27 06:37:07 MST
Hi,

In fact the small delay (on the order of seconds) is not blocking at all, it is just strange.

As you can see, I launched the job 6 times under user "ressegai" and each time the job is queued for ~1-2 seconds.
I did the same with my username and the job is launched immediately. This is always reproducible.

(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206899 queued and waiting for resources
srun: job 8206899 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206900 queued and waiting for resources
srun: job 8206900 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206901 queued and waiting for resources
srun: job 8206901 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206902 queued and waiting for resources
srun: job 8206902 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206903 queued and waiting for resources
srun: job 8206903 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai  --export=NONE /bin/hostname
srun: job 8206904 queued and waiting for resources
srun: job 8206904 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon  --export=NONE /bin/hostname
cpu001.yggdrasil
Comment 6 Oriol Vilarrubi 2022-01-28 07:29:57 MST
Hi Yann,

Most probably this is happening because you have some configuration in a job_submit plugin, a cli_filter or the default QoS that sends the jobs to the scheduling queue instead of running them directly. But just to be 100% sure of that, could you please attach the following things to this bug:
- slurm.conf
- your jobsubmit file and/or cli filter file if you have any of those
- the output of the following commands:
  - sacctmgr show user ressegai withassoc
  - sacctmgr show user sagon withassoc

For the defaultqos (last field of sacctmgr show user), execute the following command:
- sacctmgr show qos <qos_name>
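To compare the two users side by side, the same information can also be pulled from the association view directly. This is just a sketch; the field names are taken from the sacctmgr man page, adjust them if your version differs:

sacctmgr show assoc where user=ressegai,sagon format=cluster,account,user,partition,qos,defaultqos

A difference in the QOS or DefaultQOS column between the two users would point to possibility (a) from Comment 4.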


Greetings.
Comment 7 Oriol Vilarrubi 2022-02-15 05:58:28 MST
Hi Yann,

Did you find the time to execute the commands stated in my previous comment?

Greetings.
Comment 8 Oriol Vilarrubi 2022-02-24 03:07:56 MST
Hello Yann,

I'm closing this bug as timed out; if you want to reopen it, you just need to reply to the mail.

Greetings.