Hi,

Some users aren't able to run srun or salloc jobs on our cluster from the login node. For example, this one is pending forever:

[root@login1.yggdrasil ~]# srun --uid=raine --export=NONE /bin/hostname
srun: job 8049086 queued and waiting for resources

Related to this job, slurmctld.log on slurm1.yggdrasil shows:

[2022-01-11T15:14:54.367] sched: _slurm_rpc_allocate_resources JobId=8049086 NodeList=(null) usec=334
[2022-01-11T15:15:07.485] sched/backfill: _start_job: Started JobId=8049086 in debug-cpu on cpu002
[2022-01-11T15:15:18.711] Killing interactive JobId=8049086: Communication connection failure
[2022-01-11T15:15:18.711] _job_complete: JobId=8049086 WEXITSTATUS 1
[2022-01-11T15:15:18.711] _job_complete: JobId=8049086 done

If I do the same as another user, the result is immediate, and please note that the line "queued and waiting for resources" doesn't even appear:

[root@login1.yggdrasil ~]# srun --uid=sagon --export=NONE /bin/hostname
cpu002.yggdrasil

If we do the same from the admin node (users don't have access to this server) it works, but still, with some usernames it is immediate and with others the line "queued and waiting for resources" appears. We aren't able to see why.

Thanks for your help.
Do you have a firewall between the login node and the controller/compute nodes?

This issue looks similar to others we have seen in the past, where setting SrunPortRange was needed, as well as opening that port range in the firewall:

https://slurm.schedmd.com/slurm.conf.html#OPT_SrunPortRange
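srun listens on ports on the login node that slurmctld and the slurmd daemons connect back to, so those ports have to be reachable from the controller and compute nodes. As a rough sketch (the port range below is only an example, and the firewall-cmd lines assume firewalld; adapt to whatever firewall you actually run):

# slurm.conf on all nodes: pin srun to a known port range
SrunPortRange=60001-63000

# on the login node: open that range so the Slurm nodes can connect back
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload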
Hi, thanks for the answer. Problem solved!

Thanks for the hint: this wasn't a firewall, but a routing issue. It was impossible to reach login1 from slurm1, as the "tcp answer" was coming from another server due to an error in the routing table of login1. It was, however, still possible to reach slurm1 from login1.

Why the srun answer is immediate for some users while for others the line "queued and waiting for resources" appears is still a mystery to us, but at least it is not blocking.
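For the record, this kind of asymmetric reachability can be spotted by checking which route each side uses for its replies (the IP addresses below are placeholders):

# on login1: which interface/source would the reply towards slurm1 use?
ip route get <slurm1-ip>

# on slurm1: does the path back to login1 match?
ip route get <login1-ip>
traceroute <login1-ip>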
Hi Yann,

I'm glad that the most blocking issue is solved. Regarding the other topic (that some users' jobs do not start immediately), I've thought about it and came up with the following possibilities:

a) The users that get "blocked" have some default QOS that limits the number of jobs/resources they can use.
b) The node is not ready to execute the job.
c) There is a job_submit plugin or a cli_filter doing things under the hood.

In order to check these possibilities, please do the following.

First, submit a job as you did in comment 0:

srun --uid=raine --export=NONE /bin/hostname

Using another shell, get the jobid of this srun (normally it is the highest jobid, because it is the newest):

squeue --user=raine

Let's suppose that the jobid is 1234 in this example. With the jobid you can then use "scontrol show job <jobid>" to get more information about why the job is queued:

scontrol show job 1234

That will also show which QOS the job was submitted with. In case the job is stuck because of the QOS, you can check the limits imposed by that QOS with "sacctmgr show qos <qosname>".

And to check the last one (c), you can look at your slurm.conf, or directly at the loaded config with this command:

scontrol show conf | egrep "JobSubmitPlugins|CliFilterPlugins"

I've also lowered the severity of the bug to 3, because this only affects some users and does not crash the system.

Greetings.
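For reference, the "Reason" field in the scontrol output usually tells you why the job is sitting in the queue. Narrowing the output down to the interesting fields could look like this (the jobid 1234 is just a placeholder):

scontrol show job 1234 | egrep "JobState|Reason|QOS"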
Hi,

In fact the small delay (on the order of seconds) is not blocking at all, it is just strange.

As you can see below, I launched the job 6 times as user "ressegai" and the job is queued for ~1-2 seconds each time. I did the same with my username and the job is launched immediately. This is always reproducible.

(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206899 queued and waiting for resources
srun: job 8206899 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206900 queued and waiting for resources
srun: job 8206900 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206901 queued and waiting for resources
srun: job 8206901 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206902 queued and waiting for resources
srun: job 8206902 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206903 queued and waiting for resources
srun: job 8206903 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=ressegai --export=NONE /bin/hostname
srun: job 8206904 queued and waiting for resources
srun: job 8206904 has been allocated resources
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
(yggdrasil)-[root@login1 ~]$ srun --uid=sagon --export=NONE /bin/hostname
cpu001.yggdrasil
Hi Yann,

Most probably this is happening because some configuration in a job_submit plugin, a cli_filter, or the default QOS makes these jobs go to the scheduling queue instead of running directly.

But just to be 100% sure of that, could you please attach the following things to this bug:

- slurm.conf
- your job_submit file and/or cli_filter file, if you have any of those
- the output of the following commands:
  - sacctmgr show user ressegai withassoc
  - sacctmgr show user sagon withassoc

For the default QOS (last field of sacctmgr show user), execute the following command:

- sacctmgr show qos <qos_name>

Greetings.
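To illustrate what I will be looking for in those outputs: the default QOS of each user's association, and the per-user limits on that QOS. A possible way to narrow the output down (the format fields below are how I would typically query this, and the QOS name "normal" is just a placeholder):

sacctmgr show user ressegai withassoc format=User,Account,DefaultQOS,QOS%40
sacctmgr show user sagon withassoc format=User,Account,DefaultQOS,QOS%40
sacctmgr show qos normal format=Name,Flags,MaxJobsPU,MaxSubmitPU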
Hi Yann,

Did you find the time to execute the commands stated in my previous comment?

Greetings.
Hello Yann,

I'm closing this bug as timed out. If you want to reopen it, you just need to reply to the mail.

Greetings.