Bug 13385

Summary:	nss_slurm plugin unable to resolve user name
Product:	Slurm	Reporter:	Francesco De Martino <fdm>
Component:	nss_slurm	Assignee:	Tim McMullan <mcmullan>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	mcmullan, schedmd-contacts, tim
Version:	21.08.5
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=13271
Site:	DS9	Alineos Sites:	---
Version Fixed:	21.08.6 22.05pre1	Target Release:	---

Description Francesco De Martino 2022-02-09 02:22:19 MST

Hi,

After enabling the nss_slurm plugin, we are seeing the following behavior:

* srun whoami returns the correct username.
* sbatch --wrap="whoami" returns nobody instead of the username.
* sbatch --wrap="srun whoami" returns the correct username.

Our nsswitch.conf looks like:
```
...
passwd:         slurm files sss
group:          slurm files sss
shadow:         files sss
...
```
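
To confirm the lookup order on a given node, something like the following could be used. This is only a sketch: the helper name `nss_sources` is made up for illustration, and it assumes the simple `database: source source ...` layout shown above (any bracketed action items such as `[NOTFOUND=return]` would be printed as if they were sources).

```shell
# Print the lookup sources configured for one database in an
# nsswitch.conf-style file, one per line, in lookup order.
# Usage: nss_sources <database> [file]
# (nss_sources is a hypothetical helper, not part of Slurm or glibc.)
nss_sources() (
    db="$1"
    file="${2:-/etc/nsswitch.conf}"
    # Strip comments, then print every field after the "<database>:" key.
    sed 's/#.*//' "$file" |
        awk -v key="$db:" '$1 == key { for (i = 2; i <= NF; i++) print $i }'
)
```

With the configuration above, `nss_sources passwd` should list `slurm` first, followed by `files` and `sss`.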

Additional details and logs:

Submitting a job with sbatch as a local user (ec2-user) works as expected:
```shell
[ec2-user@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 1
[ec2-user@ip-192-168-39-241 ~]$ cat slurm-1.out 
uid=1000(ec2-user) gid=1000(ec2-user) groups=1000(ec2-user),4(adm),10(wheel),190(systemd-journal)
```

Submitting a job with sbatch as a domain user (PclusterUser1) resolves the user name as nobody:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 2
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-2.out 
uid=1896801142(nobody) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Submitting a job with sbatch+srun as a domain user (PclusterUser1) works as expected:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="srun id"
Submitted batch job 3
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-3.out 
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Submitting a job with srun as a domain user (PclusterUser1) works as expected:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ srun id
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```
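
The four cases above can be compared mechanically by pulling the resolved name out of the `id` output. A minimal sketch, assuming the usual `uid=N(name) gid=...` output format seen above; the function names `id_user_name` and `check_id_output` are hypothetical and exist only for this illustration.

```shell
# Extract the user name that id(1) printed for the uid field,
# e.g. "uid=1000(ec2-user) gid=..." -> "ec2-user".
# (Hypothetical helper for illustration.)
id_user_name() {
    printf '%s\n' "$1" | sed -n 's/^uid=[0-9]*(\([^)]*\)).*/\1/p'
}

# Succeed only if the name in an id(1) output line matches the expected user.
# Usage: check_id_output <expected-user> <id-output-line>
check_id_output() (
    expected="$1"
    actual=$(id_user_name "$2")
    if [ "$actual" = "$expected" ]; then
        echo "OK: resolved as $actual"
    else
        echo "MISMATCH: expected $expected, got ${actual:-<none>}"
        return 1
    fi
)
```

For example, `check_id_output PclusterUser1 "$(cat slurm-2.out)"` would report a mismatch for job 2, while the same check against slurm-3.out would pass.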

slurmd log on the compute node:
```
[2022-02-02T14:12:07.154] error: Node configuration differs from hardware: CPUs=4:4(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:2(hw)
[2022-02-02T14:12:07.160] CPU frequency setting not configured for this node
[2022-02-02T14:12:07.165] slurmd version 21.08.5 started
[2022-02-02T14:12:07.170] slurmd started on Wed, 02 Feb 2022 14:12:07 +0000
[2022-02-02T14:12:07.170] CPUs=4 Boards=1 Sockets=4 Cores=1 Threads=1 Memory=7623 TmpDisk=35827 Uptime=70 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-02-02T14:40:34.888] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU input mask for node: 0x1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU final HW mask for node: 0x1
[2022-02-02T14:40:34.888] Launching batch job 1 for UID 1000
[2022-02-02T14:40:34.979] [1.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:40:34.981] [1.batch] done with job
[2022-02-02T14:42:20.026] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 2
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU input mask for node: 0x1
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU final HW mask for node: 0x1
[2022-02-02T14:42:20.027] Launching batch job 2 for UID 1896801142
[2022-02-02T14:42:20.065] [2.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:42:20.067] [2.batch] done with job
[2022-02-02T14:43:06.094] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 3
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU input mask for node: 0x1
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU final HW mask for node: 0x1
[2022-02-02T14:43:06.094] Launching batch job 3 for UID 1896801142
[2022-02-02T14:43:06.735] launch task StepId=3.0 request from UID:1896801142 GID:1896800513 HOST:192.168.102.117 PORT:45550
[2022-02-02T14:43:06.736] task/affinity: lllp_distribution: JobId=3 implicit auto binding: sockets,one_thread, dist 1
[2022-02-02T14:43:06.736] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2022-02-02T14:43:06.736] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [3]: mask_cpu,one_thread, 0x1
[2022-02-02T14:43:06.753] [3.0] done with job
[2022-02-02T14:43:06.760] [3.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:43:06.762] [3.batch] done with job
[2022-02-02T14:55:00.947] launch task StepId=4.0 request from UID:1896801142 GID:1896800513 HOST:192.168.39.241 PORT:45796
[2022-02-02T14:55:00.947] task/affinity: lllp_distribution: JobId=4 implicit auto binding: sockets,one_thread, dist 8192
[2022-02-02T14:55:00.947] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2022-02-02T14:55:00.947] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4]: mask_cpu,one_thread, 0x1
[2022-02-02T14:55:00.987] [4.0] done with job
```
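
The "Launching batch job" lines in this log show jobs 2 and 3 being started for the same numeric UID (1896801142), which matches the `id` output: the UID is correct in both cases and only the name differs. A small sketch for extracting those job/UID pairs from a slurmd log; the function name `batch_launch_uids` is hypothetical, and the log file location depends on the site's configuration.

```shell
# List "job-id uid" pairs for every batch launch recorded in a slurmd log.
# Reads the files given as arguments, or stdin when none are given.
# (Hypothetical helper for illustration.)
batch_launch_uids() {
    sed -n 's/.*Launching batch job \([0-9]*\) for UID \([0-9]*\).*/\1 \2/p' "$@"
}
```

Run against the excerpt above, this would print one line each for jobs 1, 2, and 3, with jobs 2 and 3 sharing the domain user's UID.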

Comment 4 Tim McMullan 2022-02-09 13:39:23 MST

Hi Francesco,

We've landed a patch that fixes this issue; it will be available starting in 21.08.6 (https://github.com/SchedMD/slurm/commit/d567b0c).

I'll resolve this ticket for now, but please let us know if the problem persists or if you run into any other issues!

Thanks!
--Tim