Bug 13385

Summary: nss_slurm plugin unable to resolve user name
Product: Slurm
Reporter: Francesco De Martino <fdm>
Component: nss_slurm
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: mcmullan, schedmd-contacts, tim
Version: 21.08.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=13271
Site: DS9
Version Fixed: 21.08.6 22.05pre1

Description Francesco De Martino 2022-02-09 02:22:19 MST
Hi,

After enabling the nss_slurm plugin, we are seeing the following behavior:

* srun whoami returns the correct username.
* sbatch --wrap="whoami" returns nobody
* sbatch --wrap="srun whoami" returns the correct username.

Our nsswitch.conf looks like:
```
...
passwd:         slurm files sss
group:            slurm files sss
shadow:         files sss
...
```
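
To confirm which sources a node actually consults, and in what order, the relevant line can be pulled out of nsswitch.conf. This is a minimal sketch; `nss_sources` is a hypothetical helper name, and the default path /etc/nsswitch.conf is an assumption about the node layout.

```shell
#!/bin/sh
# Hypothetical helper: print the lookup order configured for one
# nsswitch database (for example passwd or group).
# Usage: nss_sources <database> [file]
nss_sources() {
    db=$1
    file=${2:-/etc/nsswitch.conf}   # assumed default location
    # Drop comments, match the "<database>:" key, print only the sources.
    sed 's/#.*//' "$file" |
        awk -v key="$db:" '$1 == key { $1 = ""; sub(/^ +/, ""); print }'
}
```

With the configuration above, `nss_sources passwd` should list `slurm` first, followed by `files` and `sss`.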

Additional details and logs:

Submitting a job with sbatch as a local user (ec2-user) works as expected:
```shell
[ec2-user@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 1
[ec2-user@ip-192-168-39-241 ~]$ cat slurm-1.out 
uid=1000(ec2-user) gid=1000(ec2-user) groups=1000(ec2-user),4(adm),10(wheel),190(systemd-journal)
```

Submitting a job with sbatch as a domain user (PclusterUser1) returns nobody as the user name:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 2
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-2.out 
uid=1896801142(nobody) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Submitting a job with sbatch + srun as a domain user (PclusterUser1) works as expected:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="srun id"
Submitted batch job 3
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-3.out 
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Running srun directly as a domain user (PclusterUser1) works as expected:
```shell
[PclusterUser1@ip-192-168-39-241 ~]$ srun id
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```
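
The four cases above differ only in the name printed inside the parentheses after the uid, so the comparison can be automated by parsing the `id` output. The sketch below is illustrative; `id_user_name` and `is_unresolved` are hypothetical helper names, and treating `nobody` as the failure marker is an assumption taken from the output in this report.

```shell
#!/bin/sh
# Hypothetical helper: extract the user name from the uid field of id(1)
# output, e.g. "uid=1000(ec2-user) gid=..." gives "ec2-user".
id_user_name() {
    printf '%s\n' "$1" | sed -n 's/^uid=[0-9][0-9]*(\([^)]*\)).*/\1/p'
}

# Hypothetical helper: succeed when the uid resolved to the "nobody"
# placeholder (assumption: this is how the failure shows up, as in job 2).
is_unresolved() {
    [ "$(id_user_name "$1")" = "nobody" ]
}
```

For example, `is_unresolved "$(cat slurm-2.out)"` would succeed for the failing sbatch case, while the same check on slurm-3.out would not.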

slurmd log on the compute node:
```
[2022-02-02T14:12:07.154] error: Node configuration differs from hardware: CPUs=4:4(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:2(hw)
[2022-02-02T14:12:07.160] CPU frequency setting not configured for this node
[2022-02-02T14:12:07.165] slurmd version 21.08.5 started
[2022-02-02T14:12:07.170] slurmd started on Wed, 02 Feb 2022 14:12:07 +0000
[2022-02-02T14:12:07.170] CPUs=4 Boards=1 Sockets=4 Cores=1 Threads=1 Memory=7623 TmpDisk=35827 Uptime=70 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-02-02T14:40:34.888] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU input mask for node: 0x1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU final HW mask for node: 0x1
[2022-02-02T14:40:34.888] Launching batch job 1 for UID 1000
[2022-02-02T14:40:34.979] [1.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:40:34.981] [1.batch] done with job
[2022-02-02T14:42:20.026] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 2
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU input mask for node: 0x1
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU final HW mask for node: 0x1
[2022-02-02T14:42:20.027] Launching batch job 2 for UID 1896801142
[2022-02-02T14:42:20.065] [2.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:42:20.067] [2.batch] done with job
[2022-02-02T14:43:06.094] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 3
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU input mask for node: 0x1
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU final HW mask for node: 0x1
[2022-02-02T14:43:06.094] Launching batch job 3 for UID 1896801142
[2022-02-02T14:43:06.735] launch task StepId=3.0 request from UID:1896801142 GID:1896800513 HOST:192.168.102.117 PORT:45550
[2022-02-02T14:43:06.736] task/affinity: lllp_distribution: JobId=3 implicit auto binding: sockets,one_thread, dist 1
[2022-02-02T14:43:06.736] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2022-02-02T14:43:06.736] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [3]: mask_cpu,one_thread, 0x1
[2022-02-02T14:43:06.753] [3.0] done with job
[2022-02-02T14:43:06.760] [3.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:43:06.762] [3.batch] done with job
[2022-02-02T14:55:00.947] launch task StepId=4.0 request from UID:1896801142 GID:1896800513 HOST:192.168.39.241 PORT:45796
[2022-02-02T14:55:00.947] task/affinity: lllp_distribution: JobId=4 implicit auto binding: sockets,one_thread, dist 8192
[2022-02-02T14:55:00.947] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2022-02-02T14:55:00.947] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4]: mask_cpu,one_thread, 0x1
[2022-02-02T14:55:00.987] [4.0] done with job
```
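
The log shows slurmd launching batch jobs 2 and 3 with the same numeric UID (1896801142), so the UID itself reaches the node correctly and only the name lookup in the batch step differs. A small filter makes that easier to see in a longer log. This is a sketch; `batch_launch_uids` is a hypothetical helper name, and the log path is whatever the node is configured to use.

```shell
#!/bin/sh
# Hypothetical helper: print "<job id> <uid>" for every batch launch
# recorded in a slurmd log file.
# Usage: batch_launch_uids <slurmd log file>
batch_launch_uids() {
    sed -n 's/^.*Launching batch job \([0-9][0-9]*\) for UID \([0-9][0-9]*\).*$/\1 \2/p' "$1"
}
```

Run against the log above, this would print one line per batch job (1, 2, and 3) with the UID that slurmd used for each launch.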
Comment 4 Tim McMullan 2022-02-09 13:39:23 MST
Hi Francesco,

We've landed a patch that fixes this issue; it will be available starting in 21.08.6 (https://github.com/SchedMD/slurm/commit/d567b0c).

Please let us know if you run into any other issues! I'll resolve this ticket for now, but if the problem persists, please let us know.

Thanks!
--Tim