Hello,

After upgrading Slurm from 16.05-10 to 17.11.0-0rc2, jobs launched via sbatch fail in slurmstepd with:

    error: pam_open_session: "User not known to the underlying authentication module"

The traces on the compute node look similar to this:

    computenode slurmstepd[jobid]: task/cgroup: /slurm/uid_no/job_jobid: alloc=...
    computenode slurmstepd[jobid]: task/cgroup: /slurm/uid_no/job_jobid/steep_batch: alloc...
    computenode [jobid]: [jobid.batch][jobid]: pam_succeed_if(slurm:session): unexpected response from failed conversation function
    computenode [jobid]: [jobid.batch][jobid]: pam_succeed_if(slurm:session): error retrieving user name: Conversation error
    computenode [jobid]: [jobid.batch][jobid]: pam_mkdir(slurm:session): Failed to load module
    computenode [jobid]: [jobid.batch][jobid]: pam_mklink(slurm:session): Failed to load module
    computenode [jobid]: [jobid.batch][jobid]: pam_unix(slurm:session): opensession - error recovering username
    computenode slurmstepd[jobid]: error: pam_open_session: User not known to the underlying authentication module
    computenode slurmstepd[jobid]: error: error in pam setup
    computenode slurmstepd[jobid]: error: job_manager_exiting_abnormally, rc = 4020
    computenode slurmstepd[jobid]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
    computenode slurmstepd[jobid]: done with job

Looking at the source, we found these changes:

    job_scheduler.c: job_ptr->user_name is set only if (slurmctld_config.send_groups_in_cred)
    controller.c: slurmctld_config.send_groups_in_cred is true only if xstrcasestr(slurmctld_conf.launch_params, "send_gids")

So if slurmctld_conf.launch_params does not contain send_gids, job_ptr->user_name remains NULL.
Since we use PAM authentication, sending user_name is mandatory for us. As a workaround, we set LaunchParameters=send_gids in the slurm.conf of our controller. The problem still seems to be present in rc3.

********************************************************************************

slurm-17.11.0-0rc2/src/slurmctld/controller.c

    if (xstrcasestr(slurmctld_conf.launch_params, "send_gids"))
        slurmctld_config.send_groups_in_cred = true;
    else
        slurmctld_config.send_groups_in_cred = false;

slurm-17.11.0-0rc2/src/slurmctld/job_scheduler.c

    /* Given a scheduled job, return a pointer to it batch_job_launch_msg_t data */
    static batch_job_launch_msg_t *_build_launch_job_msg(struct job_record *job_ptr,
                                                         uint16_t protocol_version)
    {
        batch_job_launch_msg_t *launch_msg_ptr;

        /* Initialization of data structures */
        launch_msg_ptr = (batch_job_launch_msg_t *)
            xmalloc(sizeof(batch_job_launch_msg_t));
        launch_msg_ptr->job_id = job_ptr->job_id;
        launch_msg_ptr->step_id = NO_VAL;
        launch_msg_ptr->array_job_id = job_ptr->array_job_id;
        launch_msg_ptr->array_task_id = job_ptr->array_task_id;
        launch_msg_ptr->uid = job_ptr->user_id;
        launch_msg_ptr->gid = job_ptr->group_id;

        if (slurmctld_config.send_groups_in_cred) {
            /* fill in the job_record field if not yet filled in */
The excerpt in my previous comment was incomplete. Here it is in full:

slurm-17.11.0-0rc2/src/slurmctld/job_scheduler.c

    /* Given a scheduled job, return a pointer to it batch_job_launch_msg_t data */
    static batch_job_launch_msg_t *_build_launch_job_msg(struct job_record *job_ptr,
                                                         uint16_t protocol_version)
    {
        batch_job_launch_msg_t *launch_msg_ptr;

        /* Initialization of data structures */
        launch_msg_ptr = (batch_job_launch_msg_t *)
            xmalloc(sizeof(batch_job_launch_msg_t));
        launch_msg_ptr->job_id = job_ptr->job_id;
        launch_msg_ptr->step_id = NO_VAL;
        launch_msg_ptr->array_job_id = job_ptr->array_job_id;
        launch_msg_ptr->array_task_id = job_ptr->array_task_id;
        launch_msg_ptr->uid = job_ptr->user_id;
        launch_msg_ptr->gid = job_ptr->group_id;

        if (slurmctld_config.send_groups_in_cred) {
            /* fill in the job_record field if not yet filled in */
            if (!job_ptr->user_name)
                job_ptr->user_name = uid_to_string_or_null(job_ptr->user_id);
Not sending the username as part of the launch message is an intentional choice; without send_gids set, the slurmd process is supposed to fill in that field instead. It appears that, for the batch step, I've omitted the code to handle that though, and will need to fix that.
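For readers following along, the fallback described above (the compute node filling in the user name itself when the launch message carries none) might look roughly like the sketch below. This is not the actual Slurm fix: struct launch_msg_sketch, uid_to_name_sketch() and fill_user_name_sketch() are hypothetical names made up for illustration, and the only real API used is the POSIX getpwuid_r() lookup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pwd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a batch launch message; NOT the Slurm struct. */
struct launch_msg_sketch {
    uid_t uid;
    char *user_name;   /* NULL when the controller did not send a name */
};

/* Resolve a UID to a heap-allocated user name, or NULL if it is unknown. */
static char *uid_to_name_sketch(uid_t uid)
{
    struct passwd pwd;
    struct passwd *result = NULL;
    long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
    size_t buflen = (hint > 0) ? (size_t)hint : 16384;
    char *buf = malloc(buflen);
    char *name = NULL;

    if (!buf)
        return NULL;
    /* getpwuid_r() returns 0 and sets result to NULL when no entry exists. */
    if (getpwuid_r(uid, &pwd, buf, buflen, &result) == 0 && result)
        name = strdup(result->pw_name);
    free(buf);
    return name;
}

/*
 * Fill in user_name on the compute node if the message arrived without one.
 * Returns 0 if a name is available afterwards, -1 if the UID is unresolvable.
 */
static int fill_user_name_sketch(struct launch_msg_sketch *msg)
{
    assert(msg != NULL);
    if (msg->user_name)
        return 0;   /* controller already sent it (e.g. send_gids set) */
    msg->user_name = uid_to_name_sketch(msg->uid);
    return msg->user_name ? 0 : -1;
}
```

The point of doing the lookup before PAM setup is that pam_open_session() needs a user name; a message that already carries one is left untouched.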
We hit the same bug, with a different error message, on 17.11.0 when UsePAM=1 is set. The importance should be raised. Thank you.

slurmd.log:

    [2017-12-05T09:59:05.074] _run_prolog: run job script took usec=5
    [2017-12-05T09:59:05.074] _run_prolog: prolog with lock for job 68570 ran for 0 seconds
    [2017-12-05T09:59:05.142] [68570.extern] task/cgroup: /slurm/uid_20005/job_68570: alloc=512MB mem.limit=512MB memsw.limit=512MB
    [2017-12-05T09:59:05.142] [68570.extern] task/cgroup: /slurm/uid_20005/job_68570/step_extern: alloc=512MB mem.limit=512MB memsw.limit=512MB
    [2017-12-05T09:59:05.280] Launching batch job 68570 for UID 20005
    [2017-12-05T09:59:05.296] [68570.batch] task/cgroup: /slurm/uid_20005/job_68570: alloc=512MB mem.limit=512MB memsw.limit=512MB
    [2017-12-05T09:59:05.297] [68570.batch] task/cgroup: /slurm/uid_20005/job_68570/step_batch: alloc=512MB mem.limit=512MB memsw.limit=512MB
    [2017-12-05T09:59:05.301] [68570.batch] error: pam_open_session: Cannot make/remove an entry for the specified session
    [2017-12-05T09:59:05.301] [68570.batch] error: error in pam_setup
    [2017-12-05T09:59:05.322] [68570.batch] error: job_manager exiting abnormally, rc = 4020
    [2017-12-05T09:59:05.322] [68570.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
    [2017-12-05T09:59:05.324] [68570.batch] done with job
    [2017-12-05T09:59:05.376] [68570.extern] done with job

slurmctld.log:

    [2017-12-05T09:59:04.657] _slurm_rpc_submit_batch_job: JobId=68570 InitPrio=29879 usec=1944
    [2017-12-05T09:59:05.072] backfill: Started JobID=68570 in short on cstd01-002
    [2017-12-05T09:59:05.323] error: slurmd error running JobId=68570 on node(s)=cstd01-002: Slurmd could not execve job
    [2017-12-05T09:59:05.323] drain_nodes: node cstd01-002 state set to DRAIN
    [2017-12-05T09:59:05.323] _job_complete: JobID=68570 State=0x1 NodeCnt=1 WEXITSTATUS 0
    [2017-12-05T09:59:05.323] _job_complete: JobID=68570 State=0x8003 NodeCnt=1 done
Hi René - Higher severity levels are only available to SchedMD customers. I am resetting this back to Sev3. If you have any questions about this please feel free to get in touch directly. - Tim
*** Bug 4510 has been marked as a duplicate of this bug. ***
This is fixed in commit 45be4cc0d10, and will be in 17.11.1 which we anticipate releasing later today. LaunchParameters=send_gids is still strongly suggested; if that option is used then this fix is not needed in most cases. - Tim