Bug 4412

Summary: sbatch jobs fail in slurmstepd:error pam_open_session "User not known to the underlying authentication module"
Product: Slurm    Reporter: Regine Gaudin <regine.gaudin>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: kaizaad, kilian, matthieu.hautreux, rene.oertel
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: CEA Alineos Sites: ---
Bull/Atos Sites: --- Confidential Site: ---
Cray Sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
SFW Sites: --- SNIC sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 17.11.1
Target Release: --- DevPrio: ---

Description Regine Gaudin 2017-11-22 03:58:09 MST
Hello

After upgrading Slurm from 16.05-10 to 17.11.0-0rc2, jobs launched via
sbatch fail with the slurmstepd error:
pam_open_session "User not known to the underlying authentication module"

The traces on the compute node are similar to:

computenode slurmstepd[jobid]: task/cgroup: /slurm/uid_no/job_jobid: alloc=...

computenode slurmstepd[jobid]: task/cgroup: /slurm/uid_no/job_jobid/step_batch: alloc...

computenode [jobid]: [jobid.batch][jobid]: pam_succeed_if(slurm:session): unexpected response from failed conversation function

computenode [jobid]: [jobid.batch][jobid]: pam_succeed_if(slurm:session): error retrieving user name: Conversation error

computenode [jobid]: [jobid.batch][jobid]: pam_mkdir(slurm:session): Failed to load module

computenode [jobid]: [jobid.batch][jobid]: pam_mklink(slurm:session): Failed to load module

computenode [jobid]: [jobid.batch][jobid]: pam_unix(slurm:session): open_session - error recovering username

computenode slurmstepd[jobid]: error: pam_open_session: User not known to the underlying authentication module

computenode slurmstepd[jobid]: error: error in pam_setup

computenode slurmstepd[jobid]: error: job_manager exiting abnormally, rc = 4020

computenode slurmstepd[jobid]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

computenode slurmstepd[jobid]: done with job

Source investigation showed these changes:

job_scheduler.c:
job_ptr->user_name is set only if (slurmctld_config.send_groups_in_cred)

controller.c:
slurmctld_config.send_groups_in_cred is true only if xstrcasestr(slurmctld_conf.launch_params, "send_gids")

So if slurmctld_conf.launch_params does not contain send_gids,
job_ptr->user_name remains NULL.
Since we use PAM authentication, sending user_name is mandatory for us.

We have applied the workaround of setting
LaunchParameters=send_gids in our controller's slurm.conf.
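For reference, the workaround is a one-line setting in the controller's slurm.conf (path and any co-existing LaunchParameters values will vary per site; slurmctld must be restarted or reconfigured afterwards):

```
# /etc/slurm/slurm.conf on the slurmctld host (path is site-specific)
# Makes slurmctld resolve the username/gids and send them in the launch
# message, so slurmstepd does not depend on a local lookup succeeding.
LaunchParameters=send_gids
```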

The problem still seems to be present in rc3.

********************************************************************************

slurm-17.11.0-0rc2/src/slurmctld/controller.c
if (xstrcasestr(slurmctld_conf.launch_params, "send_gids"))
                slurmctld_config.send_groups_in_cred = true;
        else
                slurmctld_config.send_groups_in_cred = false;

slurm-17.11.0-0rc2/src/slurmctld/job_scheduler.c
/* Given a scheduled job, return a pointer to its batch_job_launch_msg_t data */
static batch_job_launch_msg_t *_build_launch_job_msg(struct job_record *job_ptr,
                                                     uint16_t protocol_version)
{
        batch_job_launch_msg_t *launch_msg_ptr;

        /* Initialization of data structures */
        launch_msg_ptr = (batch_job_launch_msg_t *)
                                xmalloc(sizeof(batch_job_launch_msg_t));
        launch_msg_ptr->job_id = job_ptr->job_id;
        launch_msg_ptr->step_id = NO_VAL;
        launch_msg_ptr->array_job_id = job_ptr->array_job_id;
        launch_msg_ptr->array_task_id = job_ptr->array_task_id;
        launch_msg_ptr->uid = job_ptr->user_id;
        launch_msg_ptr->gid = job_ptr->group_id;

        if (slurmctld_config.send_groups_in_cred) {
                /* fill in the job_record field if not yet filled in */
Comment 1 Regine Gaudin 2017-11-23 08:47:18 MST
This was previously incomplete:

slurm-17.11.0-0rc2/src/slurmctld/job_scheduler.c
/* Given a scheduled job, return a pointer to its batch_job_launch_msg_t data */
static batch_job_launch_msg_t *_build_launch_job_msg(struct job_record *job_ptr,
                                                     uint16_t protocol_version)
{
        batch_job_launch_msg_t *launch_msg_ptr;

        /* Initialization of data structures */
        launch_msg_ptr = (batch_job_launch_msg_t *)
                                xmalloc(sizeof(batch_job_launch_msg_t));
        launch_msg_ptr->job_id = job_ptr->job_id;
        launch_msg_ptr->step_id = NO_VAL;
        launch_msg_ptr->array_job_id = job_ptr->array_job_id;
        launch_msg_ptr->array_task_id = job_ptr->array_task_id;
        launch_msg_ptr->uid = job_ptr->user_id;
        launch_msg_ptr->gid = job_ptr->group_id;

        if (slurmctld_config.send_groups_in_cred) {
                /* fill in the job_record field if not yet filled in */
                if (!job_ptr->user_name)
                        job_ptr->user_name = uid_to_string_or_null(job_ptr->user_id);
Comment 2 Tim Wickberg 2017-11-23 12:34:54 MST
Not sending the username as part of the launch message is an intentional choice; without send_gids set, the slurmd process is supposed to fill in that field itself. It appears that, for the batch step, I omitted the code to handle that, and will need to fix it.
Comment 3 René Oertel 2017-12-05 04:15:05 MST
We hit the same bug, with a slightly different error message, on 17.11.0 with UsePAM=1. The importance should be raised. Thank you.

slurmd.log:

[2017-12-05T09:59:05.074] _run_prolog: run job script took usec=5
[2017-12-05T09:59:05.074] _run_prolog: prolog with lock for job 68570 ran for 0 seconds
[2017-12-05T09:59:05.142] [68570.extern] task/cgroup: /slurm/uid_20005/job_68570: alloc=512MB mem.limit=512MB memsw.limit=512MB
[2017-12-05T09:59:05.142] [68570.extern] task/cgroup: /slurm/uid_20005/job_68570/step_extern: alloc=512MB mem.limit=512MB memsw.limit=512MB
[2017-12-05T09:59:05.280] Launching batch job 68570 for UID 20005
[2017-12-05T09:59:05.296] [68570.batch] task/cgroup: /slurm/uid_20005/job_68570: alloc=512MB mem.limit=512MB memsw.limit=512MB
[2017-12-05T09:59:05.297] [68570.batch] task/cgroup: /slurm/uid_20005/job_68570/step_batch: alloc=512MB mem.limit=512MB memsw.limit=512MB
[2017-12-05T09:59:05.301] [68570.batch] error: pam_open_session: Cannot make/remove an entry for the specified session
[2017-12-05T09:59:05.301] [68570.batch] error: error in pam_setup
[2017-12-05T09:59:05.322] [68570.batch] error: job_manager exiting abnormally, rc = 4020
[2017-12-05T09:59:05.322] [68570.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
[2017-12-05T09:59:05.324] [68570.batch] done with job
[2017-12-05T09:59:05.376] [68570.extern] done with job

slurmctld.log:
[2017-12-05T09:59:04.657] _slurm_rpc_submit_batch_job: JobId=68570 InitPrio=29879 usec=1944
[2017-12-05T09:59:05.072] backfill: Started JobID=68570 in short on cstd01-002
[2017-12-05T09:59:05.323] error: slurmd error running JobId=68570 on node(s)=cstd01-002: Slurmd could not execve job
[2017-12-05T09:59:05.323] drain_nodes: node cstd01-002 state set to DRAIN
[2017-12-05T09:59:05.323] _job_complete: JobID=68570 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2017-12-05T09:59:05.323] _job_complete: JobID=68570 State=0x8003 NodeCnt=1 done
Comment 4 Tim Wickberg 2017-12-05 09:27:38 MST
Hi René -

Higher severity levels are only available to SchedMD customers. I am resetting this back to Sev3.

If you have any questions about this please feel free to get in touch directly.

- Tim
Comment 5 Tim Wickberg 2017-12-12 22:18:07 MST
*** Bug 4510 has been marked as a duplicate of this bug. ***
Comment 8 Tim Wickberg 2017-12-20 10:28:09 MST
This is fixed in commit 45be4cc0d10, and will be in 17.11.1 which we anticipate releasing later today.

LaunchParameters=send_gids is still strongly suggested; if that option is used then this fix is not needed in most cases.

- Tim