Bug 13217

Summary:	Groups a user has on the login node are missing on compute nodes
Product:	Slurm	Reporter:	Xing Huang <x.huang>
Component:	Other	Assignee:	Tim McMullan <mcmullan>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	21.08.2
Hardware:	Linux
OS:	Linux
Site:	WA St. Louis

Description Xing Huang 2022-01-18 14:25:27 MST

To whom it may concern,

I have run into something odd with Slurm.
Some of the groups a user belongs to disappear, to varying degrees, once a job is submitted (for example with sbatch) and runs on a compute node.

Take one user as an example.
On our login node, this user belongs to the following groups:
[chen.ruiqi@login02 ~]$ id chen.ruiqi |grep nrg-mirrir-ukb-neuro
uid=2005565(chen.ruiqi) gid=1000070(domain users) groups=1000070(domain users),1193110(la_papercut),1000363(students_artsci_pri),1359173(nrg-mirrir-ukb-neuro) ...
Once he submits an interactive job and is assigned to compute node node18:
[chen.ruiqi@node18 ~]$ id chen.ruiqi |grep nrg-mirrir-ukb-neuro
The particular group we are interested in is gone.
This prevents him from accessing the directory that this group can access.
However, if he SSHes directly into node18 and runs the id command, he sees the same information as on the login node.

This is happening not to one user, but to many.

Any idea what is happening here? Much appreciated!

Best,
Xing

My Slurm configuration follows.

ClusterName=chpc3
ControlMachine=mgt
ControlAddr=mgt.cluster
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PrologFlags=x11
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
Prolog=/opt/slurm/prologue
Epilog=/opt/slurm/epilogue
JobSubmitPlugins=lua
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top
#SchedulerAuth=
### These two options will 'pack' jobs on nodes - Xing 03/19/21
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
###
PriorityType=priority/multifactor
PriorityDecayHalfLife=14
PriorityUsageResetPeriod=Monthly
# Fairshare Factor
PriorityWeightFairshare=10000
# Age Factor
PriorityWeightAge=5000
PriorityMaxAge=7-0
# Job Factor
PriorityFavorSmall=YES
PriorityWeightJobSize=2000
# Partition Factor
PriorityWeightPartition=1000
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mgt.cluster
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStorageTRES=cpu,mem,gres/gpu
AccountingStorageUser=slurm
AccountingStorageEnforce=limits,qos
#
# COMPUTE NODES
AccountingStoreFlags=job_comment
GresTypes=gpu,vmem

# == Generated by ClusterVisor plugin 'slurmserver' ==

# plugin slurmclient nodes
### Add weights to nodes for the priority of job scheduling purposes.
### The higher the weight, the lower the priority. - Xing 03/22/21
NodeName=node[01-14] CoresPerSocket=16 Sockets=2 RealMemory=770000 Weight=200 State=UNKNOWN
NodeName=node[15-32] CoresPerSocket=16 Sockets=2 RealMemory=385000 Weight=100 State=UNKNOWN
NodeName=gpu01 CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=2000 State=UNKNOWN Gres=gpu:tesla_a100:4,vmem:40gb:4
NodeName=gpu02 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=1500 State=UNKNOWN Gres=gpu:tesla_v100S:4,vmem:32gb:4
NodeName=gpu03 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=1500 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32gb:2
NodeName=gpu[04-05] CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=1200 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32gb:2
NodeName=gpu06 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100:4,vmem:32gb:4
NodeName=gpu07 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100:3,vmem:32gb:3
NodeName=gpu08 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=600 State=UNKNOWN Gres=gpu:tesla_t4:2,vmem:15gb:2
NodeName=highmem01 CoresPerSocket=18 RealMemory=2984000 Sockets=2 Weight=500 State=UNKNOWN
NodeName=highmem02 CoresPerSocket=12 RealMemory=2984000 Sockets=4 Weight=400 State=UNKNOWN

# plugin slurmserver partitions
### Created new partitions to better allocate jobs. - Xing 05/05/21
PartitionName=test State=UP Default=True Nodes=node[01-02] DefaultTime=5 MaxTime=240 DefMemPerCPU=6000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=2 Priority=100
PartitionName=small State=UP Default=False Nodes=node[03,07-32] DefaultTime=5 MaxTime=1440 DefMemPerCPU=6000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=1 Priority=80
PartitionName=medium State=UP Default=False Nodes=node[03-32] DefaultTime=5 MaxTime=10080 DefMemPerCPU=12000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=4 Priority=60
PartitionName=large State=UP Default=False Nodes=node[01-02] DefaultTime=5 MaxTime=10080 DefMemPerCPU=12000 MaxCPUsPerNode=32 MinNodes=4 MaxNodes=8 Priority=40
PartitionName=gpu State=UP Default=False Nodes=gpu[01-08] DefaultTime=5 MaxTime=10080 DefMemPerCPU=6000 MinNodes=1 Priority=60
PartitionName=highmem State=UP Default=False Nodes=highmem[01-02] DefaultTime=5 MaxTime=10080 DefMemPerCPU=48000 MinNodes=1 MaxNodes=1 Priority=40
# == Generated by ClusterVisor plugin 'slurmserver' ==

Comment 1 Tim McMullan 2022-01-19 06:10:30 MST

Hi Xing,

How many groups is this user a member of?  Based on those groups, it looks like you are authenticating against a Windows domain.  Are you using sssd to accomplish this?  If so, do you have "Enumerate=yes" set in /etc/sssd/sssd.conf for the domain?

At the moment it looks like we aren't getting the full list of groups internally, but I'm looking for some other options as well.

Thanks!
--Tim

Comment 2 Xing Huang 2022-01-20 10:25:33 MST

Hi Tim,

This user is a member of 54 groups, but this is different for each user.  Another user is a member of 76 groups so this can be quite large.

We are using SSSD to authenticate against a Windows Active Directory domain.  We are not using the Enumerate=yes option; however, everything appears to be working properly outside of SLURM.  It's only inside of a SLURM job where the group list appears to be truncated.

Best,
Xing
________________________________
The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

Comment 3 Tim McMullan 2022-01-20 11:28:53 MST

(In reply to Xing Huang from comment #2)
> Hi Tim,
> 
> This user is a member of 54 groups, but this is different for each user. 
> Another user is a member of 76 groups so this can be quite large.

Ok, those are somewhat long lists but looking at how we handle this and your configuration, I don't expect this number of groups to be an issue.

> We are using SSSD to authenticate against a Windows Active Directory domain.
> We are not using the
> Enumerate=yes
> option, however, everything appears to be working properly outside of SLURM.
> It's only inside of a SLURM job where the group list appears to be truncated.

Slurm and (for example) id pick up group lists in different ways, by necessity.  There are on-and-off reports of problems when enumeration is disabled, which may or may not apply to your case.  What version of SSSD are you currently running?  Could you try enabling enumeration to see whether that resolves the issue?
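
To illustrate the general distinction (a minimal sketch, not Slurm's code): a running process carries the supplementary groups it was started with, while a lookup by user name asks the system's name service afresh. The function names below are hypothetical; only the standard `os` and `pwd` calls are real.

```python
import os
import pwd


def compare_gid_sets(process_gids, lookup_gids):
    """Return (missing, unexpected) as sorted lists.

    missing:    GIDs the name-service lookup reports but the process lacks.
    unexpected: GIDs the process carries but the lookup does not report.
    """
    process_gids, lookup_gids = set(process_gids), set(lookup_gids)
    return sorted(lookup_gids - process_gids), sorted(process_gids - lookup_gids)


def report_for_current_process():
    """Compare this process's credentials with a fresh lookup for its user."""
    entry = pwd.getpwuid(os.getuid())
    # Groups the kernel holds for this process (inherited from whatever launched it).
    process_gids = set(os.getgroups()) | {os.getgid()}
    # Fresh lookup through the configured name service (files, SSSD, ...).
    lookup_gids = set(os.getgrouplist(entry.pw_name, entry.pw_gid))
    return compare_gid_sets(process_gids, lookup_gids)
```

Calling `report_for_current_process()` once inside a job and once in an SSH session on the same node would show whether the two sources disagree in either environment.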

Thanks!
--Tim

Comment 4 Xing Huang 2022-01-20 11:54:40 MST

Tim,

We tried enabling enumeration and it did not fix the issue.

Best,
Xing

Comment 5 Tim McMullan 2022-01-24 06:32:22 MST

Hi Xing,

Are there any other similarities between the groups that are missing?  Are they groups the users were recently added to?  Is it the same missing group for multiple people?

Would you also provide an example of how they are starting the interactive session?

Thanks!
--Tim

Comment 6 Xing Huang 2022-01-24 11:11:54 MST

Tim,

Below is a comparison of the IDs reported from ACL before and after launching an interactive job in Slurm on the same node. The file with "slurm" in its name was captured inside an interactive job assigned to a particular node; the file with "ssh" in its name was captured after SSHing directly into that same node as a normal user.
The example below covers two users on our cluster.

[chen.ruiqi@node17 chen.ruiqi]$ diff id_ssh_node17 id_slurm_node17
30a31
> 1220779
37a39
> 1255818
53,54d54
< 1359170(nrg-mirrir-biobank)
< 1359173(nrg-mirrir-ukb-neuro)

[janine.bijsterbosch@node15 ~]$ diff /tmp/id_ssh_node15 /tmp/id_slurm_node15
35a36
> 1220779
50a52,54
> 1255817
> 1255818
> 1255819
56a61
> 1304581
63a69
> 1336310(wuit_eus_9999_user_securew2certificate_targeted)
67,71d72
< 1359170(nrg-mirrir-biobank)
< 1359171(nrg-mirrir-ukb-cardiac)
< 1359172(nrg-mirrir-ukb-genomic)
< 1359173(nrg-mirrir-ukb-neuro)
< 1359174(nrg-mirrir-ukb-pheno)

So it seems like the NRG groups are the ones most likely to be missing from SLURM.  These are the very ones that we need to control access to the storage.
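
The diff above can also be summarised programmatically. The following is a minimal sketch that assumes each file holds one `gid(name)` entry per line, as the diff suggests; the helper names are hypothetical. It separates entries seen only on one side and flags GIDs that were printed without a name.

```python
import re

# '1359173(name)' or a bare '1220779' with no resolved name.
ENTRY = re.compile(r"^(\d+)(?:\((.*)\))?$")


def parse_entry(line):
    """Split one entry into (gid, name_or_None)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    return int(match.group(1)), match.group(2)


def summarise(ssh_lines, slurm_lines):
    """Report which GIDs appear on only one side, and which lack a name."""
    ssh = dict(parse_entry(line) for line in ssh_lines)
    slurm = dict(parse_entry(line) for line in slurm_lines)
    return {
        "only_ssh": sorted(set(ssh) - set(slurm)),
        "only_slurm": sorted(set(slurm) - set(ssh)),
        "unnamed_in_slurm": sorted(g for g, name in slurm.items() if name is None),
    }


# Differing entries for chen.ruiqi, copied from the diff above.
ssh_only = ["1359170(nrg-mirrir-biobank)", "1359173(nrg-mirrir-ukb-neuro)"]
slurm_only = ["1220779", "1255818"]
result = summarise(ssh_only, slurm_only)
```

For this user, the entries present only inside the job are exactly the ones printed without a group name.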

I'm first launching an interactive job using srun:

[janine.bijsterbosch@login01 ~]$ srun -N 1 -n 1 --nodelist node15 --mem 100M --time=00:20:00 --pty bash

Then I will SSH into the node as the user, and compare the results of the id command.  In this way we're doing the comparison on the very same node.

We have no way of knowing what order in time that the user was added to the various groups.  It does seem that SLURM is  'masking' out some of the groups though.

Just let me know if I can provide anything else to help with the debugging.

Comment 7 Tim McMullan 2022-01-25 09:13:50 MST

Thank you for the additional information.  I'm doing some more digging to see what might be going wrong.

Something I'd like to try as a debugging step is to add "LaunchParameters=disable_send_gids" to your slurm.conf.  This should force the groups to come from a more local lookup instead of from the ctld.  If this fixes the issue, it will narrow down where the error is likely being introduced.
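
Before and after such a change it can help to confirm what the file actually contains. This is a small sketch that reads slurm.conf text and collects the LaunchParameters flags; the function name and sample text are hypothetical, and merging several LaunchParameters lines into one set is a simplification of this sketch, not a statement about how Slurm treats duplicates.

```python
def launch_parameters(conf_text):
    """Collect the comma-separated LaunchParameters flags from slurm.conf text."""
    flags = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        key, sep, value = line.partition("=")
        # Parameter names in slurm.conf are case insensitive.
        if sep and key.strip().lower() == "launchparameters":
            flags.update(f.strip() for f in value.split(",") if f.strip())
    return flags


# Hypothetical excerpt in the style of the configuration posted above.
sample = """\
ProctrackType=proctrack/cgroup
#LaunchParameters=
LaunchParameters=disable_send_gids
"""
enabled = "disable_send_gids" in launch_parameters(sample)
```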

Thanks!
--Tim

Comment 8 Xing Huang 2022-01-25 10:15:57 MST

Tim,

Thank you. I just did the test for one of the users, chen.ruiqi.
[chen.ruiqi@node15 tmp]$ diff id_ssh_node15_new id_slurm_node15_new
Looks like adding the parameter you suggested fixed the problem.

Best,
Xing

Comment 9 Tim McMullan 2022-01-25 10:20:59 MST

(In reply to Xing Huang from comment #8)
> Thank you. I just did the test for one of the users, chen.ruiqi.
> [chen.ruiqi@node15 tmp]$ diff id_ssh_node15_new id_slurm_node15_new
> Looks like adding the parameter you suggested fixed the problem.

Thanks for testing that! You can leave that option enabled for now, but I'd still like to track down why it wasn't working before.  Having that option specified can generate more load on the domain controller since we do more lookup operations, but as long as you don't see problems you should be OK.

I'll let you know if I need any more information to help track that down!

Thanks again!
--Tim

Comment 10 Tim McMullan 2022-01-31 07:26:36 MST

Hi Xing,

I've been looking around for the source of the error and one question has come to mind - is this only happening with interactive sessions?  If you run a batch job with the same "id" command does that also return an incorrect group list?

Thanks!
--Tim

Comment 11 Xing Huang 2022-01-31 14:55:16 MST

Tim,

Yes, we saw the issue in both cases.

Best,
Xing

Comment 12 Tim McMullan 2022-02-02 06:56:43 MST

Thanks for the clarification!  I'm still looking into this.
--Tim

Comment 13 Xing Huang 2022-02-09 12:23:52 MST

Hi Tim,

Any progress on your side on this issue?

Best,
Xing

Comment 14 Tim McMullan 2022-02-10 06:27:08 MST

Hi Xing,

I've been digging around for what might cause this, and so far it seems most likely that either the host the slurmctld is running on isn't returning the full list of groups, or somehow the group cache isn't getting flushed properly... which it really should be.  I'm not seeing any changes in your config related to it, and by default the cache refreshes every 10 minutes.

To confirm the settings in the controller, can you run "scontrol show config | grep GroupUpdate"?

Would you mind running "groups $user" as root on the slurmctld node, where $user is one of the users with missing groups, and checking whether that group list is complete?

Thanks,
--Tim

Comment 15 Xing Huang 2022-02-10 08:32:17 MST

Tim,

This is what I get when checking the config settings on the mgt node.
[root@mgt slurm]# scontrol show config | grep GroupUpdate
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
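
As a small cross-check of these values (a sketch only; the function name is hypothetical), the two lines can be parsed and the refresh interval converted to minutes:

```python
def parse_scontrol_config(text):
    """Turn 'Key = value' lines from `scontrol show config` into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# The two lines reported above.
output = """\
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
"""
settings = parse_scontrol_config(output)
refresh_seconds = int(settings["GroupUpdateTime"].split()[0])
refresh_minutes = refresh_seconds / 60
```

The 600-second interval corresponds to the 10-minute default refresh mentioned in comment 14.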

On the mgt node, the groups command gives the same result as the id command (from Active Directory). However, as I recall, the problem is not on the mgt node but on the compute nodes after launching batch or interactive jobs.
[root@mgt slurm]# groups chen.ruiqi
chen.ruiqi : domain users wuit_eus_9999_user_securew2certificate_targeted storage-jdquirk-small_animal_mr_facility-ro wuit_eus_9999_sccm_microsoft_office_mix wuit_eus_9999_netaccess_users_high idm-netaccess-high-clients-hc_cm storage-engineering-licenses-ro wuit_eus_2620_gp_dbbs_shortcut wuit_eus_2620_jss_appexclusion idm-staff-studentworkers-dbbs wuit_eus_2620_files_dbbs_list wuit_eus_9999_printing_access storage-wucci-visiopharm-rw wuit_eus_2620_jss_printers storage-dspencer-shared-ro storage-engineering-bin-ro storage-mcallawa-shared-ro storage-bga-site-locks-rw wuit-si-basicauth-bypass storage-wucci-scratch-rw wuit_eus_2620_printers storage-bga-gmsroot-ro wuitglobal require mfa wustlkey_active_users storage-bga-shared-ro storage-home1-home-ro nrg-mirrir-ukb-neuro students_artsci_pri nrg-mirrir-biobank sharepointauthonly storage-ris-sas-ro ad.adm.wukey.auth danforth_students crm stage access wustlkeystudents crm prod access crm test access compute-shinung storage-shinung wustlkeygroups cc_artsci_vphd univcreditonly wustlkeystaff wuit_eus_9999_sccm_microsoft_expression_encoder la_papercut la_students papercut students spwukey compute staff pwp2 wuit_eus_9999_mdm_users wuit_eus_2620_dbbs_all_users janine_bijsterbosch
[root@mgt slurm]# id chen.ruiqi
uid=2005565(chen.ruiqi) gid=1000070(domain users) groups=1000070(domain users),1336310(wuit_eus_9999_user_securew2certificate_targeted),1208168(storage-jdquirk-small_animal_mr_facility-ro),1022021(wuit_eus_9999_sccm_microsoft_office_mix),1189246(wuit_eus_9999_netaccess_users_high),1259428(idm-netaccess-high-clients-hc_cm),1358928(storage-engineering-licenses-ro),1228112(wuit_eus_2620_gp_dbbs_shortcut),1228113(wuit_eus_2620_jss_appexclusion),1314237(idm-staff-studentworkers-dbbs),1228111(wuit_eus_2620_files_dbbs_list),1021875(wuit_eus_9999_printing_access),1305070(storage-wucci-visiopharm-rw),1228114(wuit_eus_2620_jss_printers),1327201(storage-dspencer-shared-ro),1358962(storage-engineering-bin-ro),1304906(storage-mcallawa-shared-ro),1304616(storage-bga-site-locks-rw),1358910(wuit-si-basicauth-bypass),1305068(storage-wucci-scratch-rw),1228115(wuit_eus_2620_printers),1304231(storage-bga-gmsroot-ro),1000996(wuitglobal require mfa),1004319(wustlkey_active_users),1304619(storage-bga-shared-ro),1254277(storage-home1-home-ro),1255818(nrg-mirrir-ukb-neuro),1000363(students_artsci_pri),1220779(nrg-mirrir-biobank),1182034(sharepointauthonly),1313191(storage-ris-sas-ro),1002932(ad.adm.wukey.auth),1000123(danforth_students),1201356(crm stage access),1181899(wustlkeystudents),1201355(crm prod access),1201357(crm test access),1356700(compute-shinung),1204283(storage-shinung),1182075(wustlkeygroups),1000248(cc_artsci_vphd),1000050(univcreditonly),1181924(wustlkeystaff),1021904(wuit_eus_9999_sccm_microsoft_expression_encoder),1193110(la_papercut),1000030(la_students),1193107(papercut),1000009(students),1004164(spwukey),1208826(compute),1000007(staff),1000083(pwp2),1022213(wuit_eus_9999_mdm_users),1228107(wuit_eus_2620_dbbs_all_users),1012(janine_bijsterbosch)

Best,
Xing

Comment 16 Tim McMullan 2022-02-10 08:42:13 MST

(In reply to Xing Huang from comment #15)
> Tim,
> 
> This is what I get when checking config setting on the mgt node.
> [root@mgt slurm]# scontrol show config | grep GroupUpdate
> GroupUpdateForce        = 1
> GroupUpdateTime         = 600 sec
> 
> On the mgt node, using groups command gets the same result as using id
> command (from active directory). However, I remember the problem is not on
> the mgt node, but on the compute node after launching batch jobs or
> interactive jobs.

Thank you!  Yes, I'm aware that the issue appears on the nodes; however, the ctld is involved in the actual user lookup and sends the gids it thinks the user has along with the job (which is the feature we disabled to handle the problem).  I wanted to make sure that the ctld and the compute/front end nodes are all giving us the same group list from the system. If the output on the ctld and the nodes matches, it's more likely that something is happening in the ctld itself.
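
One way to run that comparison is by group name as well as by number. The sketch below is hedged: the helper names are hypothetical and the input strings are short excerpts copied from the outputs posted in this thread (the id output on mgt, and the login02 and node17 SSH sessions). It reports names that both sides know but map to different numeric GIDs. In these excerpts the two NRG groups differ between mgt and the nodes; whether that accounts for the missing groups is not established here.

```python
import re

# Matches 'gid(name)' pairs; names may contain spaces but not ')'.
PAIR = re.compile(r"(\d+)\(([^)]*)\)")


def name_to_gid(groups_text):
    """Map group name -> GID from the groups= part of `id` output."""
    return {name: int(gid) for gid, name in PAIR.findall(groups_text)}


def gid_mismatches(host_a, host_b):
    """Names known to both hosts whose numeric GID differs, as (gid_a, gid_b)."""
    a, b = name_to_gid(host_a), name_to_gid(host_b)
    return {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}


# Excerpts for chen.ruiqi taken from this thread.
controller = "1255818(nrg-mirrir-ukb-neuro),1220779(nrg-mirrir-biobank),1000363(students_artsci_pri)"
node_side = "1359170(nrg-mirrir-biobank),1359173(nrg-mirrir-ukb-neuro),1000363(students_artsci_pri)"
mismatches = gid_mismatches(controller, node_side)
```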

I'm continuing to look for a source of the problem!
Thanks,
--Tim

Comment 17 Xing Huang 2022-02-14 11:29:08 MST

Tim,

I know it will take time to find the best solution for the bug I reported. However, it is preventing our users from using our queuing system and is affecting their research projects. It has been a month since I reported the issue. Is there a temporary solution I can implement until we find the final one? Also, is it possible to escalate the severity of the ticket? Thanks for your time and help!

Best,
Xing

Comment 18 Tim McMullan 2022-02-14 11:32:46 MST

Hey Xing, I thought that adding "LaunchParameters=disable_send_gids" had fixed the problem?  Are you not running with that now?  If you aren't, please do run with it; I had assumed it was left in place since it seemed to fix the issue.
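
For anyone checking whether the option is actually active in a given slurm.conf, a minimal sketch along these lines may help. It only reads the configuration text; the function name `launch_parameters` and the sample file contents are illustrative assumptions, not anything taken from Slurm itself, and it assumes that the last uncommented `LaunchParameters` line is the one that counts.

```python
def launch_parameters(conf_text: str) -> list[str]:
    """Return the comma-separated LaunchParameters values found in slurm.conf text.

    Assumption: if the key appears more than once, the last uncommented
    occurrence is used. Parameter names are matched case-insensitively.
    """
    values: list[str] = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "launchparameters":
            values = [v.strip() for v in value.split(",") if v.strip()]
    return values


# Hypothetical excerpt; only ClusterName and AuthType come from the reported config.
SAMPLE_CONF = """\
ClusterName=chpc3
AuthType=auth/munge
#LaunchParameters=
LaunchParameters=disable_send_gids
"""

if __name__ == "__main__":
    enabled = "disable_send_gids" in launch_parameters(SAMPLE_CONF)
    print("disable_send_gids set:", enabled)
```

On a live system, the value the controller is actually using can also be checked with "scontrol show config | grep LaunchParameters", in the same way GroupUpdate was checked earlier in this ticket.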

Comment 19 Xing Huang 2022-02-14 11:39:07 MST

Tim,

Yes, I did try it, and it worked at the time. However, after the test you asked me to comment out this parameter while you continued to dig into the root cause of the problem.
Is this a temporary solution or a permanent fix?

Best,
Xing

Comment 20 Tim McMullan 2022-02-14 11:59:11 MST

(In reply to Xing Huang from comment #19)
> Tim,
> 
> Yes, I did try it, and it worked at the time. However, after the test you
> asked me to comment out this parameter while you continued to dig into the
> root cause of the problem.
> Is this a temporary solution or a permanent fix?

I'm so sorry I wasn't clear on this!  My intentions on this are as follows:

If running with "LaunchParameters=disable_send_gids" is working, I'm happy for you to be running with that option.

I would like to understand why that option is necessary in your environment. It doesn't seem like it should be, but apparently it is. However, I don't want you to be in a broken state until we figure that out.  As you say, it can take some time.

If you are able to work with me for a while on why it's required, I'd certainly appreciate it.  I might ask you to disable it and run with a debugging patch or try some other settings, then re-enable it if things don't work.  If you have a test system that exhibits the same behavior, that's much better for testing when I can't reproduce the issue myself.

Thanks!
--Tim
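
When testing with the option disabled, one way to collect evidence is to compare the `id` output from the login node with the `id` output captured inside a job. The sketch below is only an illustration of that comparison: the helper names `parse_id_groups` and `missing_groups` are made up for this example, the login-node line is abbreviated from the original report, and the in-job line is hypothetical, since the report only shows that the grep inside the job returned nothing.

```python
import re

# One entry of the "groups=" field: a gid, optionally followed by "(name)".
# Names may contain spaces, e.g. "domain users".
_ENTRY_RE = re.compile(r"\s*(\d+)(?:\(([^)]*)\))?")


def parse_id_groups(id_output: str) -> dict[int, str]:
    """Map gid -> group name from the "groups=" field of `id` output."""
    _, sep, groups_field = id_output.partition("groups=")
    if not sep:
        return {}
    groups: dict[int, str] = {}
    for entry in groups_field.split(","):
        match = _ENTRY_RE.match(entry)
        if match:
            groups[int(match.group(1))] = match.group(2) or ""
    return groups


def missing_groups(reference: str, observed: str) -> dict[int, str]:
    """Groups present in the reference `id` output but absent from the observed one."""
    ref = parse_id_groups(reference)
    obs = parse_id_groups(observed)
    return {gid: name for gid, name in ref.items() if gid not in obs}


# Abbreviated from the login-node output in the original report.
LOGIN = ("uid=2005565(chen.ruiqi) gid=1000070(domain users) "
         "groups=1000070(domain users),1193110(la_papercut),"
         "1000363(students_artsci_pri),1359173(nrg-mirrir-ukb-neuro)")

# Hypothetical in-job output, invented for illustration only.
IN_JOB = ("uid=2005565(chen.ruiqi) gid=1000070(domain users) "
          "groups=1000070(domain users),1193110(la_papercut),"
          "1000363(students_artsci_pri)")

if __name__ == "__main__":
    for gid, name in missing_groups(LOGIN, IN_JOB).items():
        print(f"missing in job: {gid} ({name})")
```

Running the same comparison against the output of "id" on the slurmctld host would show whether the controller's view of the user's groups already differs from the login node's.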

Comment 21 Tim McMullan 2022-02-17 06:59:06 MST

I just wanted to reach out and see whether you have re-added "LaunchParameters=disable_send_gids" and whether it is still working for you.

Thanks,
--Tim

Comment 22 Xing Huang 2022-02-17 08:56:02 MST

Tim,

Thanks for reaching out to me! We're good now with this option added.
If you want to close the ticket, please go ahead.
Again, thanks a lot for your help.

Best,
Xing

Comment 23 Tim McMullan 2022-02-17 14:38:55 MST

Thank you Xing!  I'm glad to hear that everything is working with that option in place.

I'll resolve this now, thanks again!
--Tim