To whom it may concern,

I have found something odd about Slurm. The groups that a user belongs to disappear, to varying degrees, after batch jobs are submitted with sbatch and run on a compute node. Take one user as an example. On our login node, this user belongs to the following groups:

[chen.ruiqi@login02 ~]$ id chen.ruiqi |grep nrg-mirrir-ukb-neuro
uid=2005565(chen.ruiqi) gid=1000070(domain users) groups=1000070(domain users),1193110(la_papercut),1000363(students_artsci_pri),1359173(nrg-mirrir-ukb-neuro) ...

Once he submits an interactive job and is assigned to a compute node, node18:

[chen.ruiqi@node18 ~]$ id chen.ruiqi |grep nrg-mirrir-ukb-neuro

The particular group we are interested in is gone. This prevents him from accessing the directory that this group can access. However, if he SSHes directly into node18 and runs the id command, he sees the same information as on the login node. This is happening not to one user but to many. Any idea what is happening here? Much appreciated!

Best,
Xing

Following is my Slurm configuration.
ClusterName=chpc3
ControlMachine=mgt
ControlAddr=mgt.cluster
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PrologFlags=x11
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
Prolog=/opt/slurm/prologue
Epilog=/opt/slurm/epilogue
JobSubmitPlugins=lua
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top
#SchedulerAuth=
### These two options will 'pack' jobs on nodes - Xing 03/19/21
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
###
PriorityType=priority/multifactor
PriorityDecayHalfLife=14
PriorityUsageResetPeriod=Monthly
# Fairshare Factor
PriorityWeightFairshare=10000
# Age Factor
PriorityWeightAge=5000
PriorityMaxAge=7-0
# Job Factor
PriorityFavorSmall=YES
PriorityWeightJobSize=2000
# Partition Factor
PriorityWeightPartition=1000
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mgt.cluster
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStorageTRES=cpu,mem,gres/gpu
AccountingStorageUser=slurm
AccountingStorageEnforce=limits,qos
#
# COMPUTE NODES
AccountingStoreFlags=job_comment
GresTypes=gpu,vmem
# == Generated by ClusterVisor plugin 'slurmserver' ==
# plugin slurmclient nodes
### Add weights to nodes for the priority of job scheduling purposes.
### The higher weights, the lower prioroty. - Xing 03/22/21
NodeName=node[01-14] CoresPerSocket=16 Sockets=2 RealMemory=770000 Weight=200 State=UNKNOWN
NodeName=node[15-32] CoresPerSocket=16 Sockets=2 RealMemory=385000 Weight=100 State=UNKNOWN
NodeName=gpu01 CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=2000 State=UNKNOWN Gres=gpu:tesla_a100:4,vmem:40gb:4
NodeName=gpu02 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=1500 State=UNKNOWN Gres=gpu:tesla_v100S:4,vmem:32gb:4
NodeName=gpu03 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=1500 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32gb:2
NodeName=gpu[04-05] CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=1200 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32gb:2
NodeName=gpu06 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100:4,vmem:32gb:4
NodeName=gpu07 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100:3,vmem:32gb:3
NodeName=gpu08 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=600 State=UNKNOWN Gres=gpu:tesla_t4:2,vmem:15gb:2
NodeName=highmem01 CoresPerSocket=18 RealMemory=2984000 Sockets=2 Weight=500 State=UNKNOWN
NodeName=highmem02 CoresPerSocket=12 RealMemory=2984000 Sockets=4 Weight=400 State=UNKNOWN
# plugin slurmserver partitions
### Created new partitions to better allocate jobs. - Xing 05/05/21
PartitionName=test State=UP Default=True Nodes=node[01-02] DefaultTime=5 MaxTime=240 DefMemPerCPU=6000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=2 Priority=100
PartitionName=small State=UP Default=False Nodes=node[03,07-32] DefaultTime=5 MaxTime=1440 DefMemPerCPU=6000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=1 Priority=80
PartitionName=medium State=UP Default=False Nodes=node[03-32] DefaultTime=5 MaxTime=10080 DefMemPerCPU=12000 MaxCPUsPerNode=32 MinNodes=1 MaxNodes=4 Priority=60
PartitionName=large State=UP Default=False Nodes=node[01-02] DefaultTime=5 MaxTime=10080 DefMemPerCPU=12000 MaxCPUsPerNode=32 MinNodes=4 MaxNodes=8 Priority=40
PartitionName=gpu State=UP Default=False Nodes=gpu[01-08] DefaultTime=5 MaxTime=10080 DefMemPerCPU=6000 MinNodes=1 Priority=60
PartitionName=highmem State=UP Default=False Nodes=highmem[01-02] DefaultTime=5 MaxTime=10080 DefMemPerCPU=48000 MinNodes=1 MaxNodes=1 Priority=40
# == Generated by ClusterVisor plugin 'slurmserver' ==
Hi Xing, How many groups is this user a member of? Based on those groups, it looks like you are authenticating against a windows domain, are you using sssd to accomplish this? If so, do you have "Enumerate=yes" set in /etc/sssd/sssd.conf for the domain? At the moment it looks like we aren't getting the full list of groups internally, but I'm looking for some other options as well. Thanks! --Tim
Hi Tim,

This user is a member of 54 groups, but this is different for each user. Another user is a member of 76 groups so this can be quite large. We are using SSSD to authenticate against a Windows Active Directory domain. We are not using the Enumerate=yes option, however, everything appears to be working properly outside of SLURM. It's only inside of a SLURM job where the group list appears to be truncated.

Best,
Xing
________________________________
From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Wednesday, January 19, 2022 7:10 AM
To: Huang, Xing <x.huang@wustl.edu>
Subject: [Bug 13217] missing group users has on login node

* External Email - Caution *

Comment # 1<https://bugs.schedmd.com/show_bug.cgi?id=13217#c1> on bug 13217<https://bugs.schedmd.com/show_bug.cgi?id=13217> from Tim McMullan<mailto:mcmullan@schedmd.com>
________________________________
You are receiving this mail because: * You reported the bug.
________________________________
The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.
(In reply to Xing Huang from comment #2)
> Hi Tim,
>
> This user is a member of 54 groups, but this is different for each user.
> Another user is a member of 76 groups so this can be quite large.

OK, those are somewhat long lists, but looking at how we handle this and your configuration, I don't expect this number of groups to be an issue.

> We are using SSSD to authenticate against a Windows Active Directory domain.
> We are not using the
> Enumerate=yes
> option, however, everything appears to be working properly outside of SLURM.
> It's only inside of a SLURM job where the group list appears to be truncated.

Slurm and (for example) id necessarily pick up group lists in different ways. There are on-and-off reports of problems when enumeration is disabled, which may or may not apply to you in this case.

What version of SSSD are you currently running? Is enabling enumeration something you could try, to see if the issue is resolved with that setting enabled?

Thanks!
--Tim
(In reply to Tim McMullan from comment #3; notification sent Thursday, January 20, 2022 12:28 PM)
Tim,

We tried and it did not fix the issue.

Best,
Xing
Hi Xing,

Are there any other similarities between the groups that are missing? Are they groups the users were recently added to? Is it the same missing group for multiple people?

Would you also provide an example of how they are starting the interactive session?

Thanks!
--Tim
(In reply to Tim McMullan from comment #5; notification sent Monday, January 24, 2022 7:32 AM)
Tim,

Below is a comparison of the IDs reported from ACL before and after launching an interactive job in Slurm on the same node. The file with slurm in its name was captured inside the interactive job on the assigned node, while the file with ssh in its name was captured after we SSHed directly into that same node as the normal user. The example below covers two users on our cluster.

[chen.ruiqi@node17 chen.ruiqi]$ diff id_ssh_node17 id_slurm_node17
30a31
> 1220779
37a39
> 1255818
53,54d54
< 1359170(nrg-mirrir-biobank)
< 1359173(nrg-mirrir-ukb-neuro)

[janine.bijsterbosch@node15 ~]$ diff /tmp/id_ssh_node15 /tmp/id_slurm_node15
35a36
> 1220779
50a52,54
> 1255817
> 1255818
> 1255819
56a61
> 1304581
63a69
> 1336310(wuit_eus_9999_user_securew2certificate_targeted)
67,71d72
< 1359170(nrg-mirrir-biobank)
< 1359171(nrg-mirrir-ukb-cardiac)
< 1359172(nrg-mirrir-ukb-genomic)
< 1359173(nrg-mirrir-ukb-neuro)
< 1359174(nrg-mirrir-ukb-pheno)

So it seems the NRG groups are the ones most likely to be missing from SLURM. These are the very ones that we need to control access to the storage.

I'm first launching an interactive job using srun:

[janine.bijsterbosch@login01 ~]$ srun -N 1 -n 1 --nodelist node15 --mem 100M --time=00:20:00 --pty bash

Then I SSH into the node as the user and compare the results of the id command. In this way we're doing the comparison on the very same node.

We have no way of knowing the order in which the user was added to the various groups. It does seem that SLURM is 'masking' out some of the groups, though.

Just let me know if I can provide anything else to help with the debugging.
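The SSH-versus-Slurm comparison described above can also be done on the raw `id` output instead of with `diff`. The following is a minimal Python sketch, not a tool used in this thread: the helper names `parse_id_groups` and `compare_groups` are hypothetical, the sample strings are abbreviated from the node17 diff, and it assumes group names contain no commas and that nothing follows the `groups=` field.

```python
import re

# One comma-separated entry of the groups= field: "gid(name)", or a bare
# "gid" when the name could not be resolved on that host.
ENTRY_RE = re.compile(r"(\d+)(?:\((.*)\))?")


def parse_id_groups(id_output):
    """Return {gid: name or None} parsed from one line of `id` output."""
    _, found, groups_field = id_output.partition("groups=")
    if not found:
        return {}
    groups = {}
    for entry in groups_field.strip().split(","):
        match = ENTRY_RE.fullmatch(entry.strip())
        if match:
            # group(2) is None when the entry had no "(name)" part.
            groups[int(match.group(1))] = match.group(2)
    return groups


def compare_groups(ssh_output, slurm_output):
    """Return (only_in_ssh, only_in_slurm) as dicts keyed by GID."""
    ssh = parse_id_groups(ssh_output)
    slurm = parse_id_groups(slurm_output)
    only_in_ssh = {gid: name for gid, name in ssh.items() if gid not in slurm}
    only_in_slurm = {gid: name for gid, name in slurm.items() if gid not in ssh}
    return only_in_ssh, only_in_slurm


# Abbreviated samples built from the node17 diff in this thread; the real
# group lists are much longer.
ssh_sample = ("uid=2005565(chen.ruiqi) gid=1000070(domain users) "
              "groups=1000070(domain users),1359170(nrg-mirrir-biobank),"
              "1359173(nrg-mirrir-ukb-neuro)")
slurm_sample = ("uid=2005565(chen.ruiqi) gid=1000070(domain users) "
                "groups=1000070(domain users),1220779,1255818")

only_in_ssh, only_in_slurm = compare_groups(ssh_sample, slurm_sample)
print("only in SSH session:", only_in_ssh)
print("only in Slurm job:", only_in_slurm)
```

Keying the comparison on the numeric GID rather than the name separates two cases that look the same in a name-only check: a group that is absent, and a GID that is present but has no name on that host.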
Thank you for the additional information, I'm doing some more digging to see what might go wrong. Something I'd like to try as a debugging step is to add "LaunchParameters=disable_send_gids" to your slurm.conf. This should force the groups to come from a more local lookup instead of from the ctld. If this does fix the issue it will narrow down the places that the error likely is being introduced. Thanks! --Tim
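Before retesting, it can help to confirm that the option really is active in the file and not commented out. The sketch below is a hypothetical Python helper (`launch_parameters` is not part of Slurm) that reads slurm.conf text and returns the values set on `LaunchParameters`. It assumes one `Key=Value` setting per line and `#` comments; the sample text is illustrative, not the reporter's real file.

```python
def launch_parameters(conf_text):
    """Return the set of LaunchParameters values found in slurm.conf text."""
    values = set()
    for raw_line in conf_text.splitlines():
        # Drop everything after a '#', so commented-out settings are ignored.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "launchparameters":
            values.update(v.strip() for v in value.split(",") if v.strip())
    return values


# Illustrative sample, not the reporter's real file.
sample = """\
ProctrackType=proctrack/cgroup
#LaunchParameters=some_old_value
LaunchParameters=disable_send_gids
"""

print("disable_send_gids" in launch_parameters(sample))
```

This only inspects the file on disk. The configuration the running daemons actually use is what `scontrol show config` reports.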
(In reply to Tim McMullan from comment #7; notification sent Tuesday, January 25, 2022 10:13 AM)
Tim,

Thank you. I just did the test for one of the users, chen.ruiqi.

[chen.ruiqi@node15 tmp]$ diff id_ssh_node15_new id_slurm_node15_new

It looks like adding the parameter you suggested fixed the problem.

Best,
Xing
(In reply to Xing Huang from comment #8)
> Thank you. I just did the test for one of the users, chen.ruiqi.
> [chen.ruiqi@node15 tmp]$ diff id_ssh_node15_new id_slurm_node15_new
> It looks like adding the parameter you suggested fixed the problem.

Thanks for testing that! You can leave that option enabled for now, but I'd still like to track down why it wasn't working before. Having that option specified can generate more load on the domain controller since we do more lookup operations, but as long as you don't see problems you should be OK.

I'll let you know if I need any more information to help track that down!

Thanks again!
--Tim
Hi Xing, I've been looking around for the source of the error and one question has come to mind - is this only happening with interactive sessions? If you run a batch job with the same "id" command does that also return an incorrect group list? Thanks! --Tim
(In reply to Tim McMullan from comment #10; notification sent Monday, January 31, 2022 8:26 AM)
Tim,

Yes, we saw the issue in both cases.

Best,
Xing
Thanks for the clarification! I'm still looking into this. --Tim
(In reply to Tim McMullan from comment #12; notification sent Wednesday, February 2, 2022 7:56 AM)
Hi Tim,

Any progress on your side on this issue?

Best,
Xing
Hi Xing, I've been digging around for what might cause this, and so far it seems most likely that either the host the slurmctld is running on isn't returning the full list of groups, or somehow the group cache isn't getting flushed properly... which it really should be. I'm not seeing any changes in your config related to it, and by default the cache refreshes every 10 minutes. To confirm the settings in the controller, can you run "scontrol show config | grep GroupUpdate"? Would you mind running as root "groups $user" where $user is one of the users you were seeing missing groups with on the slurmctld node and seeing if that group list is complete? Thanks, --Tim
(In reply to Tim McMullan from comment #14; notification sent Thursday, February 10, 2022 7:27 AM)
Tim,

This is what I get when checking config setting on the mgt node.

[root@mgt slurm]# scontrol show config | grep GroupUpdate
GroupUpdateForce = 1
GroupUpdateTime = 600 sec

On the mgt node, using groups command gets the same result as using id command (from active directory). However, I remember the problem is not on the mgt node, but on the compute node after launching batch jobs or interactive jobs.

[root@mgt slurm]# groups chen.ruiqi
chen.ruiqi : domain users wuit_eus_9999_user_securew2certificate_targeted storage-jdquirk-small_animal_mr_facility-ro wuit_eus_9999_sccm_microsoft_office_mix wuit_eus_9999_netaccess_users_high idm-netaccess-high-clients-hc_cm storage-engineering-licenses-ro wuit_eus_2620_gp_dbbs_shortcut wuit_eus_2620_jss_appexclusion idm-staff-studentworkers-dbbs wuit_eus_2620_files_dbbs_list wuit_eus_9999_printing_access storage-wucci-visiopharm-rw wuit_eus_2620_jss_printers storage-dspencer-shared-ro storage-engineering-bin-ro storage-mcallawa-shared-ro storage-bga-site-locks-rw wuit-si-basicauth-bypass storage-wucci-scratch-rw wuit_eus_2620_printers storage-bga-gmsroot-ro wuitglobal require mfa wustlkey_active_users storage-bga-shared-ro storage-home1-home-ro nrg-mirrir-ukb-neuro students_artsci_pri nrg-mirrir-biobank sharepointauthonly storage-ris-sas-ro ad.adm.wukey.auth danforth_students crm stage access wustlkeystudents crm prod access crm test access compute-shinung storage-shinung wustlkeygroups cc_artsci_vphd univcreditonly wustlkeystaff wuit_eus_9999_sccm_microsoft_expression_encoder la_papercut la_students papercut students spwukey compute staff pwp2 wuit_eus_9999_mdm_users wuit_eus_2620_dbbs_all_users janine_bijsterbosch

[root@mgt slurm]# id chen.ruiqi
uid=2005565(chen.ruiqi) gid=1000070(domain users) groups=1000070(domain users),1336310(wuit_eus_9999_user_securew2certificate_targeted),1208168(storage-jdquirk-small_animal_mr_facility-ro),1022021(wuit_eus_9999_sccm_microsoft_office_mix),1189246(wuit_eus_9999_netaccess_users_high),1259428(idm-netaccess-high-clients-hc_cm),1358928(storage-engineering-licenses-ro),1228112(wuit_eus_2620_gp_dbbs_shortcut),1228113(wuit_eus_2620_jss_appexclusion),1314237(idm-staff-studentworkers-dbbs),1228111(wuit_eus_2620_files_dbbs_list),1021875(wuit_eus_9999_printing_access),1305070(storage-wucci-visiopharm-rw),1228114(wuit_eus_2620_jss_printers),1327201(storage-dspencer-shared-ro),1358962(storage-engineering-bin-ro),1304906(storage-mcallawa-shared-ro),1304616(storage-bga-site-locks-rw),1358910(wuit-si-basicauth-bypass),1305068(storage-wucci-scratch-rw),1228115(wuit_eus_2620_printers),1304231(storage-bga-gmsroot-ro),1000996(wuitglobal require mfa),1004319(wustlkey_active_users),1304619(storage-bga-shared-ro),1254277(storage-home1-home-ro),1255818(nrg-mirrir-ukb-neuro),1000363(students_artsci_pri),1220779(nrg-mirrir-biobank),1182034(sharepointauthonly),1313191(storage-ris-sas-ro),1002932(ad.adm.wukey.auth),1000123(danforth_students),1201356(crm stage access),1181899(wustlkeystudents),1201355(crm prod access),1201357(crm test access),1356700(compute-shinung),1204283(storage-shinung),1182075(wustlkeygroups),1000248(cc_artsci_vphd),1000050(univcreditonly),1181924(wustlkeystaff),1021904(wuit_eus_9999_sccm_microsoft_expression_encoder),1193110(la_papercut),1000030(la_students),1193107(papercut),1000009(students),1004164(spwukey),1208826(compute),1000007(staff),1000083(pwp2),1022213(wuit_eus_9999_mdm_users),1228107(wuit_eus_2620_dbbs_all_users),1012(janine_bijsterbosch)

Best,
Xing
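The two `GroupUpdate` values reported from the mgt node can be read programmatically when the same check is repeated on several clusters. This is a small Python sketch under stated assumptions: `parse_config_lines` is a hypothetical helper, and the sample text is the two lines of `scontrol show config | grep GroupUpdate` output quoted in this thread.

```python
def parse_config_lines(text):
    """Parse 'Key = value' lines, as printed by scontrol, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Output copied from the mgt node in this thread.
sample = """\
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
"""

settings = parse_config_lines(sample)
# The value is printed as "<number> sec"; keep only the number.
refresh_seconds = int(settings["GroupUpdateTime"].split()[0])
print(f"group cache refresh interval: {refresh_seconds // 60} minutes")
```

For the values above this gives 600 seconds, which is the 10 minute default refresh interval Tim mentions.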
(In reply to Xing Huang from comment #15)
> Tim,
>
> This is what I get when checking config setting on the mgt node.
> [root@mgt slurm]# scontrol show config | grep GroupUpdate
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
>
> On the mgt node, using groups command gets the same result as using id
> command (from active directory). However, I remember the problem is not on
> the mgt node, but on the compute node after launching batch jobs or
> interactive jobs.

Thank you! Yes, I'm aware that the issue appears on the nodes; however, the ctld is involved in the actual user lookup and sends the gids it thinks the user has along with the job (which is the feature we disabled to handle the problem). I wanted to make sure that the ctld and the compute/front end nodes are all giving us the same group list from the system. If the output on the ctld and the nodes matches, it's more likely that something is happening in the ctld itself.

I'm continuing to look for a source of the problem!

Thanks,
--Tim
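One way to collect a comparable group list on the ctld host and on each node is to ask the C library directly and print one `gid(name)` entry per line, so the files can be diffed across hosts. The sketch below is not Slurm's internal lookup code. It uses Python's `os.getgrouplist`, and the helper names `group_lines` and `local_group_report` are hypothetical.

```python
import grp
import os
import pwd


def group_lines(user, primary_gid):
    """Return sorted "gid(name)" strings for the user's groups on this host."""
    lines = []
    for gid in sorted(os.getgrouplist(user, primary_gid)):
        try:
            lines.append(f"{gid}({grp.getgrgid(gid).gr_name})")
        except KeyError:
            # The GID has no name on this host; print it bare, as `id` does.
            lines.append(str(gid))
    return lines


def local_group_report(user):
    """Look up the user's primary GID on this host, then list the groups."""
    entry = pwd.getpwnam(user)
    return group_lines(entry.pw_name, entry.pw_gid)
```

Running `print("\n".join(local_group_report("chen.ruiqi")))` on the mgt node and on a compute node, then diffing the two outputs, would show whether the hosts agree on both the GIDs and the names.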
(In reply to Tim McMullan from comment #16; notification sent Thursday, February 10, 2022 9:42 AM)
Tim,

I know it will take time to find the best solution for the bug I reported. However, this is preventing our users from using our queuing system and is affecting their research projects. It has been a month since I reported the issue. Is there a temporary solution I can implement until we find the ultimate solution? Meanwhile, is it possible to escalate the severity of the ticket?

Thanks for your time and help!

Best,
Xing
Hey Xing, I thought that adding "LaunchParameters=disable_send_gids" had fixed the problem? Are you not running with that now? If you aren't, please do run with it; I had assumed it was left in place since it seemed to fix the issue.
Tim, Yes, I did try it, and it worked at the time. However, after the test, you asked me to comment out this parameter, and you would continue to dig into the root cause of the problem. Is this a temporary solution or a permanent fix? Best, Xing
(In reply to Xing Huang from comment #19)
> Tim,
>
> Yes, I did try it and it once worked. However, after the test, you asked me
> to comment out this parameter and you would continue dig out the root cause
> for the problem.
> Is this a temporary solution or a permanent fix?

I'm so sorry I wasn't clear on this! My intentions on this are as follows:

If running with "LaunchParameters=disable_send_gids" is working, I'm happy for you to be running with that option. I would like to understand why that option is necessary in your environment. It doesn't seem like it should be, but apparently it is... however, I don't want you to be in a broken state until we figure that out. As you say, it can take some time.

If you are able to work with me for a while on why it's required, I'd certainly appreciate it. I might ask you to disable it and run with a debugging patch or try some other settings, then re-enable it if things don't work. If you have a test system that exhibits the same behavior, that's much better for testing when I can't reproduce the issue myself.

Thanks! --Tim
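Because the option gets switched off and on during debugging, it helps to confirm whether it is currently active in the configuration file. The following is a minimal sketch under assumptions: `has_disable_send_gids` is a hypothetical helper, and the `/etc/slurm/slurm.conf` path in the usage comment is a guess at a typical location, not something stated in this ticket.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if the given slurm.conf
# has an uncommented LaunchParameters line that includes disable_send_gids.
# The match is case-insensitive and allows other comma-separated values.
has_disable_send_gids() {
    grep -Eqi '^[[:space:]]*LaunchParameters=([^#]*,)?disable_send_gids([,[:space:]#]|$)' "$1"
}

# Example usage (the path is an assumption; adjust for your site):
#   if has_disable_send_gids /etc/slurm/slurm.conf; then
#       echo "disable_send_gids is set"
#   else
#       echo "disable_send_gids is NOT set"
#   fi
```

This only inspects the file on disk. To see what the running controller uses, `scontrol show config | grep -i LaunchParameters` on the slurmctld host is probably the more direct check.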
I just wanted to reach out and see whether you have re-added "LaunchParameters=disable_send_gids" and whether it is still working for you. Thanks, --Tim
Tim, Thanks for reaching out to me! We're good now with this option added. If you want to close the ticket, please go ahead. Again, thanks a lot for your help. Best, Xing
Thank you Xing! I'm glad to hear that everything is working with that option in place. I'll resolve this now, thanks again! --Tim