Bug 2119 - User cannot submit jobs, invalid account reported
Summary: User cannot submit jobs, invalid account reported
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.11.7
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-11-08 10:22 MST by Gene Soudlenkov
Modified: 2018-11-14 13:37 MST
CC List: 2 users

See Also:
Site: University of Auckland
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Gene Soudlenkov 2015-11-08 10:22:52 MST
Dear Sir:

We have a problem with one of the accounts. The user already had 2 accounts with us and all his jobs were running OK. We added one more account and it does not work; every submission attempt ends with:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

The account list for the user is correct:

sacctmgr show assoc format=account,user,partition where user=fsuz133
             Account       User  Partition 
-------------------- ---------- ---------- 
           nesi00233    fsuz133            
            uoa00380    fsuz133            
            uoa00149    fsuz133            

The problem seems to be related to the accounts themselves, since none of the users associated with this account can submit. We just discovered yet another account with the same problem. All accounts are created through the same procedure, no exceptions, yet only two demonstrate the problem. Previously, restarting the Slurm daemon helped, but not this time.
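
To make the failure concrete, a failing submission looks roughly like this (the account and script names below are placeholders, not the literal commands we used):

sbatch --account=NEW_ACCOUNT myjob.sl
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Submitting the same script under the user's other two accounts works fine.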

Regards,
Gene Soudlenkov
Comment 1 Tim Wickberg 2015-11-08 15:15:10 MST
That certainly sounds odd. I assume they have a default account set, and that you have been running with AccountingStorageEnforce set to assoc or limits for a while? Have there been any changes recently aside from adding this third account for that user?

Can you attach your current slurm.conf to the bug?

And can you try increasing the debug level on slurmctld and then submitting a job under that problematic account? There should be some hints as to what's happening that we can use to track down the issue. 

The relevant command is: "scontrol setdebug debug3"

You'll want to reset this with "scontrol setdebug info" afterwards - the log file can be rather verbose at that level.
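
Sketched out, that sequence would be something like the following (the account and job script are placeholders, and where the extra detail ends up depends on your SlurmctldLogFile/syslog configuration):

scontrol setdebug debug3
sbatch --account=<problem_account> test_job.sl    # reproduce the failure
# then inspect the slurmctld log for _job_create / "invalid account" lines
scontrol setdebug info                            # drop the verbosity back down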

- Tim
Comment 2 Gene Soudlenkov 2015-11-08 15:17:11 MST
I will certainly do all of this tomorrow. Meanwhile, to answer your account question - yes, there is a default account and it works fine. We have just discovered another user with a similar problem.

Cheers,
Gene
Comment 3 Gene Soudlenkov 2015-11-09 07:39:34 MST
Hi, Tim

This is our slurm.conf file:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-001-p
ControlAddr=10.0.111.240
BackupController=slurm-002-p
BackupAddr=10.0.111.239
# 
AuthType=auth/munge
CacheGroups=0
CheckpointType=checkpoint/blcr
CryptoType=crypto/munge
DisableRootJobs=NO 
#EnforcePartLimits=NO 
Epilog=/etc/slurm/epilog/job.sh
#EpilogSlurmctld= 
FirstJobId=15671726
MaxJobId=4100000000
GresTypes=gpu,io,gold
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/scratch/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0 
#JobRequeue=1 
JobSubmitPlugins=filter
#JobSubmitPlugins=lua
#KillOnBadExit=0 
#LaunchType=launch/slurm 
Licenses=intel*2,pgi*1,gold*32,fluent*512
MailProg=/bin/mail 
MaxJobCount=30000 
#MaxStepCount=40000 
#MaxTasksPerNode=128 
MpiDefault=none
MpiParams=ports=12000-12099
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/cgroup
Prolog=/etc/slurm/prolog/job.sh
#PrologSlurmctld= 
#PropagatePrioProcess=0 
PropagateResourceLimits=NONE
#PropagateResourceLimitsExcept=MEMLOCK,CPU 
#RebootProgram= 
ReturnToService=1
#SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL" 
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --mpi=none $SHELL" 
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root 
SrunEpilog=/etc/slurm/epilog/srun.sh
SrunProlog=/etc/slurm/prolog/srun.sh
StateSaveLocation=/var/spool/slurm
#StateSaveLocation=/var/spool
SwitchType=switch/none
TaskEpilog=/etc/slurm/epilog/task.sh
TaskPlugin=task/cgroup
#TaskPluginParam=
TaskProlog=/etc/slurm/prolog/task.sh
TopologyPlugin=topology/tree 
TmpFS=/tmp
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=1
# 
# 
# TIMERS 
BatchStartTimeout=300
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=300 
#HealthCheckProgram=/usr/sbin/nhc
InactiveLimit=120
KillWait=30
MessageTimeout=20 
#ResvOverRun=0 
MinJobAge=300
#OverTimeLimit=0 
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0
# 
# 
# SCHEDULING 
DefMemPerCPU=1024
FastSchedule=1
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill
SchedulerParameters=bf_window=14400,bf_resolution=60,max_job_bf=1000,max_job_start=10000,defer,bf_continue,kill_invalid_depend
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# 
# 
# JOB PRIORITY 
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0 
#PriorityCalcPeriod= 
PriorityFavorSmall=NO 
PriorityMaxAge=7-0 
#PriorityUsageResetPeriod=MONTHLY 
PriorityWeightAge=700 
PriorityWeightFairshare=40000 
PriorityWeightJobSize=500 
PriorityWeightPartition=1000
PriorityWeightQOS=0 
#
#
# PREEMPTION
PreemptType=preempt/partition_prio
PreemptMode=suspend,gang
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageEnforce=associations,qos
AccountingStorageHost=slurm-db-p
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/mysql
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=pancluster
#DebugFlags= 
JobCompHost=slurm-db-p
#JobCompLoc=
#JobCompPass=
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/audit
#JobCompUser=slurm
#JobAcctGatherFrequency=network=60,task=60
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=task=30
AcctGatherProfileType = acct_gather_profile/hdf5
#AcctGatherInfinibandType=acct_gather_infiniband/ofed
#SlurmctldDebug=7
#SlurmctldLogFile=/var/log/slurm/slurmctl.log
#SlurmdDebug=7
#SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=/var/log/slurm/slurmsched.log 
#SlurmSchedLogLevel=3
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
include /etc/slurm/nodes.conf
# PARTITIONS
include /etc/slurm/partitions.conf



And this is the debug output:


debug3: dependency=(null) account=uoa00380 qos=(null) comment=(null)
_job_create: invalid account or partition for user 5836, account 'uoa00380', and partition 'high'
debug3: argv="/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411/test.sl"
debug3: environment=SLURM_JOB_NAME=FS0411,SLURM_PRIO_PROCESS=0,SLURM_SUBMIT_DIR=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411,...
debug3: work_dir=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411 alloc_node:sid=login-01:31017
debug3: argv="/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411/test.sl"
debug3: dependency=(null) account=uoa00380 qos=(null) comment=(null)
debug3: environment=SLURM_JOB_NAME=FS0411,SLURM_PRIO_PROCESS=0,SLURM_SUBMIT_DIR=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411,...
_job_create: invalid account or partition for user 5836, account 'uoa00380', and partition 'high'
debug3: work_dir=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411 alloc_node:sid=login-01:31017


Please note that the uoa00380 account does exist, the 'high' partition does exist, and there are no limits or restrictions on either.
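
For completeness, these are the kinds of checks behind that statement (the format fields are illustrative rather than the exact commands we ran):

sacctmgr show assoc where user=fsuz133 format=cluster,account,user,partition,qos,maxjobs,maxsubmitjobs
sacctmgr show account uoa00380
scontrol show partition high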

Cheers,
Gene
Comment 4 Gene Soudlenkov 2015-11-10 06:21:23 MST
Hi, Tim

Just to reiterate - the same user can submit jobs fine when using a different account.

Cheers,
Gene
Comment 6 Tim Wickberg 2015-11-10 11:08:26 MST
Is your "JobSubmitPlugins=filter" changing anything that may be conflicting with Slurm's accounting? I doubt that's the problem but it may be worth checking if things clear up without it enabled.

One other thing to check - I assume you're using LDAP or some other directory service to propagate usernames/UIDs throughout the cluster? Have there been any connectivity issues with that lately? I usually expect to see a username rather than a numeric UID in that debug log slot, although that may be something we changed in a recent version, so it may or may not be a symptom of the problem.

If you don't mind, can you attach (or email me directly if you don't want it public on the bug report) the output for "scontrol show assoc" and "sacctmgr show assoc" ?
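
On the UID point, a quick sanity check that the directory resolves consistently on the submit hosts and on the slurmctld host would be something along these lines (assuming the 5836 in the debug output above really is this user's UID):

getent passwd fsuz133    # should return the same entry on the login nodes and on slurm-001-p
id -u fsuz133            # should print 5836 if that assumption holds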
Comment 7 Gene Soudlenkov 2015-11-10 11:16:07 MST
Hi, Tim

I checked the filter code and found nothing there that could result in this behaviour - every time the filter declines a request, we notify the user with a message. Also, this is the only account that does not work (although we discovered one more user with the same problem).

This is the output of the assoc command for this project:

pancluster   uoa00380                             100                                                                                                                                               normal                         
pancluster   uoa00380    brob695                  100                                                                                                                                               normal                         
pancluster   uoa00380    fsuz133                  100        


brob695 also has the same problem with submitting through this account. I have already tried deleting it and re-creating it, but the problem persists.

Gene
Comment 8 Tim Wickberg 2015-11-10 11:50:07 MST
Can you do "scontrol show assoc" as well? That gives us the cached view that slurmctld uses to permit/deny jobs, and may have some clue as to what's going on.

Comment 9 Gene Soudlenkov 2015-11-10 11:57:54 MST
Nope - "scontrol show assoc" gives an error:

invalid entity: assoc for keyword: show

Gene
Comment 10 Tim Wickberg 2015-11-10 13:07:57 MST
Sorry about that, I forgot that command is new to the 15.08 release.

Can you provide a longer chunk of the log? There may be some slurmdbd communication errors or something else affecting the slurmctld process. You can email that to me directly if you're concerned about keeping users/commands private.

Have there been any other accounts created since this one, or is this the most recent account? Can you add users to other existing accounts without issue and have them run there?

Comment 11 Gene Soudlenkov 2015-11-10 13:23:12 MST
Hmmm.... I think I found why this happened. We have been having a problem with Slurm picking up new accounts, and the only way to resolve it was to restart Slurm; there was a bug filed about it a little while ago. For some reason Slurm refused to restart on the login and build nodes from which the submit requests were sent. I just forced a restart of slurmd everywhere, killing everything to force it, and it finally picked up the new accounts and started working again. I guess we can close this ticket now. Our plan is to upgrade to v15 and hopefully this problem will be solved there.
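
For the record, the forced restart was roughly of this shape (a hypothetical sketch rather than the exact commands; host selection and init-script name depend on the local setup):

pdsh -a 'pkill -9 slurmd'        # kill any stuck slurmd processes, including on the login/build nodes
pdsh -a 'service slurm start'    # bring them back up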

Cheers,
Gene
Comment 12 Tim Wickberg 2015-11-12 08:18:08 MST
15.08 shouldn't have that behavior, although we're not aware of any reason that 14.11 would exhibit it currently. Restarting slurmd on the nodes shouldn't have an effect either.

If you're okay running as-is and expecting that 15.08 will resolve this I'll go ahead and mark this as resolved/worksforme for now, and you can always re-open it at any point.

If you'd like us to continue investigating, can you please send us full slurmctld and slurmdbd logs for at least a few hours around when this behavior happens? It certainly sounds like there's some communication issue which may be hinted at somewhere else in those logs.

- Tim
Comment 13 Gene Soudlenkov 2015-11-12 08:19:33 MST
Thanks, Tim - yesterday we upgraded to 15.08, so we will run it for a while and see. Meanwhile, I would like to close this ticket - thanks for your help!

Cheers,
Gene
Comment 14 Tim Wickberg 2015-11-12 08:27:52 MST
You're welcome, and thanks for your patience on this. Marking as closed now, please let us know if this recurs.

- Tim