Dear Sir,

We have a problem with one of the accounts. The user had two accounts with us and all his jobs were running fine. We added one more account, and now every submission attempt under it ends with:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

The account list for the user is correct:

$ sacctmgr show assoc format=account,user,partition where user=fsuz133
   Account       User  Partition
---------- ---------- ----------
 nesi00233    fsuz133
  uoa00380    fsuz133
  uoa00149    fsuz133

The problem seems to be related to the account itself, since none of the users associated with this account can submit. We have just discovered yet another account with the same problem. All accounts are created through the same procedure, no exceptions, yet only two exhibit the problem. Previously, restarting the Slurm daemon helped, but not this time.

Regards,
Gene Soudlenkov
That certainly sounds odd. I assume they have a default account set, and that you have been running with AccountingStorageEnforce set to assoc or limits for a while? Have there been any changes recently aside from adding this third account for that user?

Can you attach your current slurm.conf to the bug? And can you try increasing the debug level on slurmctld and then submitting a job under the problematic account? There should be some hints as to what's happening that we can use to track down the issue. The relevant command is "scontrol setdebug debug3". You'll want to reset this with "scontrol setdebug info" afterwards - the log file can be rather verbose at that level.

- Tim
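The capture sequence Tim describes can be sketched as follows; the job script name, account, and partition below are placeholders for the real ones, and these commands assume operator privileges on a live Slurm cluster:

```shell
# Raise slurmctld verbosity, reproduce the failure, then restore the level.
scontrol setdebug debug3

# Reproduce the denial under the problematic account (placeholders):
sbatch --account=uoa00380 --partition=high test.sl

# Restore normal logging so the slurmctld log doesn't balloon.
scontrol setdebug info
```

The relevant hints will then appear in the slurmctld log at debug3 level around the time of the failed submission.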
I will certainly do all these tomorrow. Meanwhile, to answer your account question - yes, there is a default account and it works fine. We just discovered another user with a similar problem.

Cheers,
Gene
Hi, Tim

This is our slurm.conf file:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-001-p
ControlAddr=10.0.111.240
BackupController=slurm-002-p
BackupAddr=10.0.111.239
#
AuthType=auth/munge
CacheGroups=0
CheckpointType=checkpoint/blcr
CryptoType=crypto/munge
DisableRootJobs=NO
#EnforcePartLimits=NO
Epilog=/etc/slurm/epilog/job.sh
#EpilogSlurmctld=
FirstJobId=15671726
MaxJobId=4100000000
GresTypes=gpu,io,gold
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/scratch/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
JobSubmitPlugins=filter
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
Licenses=intel*2,pgi*1,gold*32,fluent*512
MailProg=/bin/mail
MaxJobCount=30000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12099
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
Prolog=/etc/slurm/prolog/job.sh
#PrologSlurmctld=
#PropagatePrioProcess=0
PropagateResourceLimits=NONE
#PropagateResourceLimitsExcept=MEMLOCK,CPU
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
SrunEpilog=/etc/slurm/epilog/srun.sh
SrunProlog=/etc/slurm/prolog/srun.sh
StateSaveLocation=/var/spool/slurm
#StateSaveLocation=/var/spool
SwitchType=switch/none
TaskEpilog=/etc/slurm/epilog/task.sh
TaskPlugin=task/cgroup
#TaskPluginParam=
TaskProlog=/etc/slurm/prolog/task.sh
TopologyPlugin=topology/tree
TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=1
#
#
# TIMERS
BatchStartTimeout=300
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=300
#HealthCheckProgram=/usr/sbin/nhc
InactiveLimit=120
KillWait=30
MessageTimeout=20
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1024
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=bf_window=14400,bf_resolution=60,max_job_bf=1000,max_job_start=10000,defer,bf_continue,kill_invalid_depend
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
#PriorityCalcPeriod=
PriorityFavorSmall=NO
PriorityMaxAge=7-0
#PriorityUsageResetPeriod=MONTHLY
PriorityWeightAge=700
PriorityWeightFairshare=40000
PriorityWeightJobSize=500
PriorityWeightPartition=1000
PriorityWeightQOS=0
#
#
# PREEMPTION
PreemptType=preempt/partition_prio
PreemptMode=suspend,gang
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,qos
AccountingStorageHost=slurm-db-p
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/mysql
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=pancluster
#DebugFlags=
JobCompHost=slurm-db-p
#JobCompLoc=
#JobCompPass=
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/audit
#JobCompUser=slurm
#JobAcctGatherFrequency=network=60,task=60
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=task=30
AcctGatherProfileType=acct_gather_profile/hdf5
#AcctGatherInfinibandType=acct_gather_infiniband/ofed
#SlurmctldDebug=7
#SlurmctldLogFile=/var/log/slurm/slurmctl.log
#SlurmdDebug=7
#SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=/var/log/slurm/slurmsched.log
#SlurmSchedLogLevel=3
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
include /etc/slurm/nodes.conf
# PARTITIONS
include /etc/slurm/partitions.conf

And this is the debug output:

debug3: dependency=(null) account=uoa00380 qos=(null) comment=(null)
_job_create: invalid account or partition for user 5836, account 'uoa00380', and partition 'high'
debug3: argv="/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411/test.sl"
debug3: environment=SLURM_JOB_NAME=FS0411,SLURM_PRIO_PROCESS=0,SLURM_SUBMIT_DIR=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411,...
debug3: work_dir=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411 alloc_node:sid=login-01:31017
debug3: argv="/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411/test.sl"
debug3: dependency=(null) account=uoa00380 qos=(null) comment=(null)
debug3: environment=SLURM_JOB_NAME=FS0411,SLURM_PRIO_PROCESS=0,SLURM_SUBMIT_DIR=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411,...
_job_create: invalid account or partition for user 5836, account 'uoa00380', and partition 'high'
debug3: work_dir=/gpfs1m/projects/uoa00380/ransi.devendra/furitsu/FS0411 alloc_node:sid=login-01:31017

Please note that the uoa00380 account does exist, the partition high does exist, and there are no limits or restrictions on either.

Cheers,
Gene
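For reference, the denial can be isolated from a large debug3 log by grepping for the "_job_create" message. A self-contained sketch against a hypothetical two-line log excerpt (the path and log content here are illustrative only):

```shell
# Write a hypothetical two-line slurmctld log excerpt, then count how many
# times the association-denial message appears in it.
cat > /tmp/slurmctld_excerpt.log <<'EOF'
debug3: dependency=(null) account=uoa00380 qos=(null) comment=(null)
_job_create: invalid account or partition for user 5836, account 'uoa00380', and partition 'high'
EOF
grep -c "_job_create: invalid account" /tmp/slurmctld_excerpt.log   # prints 1
```

The same grep against the real slurmctld log narrows the window to the submissions that were actually rejected by the association check.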
Hi, Tim

Just to reiterate: the same user can submit jobs fine with a different account setting.

Cheers,
Gene
Is your "JobSubmitPlugins=filter" changing anything that may be conflicting with Slurm's accounting? I doubt that's the problem, but it may be worth checking whether things clear up without it enabled.

One other thing to check - I assume you're using LDAP or some other directory to propagate username/UID numbers throughout the cluster? Have there been any connectivity issues with that lately? I usually expect to see a username rather than a UID number in that debug log slot, although that may be something we changed in a recent version, so it may or may not be a symptom of the problem.

If you don't mind, can you attach (or email me directly if you don't want it public on the bug report) the output of "scontrol show assoc" and "sacctmgr show assoc"?
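One quick way to check the directory-service angle Tim raises is to confirm that the UID from the log resolves to the same username on the controller and on the submit host. A sketch; the hostnames are placeholders for the actual machines, and this assumes passwordless SSH between them:

```shell
# Verify that UID 5836 and user fsuz133 resolve identically on the
# slurmctld host and a login node (hostnames are placeholders).
for host in slurm-001-p login-01; do
    echo "== $host =="
    ssh "$host" 'getent passwd 5836; id fsuz133'
done
```

If one host returns nothing for the UID, or the UID/username mapping differs between hosts, that points at a directory propagation problem rather than a Slurm accounting one.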
Hi, Tim

I checked the filter code and found nothing there that could result in this behaviour - every time the filter declines a request, we notify the user with a message. Also, this is the only account that does not work (although we discovered one more user with the same problem).

This is the output of the assoc command for this project:

pancluster  uoa00380            100  normal
pancluster  uoa00380  brob695   100  normal
pancluster  uoa00380  fsuz133   100

brob695 also has the same problem submitting through this account. I already tried deleting the account and re-creating it, but the problem persists.

Gene
Can you run "scontrol show assoc" as well? That gives us the cached view that slurmctld uses to permit or deny jobs, and may offer some clue as to what's going on.
Nope - "scontrol show assoc" gives an error:

invalid entity: assoc for keyword: show

Gene
Sorry about that, I forgot that command is new to the 15.08 release.

Can you provide a longer chunk of the log? There may be some slurmdbd communication errors or something else affecting the slurmctld process. You can email that to me directly if you're concerned about keeping users/commands private.

Have there been any other accounts created since this one, or is this the most recent account? Can you add users to other existing accounts without issue and have them run there?
Hmmm... I think I found out why this happened. We have been having a problem with Slurm picking up new accounts, and the only way to resolve it was to restart Slurm - there was a bug filed about it a little while ago. For some reason slurmd refused to restart on the login and build nodes from which the submit requests were sent. I just forced a restart of slurmd everywhere, killing everything to make it happen, and it finally picked up the new accounts and started working again.

I guess we can close this ticket now. Our plan is to upgrade to v15 and hopefully this problem will be solved there.

Cheers,
Gene
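For the record, the forced restart Gene describes might look something like the following; this is only a sketch - the node list, service name, and kill strategy are site-specific placeholders, and killing daemons this way should be a last resort:

```shell
# Force-restart slurmd on the submit-side nodes (placeholders throughout).
for host in login-01 build-01; do
    echo "Restarting slurmd on $host"
    ssh "$host" 'pkill -9 slurmd; service slurm start'
done
```

After the restart, a submission under the previously failing account confirms whether the daemons picked up the new associations.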
15.08 shouldn't have that behavior, although we're not currently aware of any reason 14.11 would exhibit it either. Restarting slurmd on the nodes shouldn't have an effect here either.

If you're okay running as-is and expecting that 15.08 will resolve this, I'll go ahead and mark this as resolved/worksforme for now, and you can always re-open it at any point. If you'd like us to continue investigating, can you please send us full slurmctld and slurmdbd logs covering at least a few hours around when this behavior happens? It certainly sounds like there's some communication issue which may be hinted at elsewhere in those logs.

- Tim
Thanks, Tim - yesterday we upgraded to 15, so we will run it for a while and see. Meanwhile, I would like to close this ticket. Thanks for your help!

Cheers,
Gene
You're welcome, and thanks for your patience on this. Marking as closed now, please let us know if this recurs. - Tim