Summary: | RPC rate limit support: How to configure jobs per second per user? | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmctld | Assignee: | Benjamin Witham <benjamin.witham> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 23.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | Alineos Sites: | --- |
Version Fixed: | 23.11.0 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2023-10-31 09:51:13 MDT
The rate limiting feature was developed for a specific site, and there is no one size fits all for the settings. At this moment, there is no easy analog for your jobs-per-second request. I would suggest increasing your rl_refill_rate or the size of your buckets.
> What we're seeing in slurmctld.log is that this configuration permits only 2
> jobs to be submitted every 5 seconds or so
Is this 2 jobs per user, or 2 jobs overall?
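For anyone trying to build intuition for how rl_bucket_size and rl_refill_rate interact, here is a minimal sketch of a generic token bucket. It is not slurmctld's actual code. The class name, the helper function, and the one-second refill period are assumptions made for illustration. It also counts requests rather than jobs: one job submission may involve more than one RPC, and throttled clients back off and retry, so this gives at best an upper bound on jobs per second.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket (illustrative only, not Slurm's implementation)."""
    bucket_size: int             # maximum tokens held, i.e. the burst allowance
    refill_rate: int             # tokens added per refill period
    refill_period: float = 1.0   # seconds between refills (assumed here)

    def __post_init__(self):
        self.tokens = self.bucket_size   # a new user starts with a full bucket
        self.last_refill = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is accepted."""
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            self.tokens = min(self.bucket_size,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def accepted_per_second(bucket: TokenBucket, requests_per_second: int,
                        seconds: int) -> list:
    """Offer a constant request stream and count acceptances in each second."""
    counts = []
    for sec in range(seconds):
        accepted = 0
        for i in range(requests_per_second):
            if bucket.allow(sec + i / requests_per_second):
                accepted += 1
        counts.append(accepted)
    return counts


if __name__ == "__main__":
    # One user offering 100 requests/s against a 50-token bucket refilled at 10/s
    print(accepted_per_second(TokenBucket(bucket_size=50, refill_rate=10), 100, 4))
    # prints [50, 10, 10, 10]
```

In this model the bucket size only sets how large the initial burst can be. Once the bucket is drained, the sustained rate equals the refill rate, so raising rl_bucket_size alone would not change steady-state throughput.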
Hi Benjamin,

(In reply to Benjamin Witham from comment #1)
> The rate limiting feature was developed for a specific site, and there is no
> one size fits all for the settings. At this moment, there is no easy analog
> for your jobs-per-second request. I would suggest increasing your
> rl_refill_rate or the size of your buckets.

It's too bad that there is no way to make even an approximate guess at jobs-per-second. This makes it hard to optimize the user experience while at the same time protecting slurmctld from rogue users :-(

I already did increase rl_refill_rate and bucket size (did scontrol reconfig), but it didn't seem to improve jobs-per-second. I wonder what the developers had in mind when they designed the rl_enable related parameters?

> > What we're seeing in slurmctld.log is that this configuration permits only 2
> > jobs to be submitted every 5 seconds or so
>
> Is this 2 jobs per user, or 2 jobs overall?

Well, it's the same uid 1035 being throttled while he was submitting 500 jobs from his workflow system.

Ole

Hi Benjamin,

(In reply to Benjamin Witham from comment #3)
> The rate limits will not update without a slurmctld restart. Did you issue a
> slurmctld restart or just the scontrol reconfigure?

I only did "scontrol reconfig". AFAICT there is no documented requirement to restart slurmctld when modifying the rl_enable family of parameters. If a restart is indeed required, the slurm.conf manual page should have a highlighted entry in the "SlurmctldParameters" section saying so.

So now I restarted slurmctld, and the rate limit is logged without giving any details:

[2023-11-07T23:11:32.184] RPC rate limiting enabled
[2023-11-07T23:11:37.191] SchedulerParameters=default_queue_depth=2000,max_rpc_cnt=50,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=50000

Maybe the logging ought to be a bit more verbose?

Thanks,
Ole

My apologies, you are correct, a slurmctld restart is not needed for a rate limit change. Ignore my statement in comment 3.

The rate limit parameters are a SlurmctldParameters option, did this not print in your logs?

> Maybe the logging ought to be a bit more verbose?

What sort of more verbose logging output are you suggesting?

(In reply to Benjamin Witham from comment #5)
> My apologies, you are correct, a slurmctld restart is not needed for a rate
> limit change. Ignore my statement in comment 3.

IMHO, your statement in comment 3 seems to have been right: after I restarted slurmctld, rate limiting was much less frequent than with my original parameters. I'm fairly certain that "scontrol reconfig" didn't update the rl_enable parameters, because hundreds of thousands of rate limit lines were still logged after the reconfig. This of course varies with user habits. Can you confirm whether or not a slurmctld restart is really required?

> The rate limit parameters are a SlurmctldParameters option, did this not
> print in your logs?

I currently have this setting:

$ scontrol show config | grep SlurmctldParameters
SlurmctldParameters = enable_configless,rl_enable,rl_refill_rate=10,rl_bucket_size=50

When I restart slurmctld, only these lines are printed in the log:

[2023-11-13T20:15:08.988] RPC rate limiting enabled
[2023-11-13T20:15:13.995] SchedulerParameters=default_queue_depth=2000,max_rpc_cnt=50,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=50000

> > Maybe the logging ought to be a bit more verbose?
> What sort of more verbose logging output are you suggesting?

IMHO, ALL of the rl_enable family parameter values shown in the slurm.conf manual page ought to be printed to the log so that we have a record of the settings.

Thanks,
Ole

Ah, I understand. The rate limit parameters are currently logged only at the debug log level. I'll see if this is something we're interested in promoting. Do you have a specific use case for needing the new levels?

(In reply to Benjamin Witham from comment #8)
> Do you have a specific use case for needing the new levels?

In order to tune and debug rl_enable parameters, we need to know what they are! And we need slurmctld.log to record the settings at the normal logging level. In the present case, I have no idea what the currently active rl_enable parameters are, or whether they were updated by "scontrol reconfig" or a restart was necessary!

We would like to experiment empirically with rate limiting of user job submission in a trial-and-error fashion, given that the rl_enable parameters have no understandable relationship to user experience, as stated in comment 1.

Thanks,
Ole

We've upgraded the log message of the rate limit parameters from debug to info. This is in, ahead of 23.11.0, in commit 50cf4e284c.

(In reply to Benjamin Witham from comment #12)
> We've upgraded the log message of the rate limit parameters from debug to
> info. This is in, ahead of 23.11.0, in commit 50cf4e284c.

Thanks a lot, this will be nice to have in 23.11. I couldn't look up commit 50cf4e284c though, do you have a direct link?

Ole
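On releases where the rl_* values are not logged at the normal level, one workaround is to record the configured values yourself. Below is a minimal sketch that pulls the rl_* entries out of the "scontrol show config" output quoted earlier in this thread. The function names are hypothetical. Note that this reports the configured SlurmctldParameters string, which is not proof of what the running rate limiter is actually using.

```python
import subprocess


def parse_rl_parameters(scontrol_output: str) -> dict:
    """Return the rl_* entries of SlurmctldParameters as a dict.

    Flags without a value (such as rl_enable) map to True; numeric
    values are converted to int.
    """
    for line in scontrol_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SlurmctldParameters":
            params = {}
            for item in value.strip().split(","):
                name, eq, val = item.partition("=")
                if name.startswith("rl_"):
                    if not eq:
                        params[name] = True
                    else:
                        params[name] = int(val) if val.isdigit() else val
            return params
    return {}


def read_scontrol_config() -> str:
    """Run 'scontrol show config' and return its text output."""
    result = subprocess.run(["scontrol", "show", "config"],
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Sample line copied from the thread; on a cluster, use read_scontrol_config()
    sample = ("SlurmctldParameters = enable_configless,rl_enable,"
              "rl_refill_rate=10,rl_bucket_size=50")
    print(parse_rl_parameters(sample))
    # prints {'rl_enable': True, 'rl_refill_rate': 10, 'rl_bucket_size': 50}
```

Running this from cron or after each reconfigure, and appending the result to a site log, would give a timestamped record of the configured settings to compare against throttling behaviour.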