Bug 14472 - job_submit_plugins manual page
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 21.08.8
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Oscar Hernández
QA Contact: Ben Roberts
Reported: 2022-07-04 02:20 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2022-09-13 04:43 MDT
Site: DTU Physics
Version Fixed: 22.05.3


Description Ole.H.Nielsen@fysik.dtu.dk 2022-07-04 02:20:35 MDT
In the Job Submit Plugin API manual page https://slurm.schedmd.com/job_submit_plugins.html it is stated that:

> Slurm can be configured to use multiple job_submit plugins if desired, however the lua plugin will only execute one lua script named "job_submit.lua" located in the default script directory (typically the subdirectory "etc" of the installation directory).

In an RPM installation of Slurm the "default script directory" is defined as:

PluginDir = /usr/lib64/slurm

However, some conflicting information is also found in the manual page:

> The default installation location of the Lua scripts is the same location as the Slurm configuration file, slurm.conf.

I think this needs correction and clarification.  Some questions are:

1. Is it correct to assume that only 1 Lua plugin is possible, and that it must be installed in /etc/slurm/job_submit.lua?

2. Can one install both Lua and non-Lua plugins?  The slurm.conf man-page says:
JobSubmitPlugins: A  comma-delimited  list of job submission plugins to be used.  The specified plugins will be executed in the order listed. 

3. In a Configless Slurm installation where can we install the job_submit.lua file?  The /etc/slurm/ directory doesn't exist in login and compute nodes!

4. Would it be possible for Configless Slurm to provide also the /etc/slurm/job_submit.lua file?
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2022-07-04 02:39:33 MDT
As a test I've copied the example file /rpmbuild/BUILD/slurm-21.08.8-2/contribs/lua/job_submit.lua to /etc/slurm/ and restarted slurmctld.

However, when a user attempts to submit a job, an error message is printed:

$ sbatch script.sh
sbatch: error: Batch job submission failed: Unspecified error

and the slurmctld.log file says:

[2022-07-04T10:26:08.138] error: Couldn't find the specified plugin name for job_submit/job_submit.lua looking at all files
[2022-07-04T10:26:08.139] error: cannot find job_submit plugin for job_submit/job_submit.lua
[2022-07-04T10:26:08.139] error: cannot create job_submit context for job_submit/job_submit.lua
[2022-07-04T10:26:08.139] fatal: failed to initialize job_submit plugin

This causes slurmctld to crash!!!

It seems that correct documentation is urgently needed if the job_submit.lua plugin is to be activated.
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2022-07-04 03:39:47 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #1)
> As a test I've copied the example file
> /rpmbuild/BUILD/slurm-21.08.8-2/contribs/lua/job_submit.lua to /etc/slurm/
> and restarted slurmctld.

And I had configured this in slurm.conf:

JobSubmitPlugins=job_submit.lua
Comment 3 Oscar Hernández 2022-07-04 04:54:46 MDT
Dear Ole,

First of all, let me clarify that when the Slurm documentation mentions plugins, it refers to the *.so files that implement Slurm's modular functionality. It does not refer to the specific file (the job_submit.lua script) needed by the job_submit/lua plugin.

So "PluginDir" in slurm.conf is the path where all of these libraries are installed. For job_submit/lua, PluginDir should contain:

job_submit_lua.so (the one we are interested in, which implements the C functions of the plugin)

along with many more libraries that implement other Slurm behaviors.

Also, please note that to enable the plugin, slurm.conf should contain:

JobSubmitPlugins = lua #(not job_submit.lua)

Now, answering your questions one by one:

1. When you load the job_submit/lua plugin, Slurm searches for the job_submit.lua file. This file is a Lua script that must be created or modified by each site's system administrator to enforce the desired policies (we only provide an example). There can be just one file with this name, and Slurm looks for it in the same directory as slurm.conf, as mentioned in the documentation.

2. Yes, as the documentation states, you can enable multiple job_submit plugins at the same time. The currently available job_submit plugins are listed here[1], and they are executed in the order listed.

3-4. It also works in configless setups. The job_submit Lua script is executed by the Slurm controller, so only one copy is needed, in the same location as slurm.conf.

About the error you are getting: it is reported on startup/reconfigure because Slurm cannot find the plugin you specified (it is incorrectly specified). Configuring it in slurm.conf as suggested above should solve the issue.

Let me know if you still have issues after the suggested change, or if something was not explained clearly.

[1]https://slurm.schedmd.com/job_submit_plugins.html#overview
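
The naming rule explained in this comment can be illustrated with a short Python sketch that checks a JobSubmitPlugins value before it goes into slurm.conf. This is not part of Slurm. The function names `expected_library` and `check_job_submit_plugins` are hypothetical, the `job_submit_<name>.so` pattern is an assumption generalized from the `job_submit_lua.so` example above, and `require_timelimit` is used only as an example of a second plugin name.

```python
from pathlib import PurePosixPath


def expected_library(plugin_dir: str, name: str) -> str:
    """Return the shared-library path assumed for a job_submit plugin name."""
    # Assumption: every job_submit plugin follows the job_submit_<name>.so pattern.
    return str(PurePosixPath(plugin_dir) / f"job_submit_{name}.so")


def check_job_submit_plugins(value: str, plugin_dir: str = "/usr/lib64/slurm"):
    """Split a comma-delimited JobSubmitPlugins value and flag suspicious entries.

    Returns (libraries, problems): the assumed library paths in execution
    order, and a list of human-readable problems.
    """
    libraries, problems = [], []
    for raw in value.split(","):
        name = raw.strip()
        if not name:
            problems.append("empty plugin name")
        elif "." in name:
            # e.g. "job_submit.lua": that is the script file, not the plugin name.
            problems.append(f"{name!r} looks like a file name; use the plugin name, e.g. 'lua'")
        else:
            libraries.append(expected_library(plugin_dir, name))
    return libraries, problems


if __name__ == "__main__":
    print(check_job_submit_plugins("job_submit.lua"))
    print(check_job_submit_plugins("lua,require_timelimit"))
```

The first call reports the mistake made in comment 2. The second shows how a value with more than one plugin would be split, in the order in which the plugins are executed.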
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2022-07-04 05:34:19 MDT
Dear Oscar,

Thanks so much for the clarifications!  The current documentation is definitely confusing.

(In reply to Oscar Hernández from comment #3)
> First of all, let me clarify that when plugins are mentioned in the slurm
> documentation, it refers to the *.so files that implement modular
> functionalities of slurm. It is not referring to the specific file
> (job_submit.lua script) needed by the job_submit/lua plugin.
> 
> So, when slurm.conf refers to "PluginDir", it is the path where all the
> libraries will be placed. For the job_submit/lua, in plugin dir you should
> have:

Yes, I noticed the contents of "PluginDir" and became confused how the job_submit/lua plugin fits into this.

> job_submit_lua.so (the one we are interested in and implements the c
> functions of the plugin)
> 
> And many more libraries which implement different behaviors of slurm.
> 
> Also, please note that in order to enable the desired plugin, you should
> have it like that in slurm.conf:
> 
> JobSubmitPlugins = lua #(not job_submit.lua)

Ah, thanks a lot.  That is NOT documented anywhere!!  I refer to the slurm.conf manual page.

> Now, answering your questions one by one:
> 
> 1. When you load the job_submit/lua plugin, slurm will search for the
> job_submit.lua file. This file is a LUA script that must be modified/created
> by the system administrator of each site, enforcing the desired policies (we
> are just providing an example). There can be just one file with this name,
> and it will be searched in the same folder where slurm.conf is located. Just
> as it is mentioned in the documentation.

Understood.  I had configured JobSubmitPlugins=lua in slurm.conf, but forgot to copy the example job_submit.lua file to /etc/slurm/.  This caused a fatal crash of slurmctld:

[2022-07-04T13:23:59.139] error: job_submit/lua: Unable to stat /etc/slurm/job_submit.lua: No such file or directory
[2022-07-04T13:23:59.139] error: Couldn't load specified plugin name for job_submit/lua: Plugin init() callback failed
[2022-07-04T13:23:59.139] error: cannot create job_submit context for job_submit/lua
[2022-07-04T13:23:59.139] fatal: failed to initialize job_submit plugin

Request: Could the handling of a missing job_submit.lua file be changed to issue a warning message instead of a fatal crash?

> 2. Yes, as documentation reports, you can enable multiple job_submit plugins
> at the same time. Currently available job submit plugins are listed here[1]
> and will be executed in the order listed.

Could you document an example when JobSubmitPlugins=lua is augmented by additional plugins?  That's not obvious at all.

> 3-4 It will also work in configless setups. Job_submit lua script is
> executed by the slurm controller, so just one copy of it is needed, just as
> the slurm.conf and in its same location.

Ah, thanks, good to know.  I thought that job_submit.lua was processed on the submit host.

> About the error you are getting, it is reported on startup/reconfigure as
> slurm is not able to find the plugin you specified (as it is incorrectly
> specified). Configuring it in the slurm.conf as suggested earlier, should
> solve the issues.

Yes, configuring slurm.conf JobSubmitPlugins=lua and copying the example job_submit.lua file works!

> Let me know if, after the suggested change, you still have issues with that.
> Or some thing was not correctly clarified.
> 
> [1]https://slurm.schedmd.com/job_submit_plugins.html#overview

IMHO, the documentation ought to be clarified in the light of the confusion which I have experienced.

Thanks a lot,
Ole
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2022-07-04 05:41:07 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> In the Job Submit Plugin API manual page
> https://slurm.schedmd.com/job_submit_plugins.html it is stated that:
> 
> > Slurm can be configured to use multiple job_submit plugins if desired, however the lua plugin will only execute one lua script named "job_submit.lua" located in the default script directory (typically the subdirectory "etc" of the installation directory).

The wording "default script directory" seems to be incorrect!  What is meant is actually /etc/slurm.

Could you kindly update the documentation?
Comment 6 Oscar Hernández 2022-07-04 08:57:42 MDT
Dear Ole,

Thanks for your suggestions on that section of the slurm.conf documentation. 

I will discuss the suggested changes with my teammates and get back to you.

Kind regards,
Oscar
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2022-07-06 06:20:21 MDT
Hi Oscar,

I am unable to see comments 7, 8 and 9 in this bug.  Can you please make the comments available?

Thanks,
Ole
Comment 11 Oscar Hernández 2022-07-06 06:33:59 MDT
Hi Ole,

I apologize for that. The doc patch is currently undergoing QA, and that comment (and 7,8) was meant to be private. I have made it private now.

Basically we are discussing best ways of addressing your suggestions, to make the documentation clearer. I will let you know once it is definitive.

Sorry for the inconvenience,
Oscar
Comment 15 Oscar Hernández 2022-07-11 01:43:53 MDT
Dear Ole,

Thanks for your patience. A documentation patch has been pushed. The idea is to clarify the JobSubmitPlugins entry in the slurm.conf man page by listing the possible values and briefly explaining each one. So now we:

1. List relevant available plugins
2. Explain lua config details in its description
3. Add an example for multiple plugin config

Commit is: https://github.com/SchedMD/slurm/commit/6aa1a78d3664b8a6678511e0ba583b5a878bc49b

Also, regarding your request to turn the fatal error into a warning: that won't be possible for a fresh startup.

The reason is that on a fresh start of the daemons we expect all relevant configuration to be consistent and working. Otherwise, if nobody checks the logs, users could experience unexpected behavior.

However, for "scontrol reconfigure" we do take that into consideration. If a wrong value is listed in slurm.conf:

[2022-07-11T09:28:54.380] JobSubmitPlugins changed to luawrong
[2022-07-11T09:28:54.380] error: Couldn't find the specified plugin name for job_submit/luawrong looking at all files
[2022-07-11T09:28:54.382] error: cannot find job_submit plugin for job_submit/luawrong
[2022-07-11T09:28:54.382] error: cannot create job_submit context for job_submit/luawrong

We log the error but do not exit with a fatal error.

And if, after the reconfigure, job_submit.lua is incorrectly formatted (or missing), Slurm uses the previous version of the script (which we previously stored as a backup):

[2022-07-11T09:27:13.137] error: job_submit/lua: Unable to stat /2205/install/etc/job_submit.lua, using old script: No such file or directory

I hope that helps clear up any doubts. Let me know if you have any other questions.

Kind regards,
Oscar
Comment 16 Ole.H.Nielsen@fysik.dtu.dk 2022-07-11 02:07:36 MDT
Dear Oscar,

Thanks very much for the information and clarifications to the job_submit Lua plugin documentation!  This is now much clearer, and confusion should no longer be an issue.

Will the updated documentation be pushed to both 21.08 and 22.05 at the next minor version?

You're welcome to close this case.

Best regards,
Ole
Comment 17 Oscar Hernández 2022-07-11 09:42:00 MDT
Hi Ole,

> Will the updated documentation be pushed to both 21.08 and 22.05 at the next
> minor version?

It is pushed only to the current release (22.05), so the manual pages for the 21.08 release will not be updated. When the next minor release (22.05.3) comes out, the change will be available in the man pages of that release, and our online documentation will be updated at the same time.

> You're welcome to close this case.

Great. If you have any other related questions, just let me know.

Regards,
Oscar
Comment 18 Ole.H.Nielsen@fysik.dtu.dk 2022-09-08 00:27:56 MDT
I would like to reopen this case because there is a current discussion of job_submit.lua in the slurm-users list.

In Comment 15 above there is a very useful description of what happens in case of an error in the job_submit.lua file:

> And if, after the reconfigure, job_submit.lua is incorrectly formatted (or missing), Slurm uses the previous version of the script (which we previously stored as a backup):

I propose that this information should be added to the slurm.conf manual page in the section on JobSubmitPlugins right after this item:

lua
    Execute a Lua script implementing site's own job_submit logic. Only one Lua script will be executed. It must be named "job_submit.lua" and must be located in the default configuration directory (typically the subdirectory "etc" of the installation directory). Sample Lua scripts can be found with the Slurm distribution, in the directory contribs/lua. 

I'm sure this documentation would be appreciated by users.
Comment 20 Oscar Hernández 2022-09-09 04:12:44 MDT
Hi Ole,

Thanks for the suggestion. I agree that documenting this fallback behavior would improve the documentation. I will work on an update and let you know.

Now that I have re-read my comment, though, I would like to clarify the behavior a little.

We consider that a Lua script can fail in two different ways:
1 - The script is incorrectly formatted (invalid script) or missing.
2 - It fails at execution time (for example, a function receives unexpected values).

At slurmctld startup, Slurm only checks for (1): whether the script is valid and can be executed. If it is not, slurmctld fails with a fatal error, as discussed in previous comments.

As the Lua script is loaded each time a job is submitted, live modifications of this script are supported. There is no need to reconfigure between Lua script modifications. (My previous comment mentions a reconfigure, which might be misleading.)

This is where the fallback behavior comes in. If a modification made while slurmctld is running breaks the code (an error of type 1), or Slurm cannot find the file, Slurm uses a stored copy of the previous working script. Slurm also makes this clear by logging something like the following (note the "using previous script"):

[2022-09-09T11:21:46.038] error: job_submit/lua: /2205/install/etc/job_submit.lua: /2205/install/etc/job_submit.lua:22: unexpected symbol near '.', using previous script.

However, errors of type 2 (runtime errors) cannot be handled that way. Slurm has a valid script, so it executes it. If it fails, there is no fallback (some of the code might already have modified things). The error logs will only show the Lua error, so that the admin can understand what caused it and fix the script accordingly.

I just wanted to take the opportunity to clarify this. Please let me know if you have any questions about it.

Regards,
Oscar
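
The load and fallback behavior described in comment 20 can be modelled in a few lines of Python. This is only an analogy under stated assumptions, not Slurm's implementation: the class `ScriptLoader` and its methods are hypothetical, Python's built-in `compile()` stands in for the Lua parse step, and `SystemExit` stands in for slurmctld's fatal error.

```python
import os
import tempfile


class ScriptLoader:
    """Toy model: reload a script on every submission, keep the last good copy."""

    def __init__(self, path):
        self.path = path
        self.log = []
        # Startup: there is no previous script to fall back to, so a missing
        # or unparsable script is fatal (modelled here as SystemExit).
        try:
            self.good = self._load()
        except (OSError, SyntaxError) as err:
            raise SystemExit(f"fatal: failed to initialize: {err}")

    def _load(self):
        # Type 1 errors (missing file, invalid syntax) surface here.
        with open(self.path) as fh:
            return compile(fh.read(), self.path, "exec")

    def submit(self, job):
        # The script is re-read for each job, so edits apply without a reconfigure.
        try:
            self.good = self._load()
        except (OSError, SyntaxError) as err:
            self.log.append(f"error: {err}, using previous script")
        # Type 2 (runtime) errors are deliberately not caught: no fallback.
        env = {"job": job}
        exec(self.good, env)
        return env["job"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job_submit.py")
        with open(path, "w") as fh:
            fh.write("job['partition'] = 'debug'\n")
        loader = ScriptLoader(path)
        print(loader.submit({}))
        with open(path, "w") as fh:
            fh.write("job[ = broken\n")  # syntax error: falls back to the stored copy
        print(loader.submit({}), loader.log)
```

In this model, fixing the file on disk is enough: the next call to `submit` picks up the corrected script, which mirrors the "no reconfigure needed" point above.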
Comment 23 Ole.H.Nielsen@fysik.dtu.dk 2022-09-09 05:54:52 MDT
Hi Oscar,

(In reply to Oscar Hernández from comment #20)
> Thanks for the suggestion, I do also agree that documenting this fallback
> behavior would improve the documentation. I will work on an update and let
> you know.

Sounds good!

> We interpret that a lua script can fail in 2 different ways: 
> 1 - Because the script is wrong formatted (invalid script) or missing.
> 2 - It can fail at execution time (some function receiving unexpected
> values).
> 
> At slurmctld startup. Slurm will only check (1) if the script is valid and
> can be executed. If it is not, it will fail with a fatal as discussed in
> previous comments.

IMHO, the fatal crash of slurmctld in this case is unwarranted!  It is much better to have the slurmctld working, but ignore the LUA script.  Is there any reason why SchedMD thinks that a crash is preferable?

> Here comes the fallback behavior. If some of the modifications done while
> slurmctld is running break the code ( error of type 1), or slurm cannot find
> the file. Slurm will use a stored version of the previous "working script".
> Slurm will also make clear this behaviour by showing in the logs something
> like (note the "using previous script"):
> 
> [2022-09-09T11:21:46.038] error: job_submit/lua: /2205/install/etc/job_submit.lua: /2205/install/etc/job_submit.lua:22: unexpected symbol near '.', using previous script.
> 
> However, dealing with errors of type 2 (runtime errors) cannot be controlled
> that way. Slurm has a valid script, so it executes it. If it fails, there is
> no fallback behavior here (as some of the code might have already modified
> things). Error logs will only show the lua error, so that the admin can
> understand what situation caused the error and fix the script accordingly.

Sounds reasonable.  But suppose that the Lua script has an error of type 1 or 2, and the sysadmin now corrects the script.  The slurmctld will, however, continue to use the stored script instead of the script file on disk, right?  How can the sysadmin request that slurmctld should now use the (hopefully) correct script?  Should we use "scontrol reconfig" or what?

Thanks,
Ole
Comment 24 Oscar Hernández 2022-09-09 07:01:10 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #23)

> IMHO, the fatal crash of slurmctld in this case is unwarranted!  It is much
> better to have the slurmctld working, but ignore the LUA script.  Is there
> any reason why SchedMD thinks that a crash is preferable?

As I commented in a previous message in this thread, on a fresh start of the daemons we expect all relevant configuration to be consistent and working. We do not want to disable something that has been explicitly requested. The reason for the script backup procedure is precisely to keep a running slurmctld from hitting a fatal error because of an unwanted modification (or file deletion), so it continues working with a backup version and reports this in the error log.

> Sounds reasonable.  But suppose that the LUA script has an error of type 1
> or 2, and the sysadmin now corrects the script.  The slurmctld will,
> however, continue to use the stored script instead of the script file on
> disk, right?  How can the sysadmin request that slurmctld should now use the
> (hopefully) correct script?  Should we use "scontrol reconfig" or what?

Every time a job is submitted, Slurm tries to read the script. So if the script is corrected, the corrected script is executed at the next job submission (no reconfigure needed). You can therefore be certain that you are running the latest version unless you see a message like:

error: [reason why failed..], using previous script.
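
One way to act on this is to scan the slurmctld log for the fallback messages. The sketch below is a hypothetical helper, not a Slurm tool: the name `find_fallbacks` is invented, and the two marker phrases are taken from the log excerpts quoted in this thread ("using old script" in comment 15, "using previous script" in comment 20). Other Slurm versions may word these messages differently.

```python
import re

# Marker phrases copied from the log excerpts in this thread (an assumption
# about message wording, which may differ between Slurm versions).
FALLBACK_RE = re.compile(r"job_submit/lua: .*using (?:old|previous) script")
TIMESTAMP_RE = re.compile(r"^\[([^\]]+)\]")


def find_fallbacks(lines):
    """Return (timestamp, line) pairs for log lines reporting a script fallback."""
    hits = []
    for line in lines:
        if FALLBACK_RE.search(line):
            match = TIMESTAMP_RE.match(line)
            hits.append((match.group(1) if match else None, line.rstrip("\n")))
    return hits
```

Any iterable of lines works, so an open log file can be passed directly. The location of that file depends on the site's SlurmctldLogFile setting.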
Comment 25 Ole.H.Nielsen@fysik.dtu.dk 2022-09-09 07:24:50 MDT
Hi Oscar,
Thanks, this is extremely useful to know.  The handling of the LUA script as you explain ought to be added to the documentation.
Thanks,
Ole
Comment 28 Oscar Hernández 2022-09-13 02:13:30 MDT
Hi Ole,

The documentation update has already been pushed. Hopefully it clarifies the behavior discussed in the last comments.

commit 100be455152fdbcd77a47190425d06bdf4b926f4

    Docs - Document lua script fallback behavior
    
    Bug 14472

Thanks again for the feedback,

Oscar
Comment 29 Ole.H.Nielsen@fysik.dtu.dk 2022-09-13 02:21:06 MDT
I'm out of the office, back on September 14.

Best regards,
Ole Holm Nielsen