Bug 5106 - Recommended lua version for 17.11?
Summary: Recommended lua version for 17.11?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
Reported: 2018-04-26 15:45 MDT by Lyn
Modified: 2018-05-09 01:25 MDT
Site: NASA - NCCS
Machine Name: Discover
Version Fixed: 17.11.6 18.08.0-pre2


Attachments
NCCS job_submit.lua relevant to next update. (3.49 KB, text/plain)
2018-04-27 15:56 MDT, Lyn
NCCS slurm.conf for Discover SCU14 cluster (7.15 KB, text/plain)
2018-04-27 16:31 MDT, Lyn
add ESLURM_INVALID_TIME_LIMIT to job_submit/lua plugin (607 bytes, patch)
2018-04-27 18:38 MDT, Tim Wickberg

Description Lyn 2018-04-26 15:45:52 MDT
Hi SchedMD Gang,

We have a job_submit.lua that is working, short of the fact that it doesn't recognize and properly handle the error codes (e.g., ESLURM_INVALID_TIME_VALUE) from slurm/slurm_errno.h. 

Is there a recommended version of Lua to work w/17.11? In a long-ago exchange w/DA -- 14.11 timeframe -- he noted that 5.1 was better then. SLES12SP3 installs 5.2, so that's what got integrated into our 17.11 build. 

Appreciate any suggestions!

All the best,
Lyn & Bruce
Comment 1 Tim Wickberg 2018-04-26 16:05:35 MDT
(In reply to Lyn from comment #0)
> Hi SchedMD Gang,
> 
> We have a job_submit.lua that is working, short of the fact that it doesn't
> recognize and properly handle the error codes (e.g.,
> ESLURM_INVALID_TIME_VALUE) from slurm/slurm_errno.h. 
>
> Is there a recommended version of Lua to work w/17.11? In a long-ago
> exchange w/DA -- 14.11 timeframe -- he noted that 5.1 was better then.
> SLES12SP3 installs 5.2, so that's what got integrated into our 17.11 build. 

5.1, 5.2, or 5.3 should all be fine. If 5.2 is already packaged I'd stick with that.

- Tim
Comment 2 Lyn 2018-04-27 09:29:28 MDT
Thanks, Tim. 

Okay, so given that 5.2 should work, do you want the details of the ESLURM_<blah> that is not working properly (while others are) in this bug? Or do you want to close this one, since you've completed the action for the initial request? (I used to run 2 backline support teams for Sun; I know which answer I'd give.... :)

Best,
Lyn
Comment 3 Tim Wickberg 2018-04-27 14:43:00 MDT
(In reply to Lyn from comment #2)
> Thanks, Tim. 
> 
> Okay, so given that 5.2 should work, do you want the details of the
> ESLURM_<blah> that is not working properly (while others are) in this bug?
> Or do you want to close this one, since you've completed the action for the
> initial request? (I used to run 2 backline support teams for Sun; I know
> which answer I'd give.... :)

Either is fine actually; it makes no real difference to how our workflow is set up.

Taking a quick look, I don't think we export anything beyond these at present:

        lua_pushnumber (L, SLURM_FAILURE);
        lua_setfield (L, -2, "FAILURE");
        lua_pushnumber (L, SLURM_ERROR);
        lua_setfield (L, -2, "ERROR");
        lua_pushnumber (L, SLURM_SUCCESS);
        lua_setfield (L, -2, "SUCCESS");
        lua_pushnumber (L, ESLURM_INVALID_LICENSES);
        lua_setfield (L, -2, "ESLURM_INVALID_LICENSES");

If there are other references you need I can look into adding them in the future.
Comment 4 Lyn 2018-04-27 15:56:30 MDT
Created attachment 6714 [details]
NCCS job_submit.lua relevant to next update.
Comment 5 Lyn 2018-04-27 16:01:16 MDT
Thanks, Tim.

I just attached a copy of job_submit.lua to this bug. 

The main thing we're trying to do is reject requests that specify --qos=long, with a short (<=720mins) timelimit, which is most likely to happen when somebody is gaming us to get more jobs in the queue. So ESLURM_INVALID_TIME_LIMIT is the one we'd like to have functional in the future.

Right now, I can at least succeed in that rejection goal at submit time, albeit w/the ugly "sbatch:.... Unspecified error" that results from exiting slurm.FAILURE (or slurm.ERROR). 
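A minimal sketch of the submit-time check described above, not the attached NCCS script. It assumes the job_submit/lua fields job_desc.qos and job_desc.time_limit (in minutes) and the slurm.NO_VAL "unset" marker; the stub table, LONG_QOS, MIN_LONG_MINUTES, and invalid_time_rc are hypothetical names added so the example also runs under a plain Lua interpreter. It returns slurm.ESLURM_INVALID_TIME_LIMIT only when the plugin exports it and falls back to slurm.ERROR otherwise.

```lua
-- Stand-in for the table slurmctld injects (assumption: values mirror
-- SLURM_SUCCESS = 0, SLURM_ERROR = -1, NO_VAL = 0xfffffffe).
-- Inside slurmctld the real `slurm` table is used instead.
slurm = slurm or {
    SUCCESS = 0,
    ERROR = -1,
    NO_VAL = 4294967294,
    log_user = function(msg) print(msg) end,
}

local LONG_QOS = "long"        -- QOS name taken from the report
local MIN_LONG_MINUTES = 720   -- limits at or below this are rejected

-- Prefer the specific error code when it is exported; otherwise fall back
-- to the generic one so the hook always returns a number.
function invalid_time_rc()
    return slurm.ESLURM_INVALID_TIME_LIMIT or slurm.ERROR
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    local limit = job_desc.time_limit
    if job_desc.qos == LONG_QOS
       and limit ~= nil and limit ~= slurm.NO_VAL
       and limit <= MIN_LONG_MINUTES then
        slurm.log_user(string.format(
            "INVALID REQUESTED timelimit=%d minutes for qos %s",
            limit, LONG_QOS))
        return invalid_time_rc()
    end
    return slurm.SUCCESS
end

-- A deployed job_submit.lua conventionally ends with `return slurm.SUCCESS`;
-- it is left out here so the sketch can be loaded and exercised directly.
```

With the fallback in place the same script should behave sensibly on builds with and without the new constant; only the error text shown by sbatch would differ.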

At this point, however, there's a weird thing that's happening:

In the modify function, slurm.log_user doesn't function properly. Its messages do not show up in the output of the scontrol update command. They "disappear" (*bear with me), and are _not_ noted as having produced any error in the ctld log. And in the modify function, when I issue the return slurm.FAILURE, an error is written to the ctld log saying that it received a non-numeric value; then it just goes ahead and honors the modify request.

lgerner@rm5:~> sbatch --hold -t 721 -q long ~lgerner/bin/sleep10
Submitted batch job 36000297
lgerner@rm5:~> scontrol update jobid=36000297 timelimit=600
### should be my slurm.log_user("INVALID REQUESTED... msg here, but there isn't ###

[2018-04-27T17:36:31.259] job_modify/lua: /etc/slurm/job_submit.lua: non-numeric return code
[2018-04-27T17:36:31.259] sched: update_job: setting time_limit to 600 for job_id 36000297

lgerner@rm5:~> scontrol show job 36000297 |grep -i timelimit
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A

lgerner@rm5:~> scontrol update jobid=36000297 timelimit=500
lgerner@rm5:~> scontrol update jobid=36000297 timelimit=400

*And here's the ultimate weirdness: 

The next time I do an sbatch from the same test user's login session, all the disappeared error messages from prior scontrol cmds get splattered out _as sbatch errors_, before it reports on any new sbatch error:

lgerner@rm5:~> sbatch --hold -t 721 -q long ~lgerner/bin/sleep10
sbatch: INVALID REQUESTED timelimit=600 minutes for qos long
sbatch: INVALID REQUESTED timelimit=500 minutes for qos long
sbatch: INVALID REQUESTED timelimit=400 minutes for qos long
Submitted batch job 36000298

Thanks for any further efforts, but they can definitely wait til next week.

Good weekend,
Lyn
Comment 6 Lyn 2018-04-27 16:31:51 MDT
Created attachment 6715 [details]
NCCS slurm.conf for Discover SCU14 cluster

Not sure you need it, but attaching scontrol show conf output w/sanitizing, for completeness.
Comment 7 Lyn 2018-04-27 16:35:32 MDT
Wondering if there might be some complication from the fact that we have both job_submit/pbs and job_submit/lua in our stack.
Comment 8 Tim Wickberg 2018-04-27 18:38:02 MDT
Created attachment 6716 [details]
add ESLURM_INVALID_TIME_LIMIT to job_submit/lua plugin

There is no slurm.SLURM_FAILURE - it needs to be just slurm.FAILURE within lua. In lua it looks like we've decided to drop the SLURM_ prefix for SUCCESS / FAILURE / ERROR.

If the return code references a name Lua can't resolve, the plugin logs the "job_modify/lua: /etc/slurm/job_submit.lua: non-numeric return code" warning and still accepts the request.
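To illustrate the point above: in Lua, reading a field that a table does not have yields nil rather than raising an error, so a misspelled constant silently becomes a non-numeric return value. The sketch below shows this with a stub slurm table (an assumption so it runs standalone) and a hypothetical numeric_rc helper that turns any non-number into an explicit failure.

```lua
-- Stand-in for the table slurmctld injects (assumption: values mirror the
-- C constants). Inside slurmctld the real `slurm` table is used instead.
slurm = slurm or { SUCCESS = 0, ERROR = -1, FAILURE = -1 }

-- Hypothetical helper: make sure a hook never hands back a non-number.
function numeric_rc(rc)
    if type(rc) == "number" then
        return rc
    end
    return slurm.ERROR   -- nil (or any other type) becomes an explicit failure
end

-- There is no SLURM_FAILURE field, so this lookup evaluates to nil.
local misspelled = slurm.SLURM_FAILURE
print(type(misspelled))
print(numeric_rc(misspelled) == slurm.ERROR)
```

Wrapping each hook's return value as `return numeric_rc(...)` is one way to keep a typo from turning a rejection into an acceptance.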

Unfortunately, log_user cannot be used within job_modify() currently. What's happening here is that those messages are being pushed into a queue, and dispatched on the next job_submit() instead. I can push a fix to prevent this from happening, but the current RPCs do not support pushing a message back to the user on job modification. I can open a separate enhancement request to cover that if you'd like.
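Given that limitation, a modify hook can still reject the change and record the reason in the slurmctld log. The following is only a sketch under assumptions: it uses the slurm_job_modify(job_desc, job_rec, part_list, modify_uid) hook and slurm.log_info, and it assumes job_rec.qos exposes the job's current QOS name when the update itself does not set one. The stub table is hypothetical and exists only so the example runs standalone.

```lua
-- Stand-in for the table slurmctld injects (assumption: values mirror the
-- C constants). Inside slurmctld the real `slurm` table is used instead.
slurm = slurm or {
    SUCCESS = 0,
    ERROR = -1,
    NO_VAL = 4294967294,
    log_info = function(msg) print(msg) end,
}

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    local limit = job_desc.time_limit
    -- job_desc carries only the fields being changed, so fall back to the
    -- existing job record for the QOS (assumed field: job_rec.qos).
    local qos = job_desc.qos or job_rec.qos
    if qos == "long"
       and limit ~= nil and limit ~= slurm.NO_VAL
       and limit <= 720 then
        -- log_info goes to the controller log, not to the user's terminal.
        slurm.log_info(string.format(
            "job_modify: rejecting timelimit=%d minutes for qos long (uid %s)",
            limit, tostring(modify_uid)))
        return slurm.ERROR   -- always a number, never an unresolved name
    end
    return slurm.SUCCESS
end
```

The user would see only a generic error from scontrol, but the request is refused and the reason is available to administrators.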

The attached patch adds slurm.ESLURM_INVALID_TIME_LIMIT to job_submit.lua, and should be in the next maintenance release.

- Tim
Comment 14 Lyn 2018-04-30 14:29:48 MDT
Hi Tim,

Thanks for confirming my inferences about the various return codes, etc., and many thanks for the patch. Appreciate the RFE offer, very much, but Bruce and I agree that our desire for the slurm.log_user capability in the job_modify function doesn't rise to the level of "need," so we won't ask you to open one. (At this time, we think the patch will be sufficient. If we get new use cases, we might change our minds. :) Meantime, feel free to close this ticket.

Best,
Lyn & Bruce
Comment 16 Tim Wickberg 2018-05-08 22:37:42 MDT
These two fixes - one to add ESLURM_INVALID_TIME_LIMIT, one to block the use of log_user() in job_modify - will be in 17.11.6.

- Tim