Bug 14500

Summary: Job Submit Lua plugin needs access to slurm_errno.h error codes
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Other Assignee: Skyler Malinowski <skyler>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: DTU Physics Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 23.02.0pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Ole.H.Nielsen@fysik.dtu.dk 2022-07-08 05:41:10 MDT
We're trying to implement a job_submit plugin, starting from the Lua script in etc/job_submit.lua.example.

I have added the Lua snippet from the mailing list https://lists.schedmd.com/pipermail/slurm-users/2020-December/006459.html in order to check for GRES and GPUs in jobs:

  ESLURM_INVALID_GRES=2072
  ESLURM_BAD_TASK_COUNT=2025
  if (job_desc.partition ~= slurm.NO_VAL) then
    if (job_desc.partition ~= nil) then
      if (string.match(job_desc.partition, "gpgpu") or string.match(job_desc.partition, "gpgputest")) then
        --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
        --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
        if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
          local numgpu = string.match(job_desc.gres, ":%d+$")
          if (numgpu ~= nil) then
            numgpu = numgpu:gsub(':', '')
            if (tonumber(numgpu) < 1) then
              slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
              return ESLURM_INVALID_GRES
            end
          end
        else
          --Alternative use gpus in new version of slurm
          if (job_desc.tres_per_node == nil) then
            if (job_desc.tres_per_socket == nil) then
              if (job_desc.tres_per_task == nil) then
                slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
                return ESLURM_INVALID_GRES
              else
                if (job_desc.num_tasks == slurm.NO_VAL) then
                  slurm.user_msg("--gpus-per-task option requires --tasks specification")
                  return ESLURM_BAD_TASK_COUNT
                end
              end
            end
          end
        end
      end
    end
  end


What bothers me here is that one must define the numeric codes for ESLURM_INVALID_GRES and ESLURM_BAD_TASK_COUNT by hand.  The values seem to have been worked out by the programmer from the slurm_err_t typedef in the source file slurm/slurm_errno.h.

Using numeric return codes gives the expected error message to the user.  For example, with "return ESLURM_BAD_TASK_COUNT" (i.e. code 2025) I get:

sbatch: error: Batch job submission failed: Task count specification invalid

But if I try to use the following in my Lua script, by analogy to "return slurm.ERROR", then this seems to be equivalent to slurm.SUCCESS:

return slurm.ESLURM_BAD_TASK_COUNT

I note that only a very small subset of the possible return codes is listed as symbolic names in
https://slurm.schedmd.com/job_submit_plugins.html#job_modify_returns
but I could not figure out from the source src/plugins/job_submit/lua/job_submit_lua.c how this subset is selected.

Question: How can Lua scripts be enabled to return all possible ESLURM* error codes from slurm_errno.h using symbolic names instead of numeric values?

Thanks,
Ole
Comment 1 Skyler Malinowski 2022-07-11 11:12:07 MDT
> Using numeric return codes gives the expected error message to the user. 
> For example, with "return ESLURM_BAD_TASK_COUNT" (i.e. code 2025) I get:
> 
> sbatch: error: Batch job submission failed: Task count specification invalid
> 
> But if I try to use the following in my Lua script, by analogy to "return
> slurm.ERROR", then this seems to be equivalent to slurm.SUCCESS:
> 
> return slurm.ESLURM_BAD_TASK_COUNT

That is because slurm.ESLURM_BAD_TASK_COUNT is not defined, so the value returned is effectively treated as 0. slurm.SUCCESS is defined as 0, so this behavior makes sense.
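
A minimal sketch of the effect, using a stand-in `slurm` table (hypothetical; in a real script the plugin provides the table, and only the values 0 and 2025 are taken from this report). In plain Lua, reading a field that was never assigned yields nil rather than raising an error, and per the report above the submission is then handled like slurm.SUCCESS:

```lua
-- Stand-in for the table the plugin provides; hypothetical, for illustration only.
local slurm = { SUCCESS = 0 }

-- A field that was never assigned reads as nil; no error is raised.
local undefined_code = slurm.ESLURM_BAD_TASK_COUNT
print(undefined_code)   -- nil

-- Once the field is defined, the same expression yields the numeric code.
slurm.ESLURM_BAD_TASK_COUNT = 2025
local defined_code = slurm.ESLURM_BAD_TASK_COUNT
print(defined_code)     -- 2025
```

This is why a misspelled or not-yet-exposed symbolic name fails silently instead of producing an error in the script.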

> Question: How can Lua scripts be enabled to return all possible ESLURM*
> error codes from slurm_errno.h using symbolic names instead of numeric
> values?

Looks like slurm_lua.c:_register_slurm_output_functions is responsible for loading the slurm_errno.h enum as a consumable symbol. As of right now, Slurm does not expose all the errno enum entries to Lua, hence there is nothing that can be enabled by conf to expose them all.

I will be working on a patch to expose all the error codes (or at least the ones that make sense).

In the meantime though, you can manually define each slurm.ESLURM_* (as desired) in your Lua script -- as you already have been doing. You can override/extend the slurm Lua object with additional fields at runtime.

E.g.
> -- job_submit.lua
> slurm.ESLURM_INVALID_GRES=2072
> slurm.ESLURM_BAD_TASK_COUNT=2025
> 
> function slurm_job_submit(job_desc, part_list, submit_uid)
>   ...
>   if (job_desc.num_tasks == slurm.NO_VAL) then
>      return slurm.ESLURM_BAD_TASK_COUNT
>   end
>   ...
>  return slurm.SUCCESS
> end

Or create a new object for the manually defined Slurm error map.
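
A minimal sketch of that second option, assuming a separate table named `slurm_err` (the name and the strict-lookup metatable are illustrative assumptions, not part of Slurm; the two numeric values are the ones quoted in this report). The `__index` handler makes a misspelled or missing name raise an error instead of quietly evaluating to nil:

```lua
-- Hypothetical helper table; the name slurm_err is an assumption, not part of Slurm.
-- The numeric values are the ones quoted earlier in this report.
local slurm_err = setmetatable({
  ESLURM_BAD_TASK_COUNT = 2025,
  ESLURM_INVALID_GRES   = 2072,
}, {
  -- Fail loudly on an unknown name instead of returning nil.
  __index = function(_, name)
    error("unknown Slurm error code: " .. tostring(name), 2)
  end,
})

-- A known name resolves to its numeric code.
print(slurm_err.ESLURM_INVALID_GRES)

-- An unknown name raises an error, caught here with pcall for demonstration.
local ok, msg = pcall(function() return slurm_err.ESLURM_NO_SUCH_CODE end)
print(ok, msg)
```

Keeping the hand-maintained codes in their own table also avoids touching the `slurm` object at all.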

Best,
Skyler
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2022-07-12 00:02:29 MDT
Hi Skyler,

(In reply to Skyler Malinowski from comment #1)
> Looks like slurm_lua.c:_register_slurm_output_functions is responsible for
> loading the slurm_errno.h enum as a consumable symbol. As of right now,
> Slurm does not expose all the errno enum entries to Lua, hence there is
> nothing that can be enabled by conf to expose them all.
> 
> I will be working on a patch to expose all the error codes (or at least the
> ones that make sense).

Thanks, I see the ESLURM* etc. symbols defined in slurm_lua.c.  It would be great to extend the list of symbols exposed to Lua!

I think that documentation in job_submit_plugins should also state clearly where the return codes can be looked up.  For example, in the section

  Returns:
  slurm.SUCCESS — Job submission accepted by plugin.

append something like this:

  All ESLURM* fields in the source file slurm/slurm_errno.h are available to Lua as slurm.ESLURM* return codes.

> In the meantime though, you can manually define each slurm.ESLURM_* (as
> desired) in your Lua script -- as you already have been doing. You can
> override/extend the slurm lua object with additional fields at runtime.
> 
> E.g.
> > -- job_submit.lua
> > slurm.ESLURM_INVALID_GRES=2072
> > slurm.ESLURM_BAD_TASK_COUNT=2025

I don't really know Lua, but are you saying that in Lua you are allowed to define arbitrary fields such as ESLURM_BAD_TASK_COUNT in the slurm data structure as in your examples?  Without overwriting existing fields or overwriting data outside the slurm data structure?

Thanks,
Ole
Comment 3 Skyler Malinowski 2022-07-12 07:44:28 MDT
> I don't really know Lua, but are you saying that in Lua you are allowed to
> define arbitrary fields such as ESLURM_BAD_TASK_COUNT in the slurm data
> structure as in your examples?  Without overwriting existing fields or
> overwriting data outside the slurm data structure?

Define arbitrary object fields: yes. Without overwriting: no. Assigning to a field name that already exists replaces its value. However, manually adding error codes to the Lua slurm object is safe enough to consider, and should be convenient once the patch is introduced.
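
One way to add fields without replacing existing ones is a small guard, sketched here against a stand-in `slurm` table (hypothetical; `define_if_missing` is an illustrative helper, not a Slurm function, and the numeric values are the ones quoted in this report):

```lua
-- Stand-in for the table the plugin provides; hypothetical, for illustration only.
local slurm = { SUCCESS = 0 }

-- Hypothetical helper: set a field only when it is not already defined.
local function define_if_missing(tbl, name, value)
  if tbl[name] == nil then
    tbl[name] = value
  end
  return tbl[name]
end

-- New name: the field is added.
define_if_missing(slurm, "ESLURM_INVALID_GRES", 2072)
-- Existing name: the current value is kept, not replaced.
define_if_missing(slurm, "SUCCESS", 99)

print(slurm.ESLURM_INVALID_GRES, slurm.SUCCESS)   -- 2072 and 0
```

With a guard like this, a value that Slurm itself defines in a later release would take precedence over the hand-written number.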
Comment 6 Skyler Malinowski 2022-07-29 09:08:34 MDT
Hi Ole,

The patch to expose all Slurm error codes in Lua plugins was merged for 23.02.0pre1.

If interested in the relevant commits, please see the following:
> https://github.com/SchedMD/slurm/compare/fac34e6adc..423585c632

Cheers,
Skyler