Ticket 19564 - job_submit.lua needs access to --mem or --mem-per-cpu values
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.02.7
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Ricard Zarco Badia
 
Reported: 2024-04-10 05:38 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2024-04-11 06:52 MDT
Site: DTU Physics


Description Ole.H.Nielsen@fysik.dtu.dk 2024-04-10 05:38:28 MDT
In our job_submit.lua script we would like to check (for a certain partition with 4TB memory nodes) that the users submit jobs which *must* specify --mem or --mem-per-cpu and that the values are above certain minimum limits.

However, I haven't been able to find the Lua job_desc fields describing --mem or --mem-per-cpu values.  I also don't know if the fields will contain NIL or NO_VAL in case the user omitted them.

Could you kindly inform me about the fields that might be named similar to job_desc.min_mem_per_node and job_desc.min_mem_per_cpu?

The kind of code I'm trying is:

local min_mem_per_node = 256000
if job_desc.min_mem_per_node ~= nil and job_desc.min_mem_per_node < min_mem_per_node then
  ...
  return slurm.ESLURM_INVALID_TASK_MEMORY
end

Could you offer an example code which performs this kind of check?

Thanks a lot,
Ole
Comment 1 Ricard Zarco Badia 2024-04-10 07:44:34 MDT
Hello Ole,

You already have almost everything sorted out, so my contribution will be minimal. If you want to know the complete list of job_desc fields, you can check them in the _get_job_req_field(...) function here [1].

I've created a basic code snippet for demonstration purposes that you can use as your base and also for some quick testing:

***********
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Verbose checks
    slurm.log_user("Values of the following parameters:")
    slurm.log_user("Min mem per node: %s (%s)", job_desc.min_mem_per_node, type(job_desc.min_mem_per_node))
    slurm.log_user("Min mem per cpu: %s (%s)", job_desc.min_mem_per_cpu, type(job_desc.min_mem_per_cpu))

    if job_desc.min_mem_per_node ~= nil and job_desc.min_mem_per_node < 300 then
        slurm.log_user("User requested less memory than 300M")
    end

    return slurm.SUCCESS
end
***********

Using this, you will quickly see how the values are populated. Here are some examples:

$ sbatch -n1 --mem=100M --wrap="srun hostname"
>>sbatch: Values of the following parameters:
>>sbatch: Min mem per node: 100.0 (number)
>>sbatch: Min mem per cpu: nil (nil)
>>sbatch: User requested less memory than 300M

$ sbatch -n1 --mem-per-cpu=100M --wrap="srun hostname"
>>sbatch: Values of the following parameters:
>>sbatch: Min mem per node: nil (nil)
>>sbatch: Min mem per cpu: 100.0 (number)

With some basic changes to my code (or yours) you will have this up and running in no time. Let me know if you have any other questions about this. If not, let me know as well so I can mark the ticket as closed.
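
Putting the pieces above together, here is a minimal sketch of the check described in the ticket: reject jobs in one partition unless they specify --mem or --mem-per-cpu at or above a minimum. The partition name "bigmem", the per-CPU limit, and the helper name check_memory_request are assumptions made for illustration; only the job_desc fields, slurm.log_user and the return codes come from the thread. The sketch also assumes job_desc.partition holds a single partition name (it can be nil when the user relies on the default partition).

```lua
-- Hypothetical limits in MB; adjust to the site's policy.
local BIGMEM_LIMITS = {
    partition = "bigmem",       -- hypothetical partition name
    min_mem_per_node = 256000,  -- value used in the ticket's example
    min_mem_per_cpu = 4000,     -- hypothetical value
}

-- Returns true when the request is acceptable,
-- otherwise false plus a human-readable reason.
function check_memory_request(job_desc, limits)
    -- Only police the one partition; everything else passes.
    if job_desc.partition ~= limits.partition then
        return true
    end
    local per_node = job_desc.min_mem_per_node
    local per_cpu = job_desc.min_mem_per_cpu
    -- Both fields are nil when the user gave neither option.
    if per_node == nil and per_cpu == nil then
        return false, "jobs must specify --mem or --mem-per-cpu"
    end
    if per_node ~= nil and per_node < limits.min_mem_per_node then
        return false, string.format("--mem must be at least %d MB",
                                    limits.min_mem_per_node)
    end
    if per_cpu ~= nil and per_cpu < limits.min_mem_per_cpu then
        return false, string.format("--mem-per-cpu must be at least %d MB",
                                    limits.min_mem_per_cpu)
    end
    return true
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    local ok, reason = check_memory_request(job_desc, BIGMEM_LIMITS)
    if not ok then
        -- reason is always a string here, so this is safe on Lua 5.1 too.
        slurm.log_user("%s", reason)
        return slurm.ESLURM_INVALID_TASK_MEMORY
    end
    return slurm.SUCCESS
end

-- job_submit.lua is expected to define this function as well.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

Keeping the decision in a plain function that only reads a table makes it possible to exercise the logic with an ordinary Lua interpreter before loading it into slurmctld.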

Best regards, Ricard.

[1] https://github.com/SchedMD/slurm/blob/master/src/plugins/job_submit/lua/job_submit_lua.c#L493
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2024-04-11 03:03:14 MDT
Hi Ricard,

Thanks for your quick reply:

(In reply to Ricard Zarco Badia from comment #1)
> I've created a basic code snippet for demonstration purposes that you can
> use as your base and also for some quick testing:
> 
> ***********
> function slurm_job_submit(job_desc, part_list, submit_uid)
>     -- Verbose checks
>     slurm.log_user("Values of the following parameters:")
>     slurm.log_user("Min mem per node: %s (%s)", job_desc.min_mem_per_node, type(job_desc.min_mem_per_node))
>     slurm.log_user("Min mem per cpu: %s (%s)", job_desc.min_mem_per_cpu, type(job_desc.min_mem_per_cpu))
> 
>     if job_desc.min_mem_per_node ~= nil and job_desc.min_mem_per_node < 300 then
>         slurm.log_user("User requested less memory than 300M")
>     end
> 
>     return slurm.SUCCESS
> end

With Slurm 23.02.7 we're getting errors from slurm.log_user() when an argument has a NIL value. When I add these lines to job_submit.lua:

    slurm.log_user("Values of the following parameters:")
    slurm.log_user("Min mem per node: %s (%s)", job_desc.min_mem_per_node, type(job_desc.min_mem_per_node))
    slurm.log_user("Min mem per cpu: %s (%s)", job_desc.min_mem_per_cpu, type(job_desc.min_mem_per_cpu))

we get in slurmctld.log:

[2024-04-11T10:39:04.930] error: job_submit/lua: /etc/slurm/job_submit.lua: [string "slurm.user_msg (string.format(unpack({...})))"]:1: bad argument #2 to 'format' (string expected, got nil)

Did you test on a newer Slurm version?  

I have to make checks before calling slurm.log_user(), like:

        if job_desc.min_mem_per_node ~= nil and job_desc.min_mem_per_node < min_mem_per_node then
                slurm.log_user("Big-memory partition %s requires jobs to specify a memory per node of at least %d MB",
                        job_desc.partition, min_mem_per_node)
                slurm.log_user("Your job requested %s MB", job_desc.min_mem_per_node)
                slurm.log_user("See the Wiki page %s", usage_page)
                return slurm.ESLURM_INVALID_TASK_MEMORY
        end

Can you comment on this handling of NIL values?

With appropriate checks, my job_submit.lua now seems to work as intended :-)
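
An alternative to guarding every call is to sanitize the arguments before they reach string.format. The following is only a sketch: safe_format is a hypothetical helper, not part of the Slurm Lua API. It converts every argument that is neither a number nor a string (nil, booleans, tables) with tostring(), so %s no longer raises an error on Lua 5.1, while numbers stay numbers so that %d keeps working.

```lua
-- Lua 5.1 exposes unpack as a global; Lua 5.2+ moved it to table.unpack.
local unpack = table.unpack or unpack

-- Hypothetical helper: format a message without tripping over nil arguments.
function safe_format(fmt, ...)
    -- select("#", ...) counts trailing nils, which #{...} would not.
    local n = select("#", ...)
    local args = { ... }
    for i = 1, n do
        local t = type(args[i])
        if t ~= "number" and t ~= "string" then
            args[i] = tostring(args[i])
        end
    end
    return string.format(fmt, unpack(args, 1, n))
end

-- Possible use inside job_submit.lua (the message is pre-formatted,
-- so log_user only ever sees string arguments):
--   slurm.log_user("%s", safe_format("Min mem per node: %s (%s)",
--       job_desc.min_mem_per_node, type(job_desc.min_mem_per_node)))
```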

Thanks,
Ole
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2024-04-11 03:12:32 MDT
Note added: We have this version of Lua on CentOS 7.9:

# rpm -q lua
lua-5.1.4-15.el7.x86_64
Comment 4 Ricard Zarco Badia 2024-04-11 04:43:32 MDT
Hello Ole,

I think the cause is your Lua version, which is a bit outdated. My tests were performed with Slurm 23.02.7 + Lua 5.3.6. I've compared the behavior of Lua 5.1.4 vs 5.3.6 and I think that we have a culprit:

>> Lua 5.3.6  Copyright (C) 1994-2020 Lua.org, PUC-Rio
>> > local var1 = nil
>> > print(string.format("Variable is %s", var1))
>> Variable is nil

vs

>> Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
>> > local var1 = nil
>> > print(string.format("Variable is %s", var1))
>> stdin:1: bad argument #2 to 'format' (string expected, got nil)
>> stack traceback:
>>         [C]: in function 'format'
>>         stdin:1: in main chunk
>>         [C]: ?

>> > var2 = nil
>> > print(string.format("Variable is %s", var2))
>> stdin:1: bad argument #2 to 'format' (string expected, got nil)
>> stack traceback:
>>         [C]: in function 'format'
>>         stdin:1: in main chunk
>>         [C]: ?

I would either upgrade to 5.3.X at least or just adapt your code to check your variables to make sure that they don't contain nil values before usage/formatting.
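
To find out which behavior the interpreter running the script actually has, the difference shown above can be probed at load time instead of relying on the installed package version. This is a small sketch; format_accepts_nil is a hypothetical helper name, while pcall, string.format and _VERSION are standard Lua.

```lua
-- Returns true when string.format("%s", nil) works on this interpreter
-- (as in the Lua 5.3.6 session above) and false when it raises an error
-- (as in the Lua 5.1.4 session above).
function format_accepts_nil()
    -- pcall returns a status followed by results; the parentheses
    -- keep only the boolean status.
    return (pcall(string.format, "%s", nil))
end

-- Possible use near the top of job_submit.lua:
--   slurm.log_info("job_submit.lua: %s, nil-safe format: %s",
--                  _VERSION, tostring(format_accepts_nil()))
```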

Best regards, Ricard.
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2024-04-11 06:10:42 MDT
Hi Ricard,

(In reply to Ricard Zarco Badia from comment #4)
> I think it is your Lua version, it is a bit outdated. My tests were
> performed with Slurm 23.02.7 + Lua 5.3.6. I've compared the behavior of Lua
> 5.1.4 vs 5.3.6 and I think that we have a culprit:
...
> I would either upgrade to 5.3.X at least or just adapt your code to check
> your variables to make sure that they don't contain nil values before
> usage/formatting.

Thanks for confirming that the old Lua 5.1.4 from CentOS 7 is causing this issue.

Unfortunately, there is no RPM package available for Lua 5.3.X on CentOS 7.  Even if we could find an RPM, installing it might potentially break other things in CentOS.

In the case of CentOS 7, it seems that we must make extra checks for NIL values in job_submit.lua.

When we later move our Slurm controller to an EL8 server, we will get Lua 5.3.4 by default, and the NIL values will be handled much better.

I think that the present issue is now well understood.  You're welcome to close this case now.

Best regards,
Ole
Comment 6 Ricard Zarco Badia 2024-04-11 06:52:24 MDT
Perfect, I will close the ticket then. Feel free to open it again if something else relevant to the matter comes up.

Best regards, Ricard.