Ticket 8895 - Slurm job output to non-existent directory result into silent job failure
Summary: Slurm job output to non-existent directory result into silent job failure
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.4
Hardware: Linux
OS: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-04-17 16:05 MDT by BBP Administrator
Modified: 2020-05-27 03:09 MDT

See Also:
Site: EPFL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description BBP Administrator 2020-04-17 16:05:09 MDT
Dear Support Team,

I think this is similar to Bug 3508.

We are running complicated Snakemake workflows with more than 20k jobs. We would like to write the Slurm logs into specific directories, for example --error="logs/%N/slurm_%A.out".

But currently, if the "logs/%N" directory doesn't exist, the Slurm job just fails silently. This was very confusing to figure out, as no error message is given.

Is there any way to tell Slurm to create the directories first and then create the file, instead of failing?

Thank you!
Comment 1 Felip Moll 2020-04-22 11:54:45 MDT
(In reply to BBP Administrator from comment #0)
> Dear Support Team,
> 
> I think this is similar to Bug 3508.
> 
> We are running complicated Snakemake workflows with more than 20k jobs. We
> would like to write the Slurm logs into specific directories, for example
> --error="logs/%N/slurm_%A.out".
> 
> But currently, if the "logs/%N" directory doesn't exist, the Slurm job just
> fails silently. This was very confusing to figure out, as no error message
> is given.
> 
> Is there any way to tell Slurm to create the directories first and then
> create the file, instead of failing?
> 
> Thank you!

Hi,

At the moment this is not supported. When launching tasks, slurmstepd first generates the output file path, resolving %j, %N and the other patterns, e.g.:

main -> _step_setup -> mgr_launch_batch_job_setup -> batch_stepd_step_rec_create -> _batchfilename -> fname_create

Afterwards, it sets up the I/O for the tasks. It does this in mgr.c:_fork_all_tasks(). If you follow that path you will see, for example, that _setup_normal_io() calls io_create_local_client(). This function opens a file per task, or a single one for all tasks, and so on. It also sets up the flags (io_get_file_flags()) for the open() syscall that follows, which in all cases includes the O_CREAT flag.

This is the mechanism by which output/error files are created: an open() syscall with the O_CREAT flag. There is no logic at the moment to create missing directories.
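
To illustrate the failure mode, here is a minimal standalone sketch (not Slurm source; the path and flag combination are assumptions for the example). open() with O_CREAT only creates the final file component, never missing parent directories, so it fails with ENOENT when "logs/%N" has not been created beforehand:

/* Illustrative sketch only, not Slurm code: O_CREAT creates the file,
 * but not its missing parent directories. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical path after %N/%A resolution (e.g. by fname_create). */
    const char *path = "logs/node001/slurm_1234.out";

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        fprintf(stderr, "Could not open stdout file %s: %s\n",
                path, strerror(errno));
    return 0;
}

Run in a directory where logs/node001 does not exist, this prints the same "No such file or directory" error that ends up in the slurmd log.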


One idea that comes to my mind is this one:

The _fork_all_tasks() call is preceded by general initialization in slurmstepd. One of these steps is the spank_init() call.
The idea is to create a SPANK plugin which reads the job parameters, checks whether the directory exists, and creates it if not.

The slurm_spank_init call would be the one we are looking for (https://slurm.schedmd.com/spank.html), because the next call, spank_user() (slurm_spank_user_init), is already too late:

slurm_spank_init -> Called just after plugins are loaded. In remote context, this is just after job step is initialized. This function is called before any plugin option processing. 

Unfortunately, after coding a test SPANK plugin, I see we don't have the stderr/stdout information in SPANK context.
There may be very ugly workarounds based on SPANK plugins, but they don't even deserve to be mentioned.
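
For reference, this is roughly what such a plugin would look like (a minimal sketch assuming the documented spank.h API; the plugin name and the directory are hypothetical). It also shows the limitation described above: no spank item exposes the resolved --output/--error paths in remote context, so the directory to create cannot be derived from the job parameters here:

/* Minimal SPANK plugin skeleton (illustrative sketch, not a working solution).
 * slurm_spank_init() runs in remote context just after plugins are loaded,
 * i.e. early enough to create a directory before slurmstepd opens the
 * output files. */
#include <stdint.h>
#include <sys/stat.h>
#include <slurm/spank.h>

SPANK_PLUGIN(mkdir_logs, 1);

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    uint32_t job_id;

    if (!spank_remote(sp))      /* only act inside slurmstepd */
        return ESPANK_SUCCESS;

    if (spank_get_item(sp, S_JOB_ID, &job_id) != ESPANK_SUCCESS)
        return ESPANK_SUCCESS;

    /* Hypothetical fixed directory: the real stdout/stderr paths are not
     * available as spank items, which is exactly the missing piece. */
    (void) mkdir("/path/to/logs", 0755);

    return ESPANK_SUCCESS;
}

If the missing spank items were added (the enhancement mentioned below), the plugin could create the actual per-job directory instead of a fixed path.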


Having said that, I can only answer with a no: it is not possible right now. If you are really interested, this could be marked as an enhancement, but I am not sure we should deal with it. It would probably be best served by a SPANK plugin, and the enhancement could be to provide the missing spank items in S_CTX_REMOTE.


Does it make sense?
Comment 4 Felip Moll 2020-04-22 13:14:05 MDT
> Probably it would be
> best served as a spank plugin and the enhancement could be to provide the
> missing spank_items in S_CTX_REMOTE.

I want to clarify that this idea would need to be approved on our side after a discussion, so I cannot guarantee 100% that an enhancement for this bug will be accepted.
Comment 5 Felip Moll 2020-04-23 06:01:39 MDT
Yet another clarifying note.

When the directory or file cannot be opened, the job fails silently from the user's point of view, but there is an error in the slurmd log file referring to the particular job step, similar to this one:

[2020-04-22T19:47:55.124] [72.batch] error: Could not open stderr file /tmp/a/a.err: No such file or directory
Comment 9 Felip Moll 2020-05-12 11:31:46 MDT
Hi,

I checked with the team and this was indeed an enhancement request, but unfortunately there are no plans to add this feature in the near future.

You will need to use a different approach to ensure the directories are created before running the job.

Therefore, I am closing the issue as infogiven.

Thanks for your question and don't hesitate to open new bugs if other issues arise.
Comment 10 BBP Administrator 2020-05-25 14:33:34 MDT
Dear Felip, 

Sorry for the delay in responding.

Thank you for the detailed answer. We will use another approach (e.g. pre-creating the directories).

By the way, regarding the message below:

> but there is an error in the slurmd log file referring to the particular job step, similar to this one:
> [2020-04-22T19:47:55.124] [72.batch] error: Could not open stderr file /tmp/a/a.err: No such file or directory

The end users don't see this message, right? (They typically just check their working directory.)
Comment 11 Felip Moll 2020-05-27 03:09:42 MDT
(In reply to BBP Administrator from comment #10)
> Dear Felip, 
> 
> Sorry for the delay in responding.
> 
> Thank you for the detailed answer. We will use another approach (e.g.
> pre-creating the directories).
> 
> By the way, regarding the message below:
> 
> > but there is an error in the slurmd log file referring to the particular job step, similar to this one:
> > [2020-04-22T19:47:55.124] [72.batch] error: Could not open stderr file /tmp/a/a.err: No such file or directory
> 
> The end users don't see this message, right? (They typically just check
> their working directory.)

This is in the slurmd log, so end users shouldn't see this message. They could guess what the issue is by doing an 'scontrol show job' and checking the Std*= fields, which would point to a nonexistent directory or file, together with JobState=FAILED Reason=NonZeroExitCode.

If they instead use 'srun', they will see "srun: error: Could not open stdout file: Permission denied" or "srun: error: Could not open stdout file: Not a directory". Since sbatch runs asynchronously and the job is allocated from the controller, it doesn't show this error directly on the user's terminal at submission time.