Ticket 2265 - sacct doesn't escape delimiter and other special characters
Summary: sacct doesn't escape delimiter and other special characters
Status: CONFIRMED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 14.11.10
Hardware: Linux
Priority: ---
Severity: 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL: http://kit.edu
Duplicates: 3375 4086 7218 10853
Depends on:
Blocks:
 
Reported: 2015-12-16 03:45 MST by Martins Innus
Modified: 2021-02-15 01:59 MST
CC List: 9 users

See Also:
Site: University of Buffalo (SUNY)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: RHEL
Machine Name: fh2-login1.scc.kit.edu
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
strip newlines for user-provided strings, remove delimiter as well (10.71 KB, patch)
2015-12-21 04:17 MST, Tim Wickberg
Details | Diff
Add --escape-delimiter option to sacct (6.00 KB, patch)
2020-11-30 08:54 MST, Yair Yarom
Details | Diff

Description Martins Innus 2015-12-16 03:45:49 MST
User jobs may sometimes have special characters in their job names, usually as a result of a typo or "fat finger" mistake.

The first issue is a jobname that contains the same character used as the delimiter. This makes the "parsable" output of sacct impossible to parse correctly. For example, in a job script:

#SBATCH --job-name="DS-||Test"

This is likely a typo, judging by the user's other jobs, but an sacct --parsable2 command returns the following:

4773753|ub-hpc|general-compute|mae609f15|mae609f15|89200004|kgvansly|367297|1449984227|1449984227|1449984297|1449984360|0:0|COMPLETED|1|4|4|cpn-k16-41-01|DS-||Test|18:00:00

Note the consecutive pipes inside what should be the jobname field. Changing the delimiter likely won't help, since a user may put any character in the jobname.

The second issue is newlines.  If there is an inadvertent newline in a jobname, the sacct output for a single job will span multiple lines.
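For illustration, the field-count mismatch can be reproduced outside of Slurm by splitting a record on the delimiter. A minimal Python sketch (the header and record below are abbreviated from the example output above, not verbatim sacct fields):

```python
# Splitting sacct --parsable2 output on "|" miscounts fields when the
# jobname itself contains the delimiter. Abbreviated example record.
header = "JobID|NodeList|JobName|Timelimit"
record = "4773753|cpn-k16-41-01|DS-||Test|18:00:00"

n_expected = len(header.split("|"))  # 4 fields declared by the header
n_actual = len(record.split("|"))    # 6 fields in the data row

print(n_expected, n_actual)  # the record no longer lines up: 4 vs 6
```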
Comment 2 Tim Wickberg 2015-12-17 06:13:42 MST
We're looking into how best to handle this and some related output formatting issues, although the fix is more invasive than we'd want to put into 15.08. I might have a relatively simple patch you could use in the meantime if you don't want to wait for a more extensive fix in 16.05.

If your immediate concern is just sacct, I'd suggest looking at the --delimiter option. Admittedly your users may pick almost anything, which would still cause problems, but something less common like ! may be easier in the meantime. It also allows for multiple characters: #### is just as valid as |. At present it would also permit UTF8 wide characters like ☃ (Unicode Snowman, U+2603; it may or may not render in your email client) to be used.

Changing the delimiter doesn't address errant newlines, although I'm not sure how common that is for your system at present?

- Tim
Comment 3 Martins Innus 2015-12-17 06:34:09 MST
OK, thanks.  I'd be interested in the patch.  In the meantime we are looking into using multiple characters.

For the new lines, just escaping them would be fine for our purposes.  We see newlines a couple times a month.  From looking at the jobs and logs it seems like a user may do something like (on the command line):

sbatch --jobname="foo
>bar" --other-slurm-flags....

That is, hit an inadvertent newline while in a quoted string, get the shell continuation prompt, close the quote, and keep typing the command.
Comment 4 Tim Wickberg 2015-12-17 07:12:55 MST
I'll see what I can get you in the next day or so. The patch may differ from what we do in 16.05, but would work in the meantime.

The strategy I'm looking at taking is to (a) swap any '\n' for ' ' in strings, and (b) swap the delimiter string (if found, and as many times as found) for the same number of spaces. That also avoids some overhead with C string manipulation by keeping the same array throughout rather than cutting and splicing it back together.
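The replacement strategy described above can be sketched in Python (the actual patch works on C char arrays inside Slurm; `sanitize` here is a hypothetical helper for illustration only):

```python
def sanitize(value: str, delimiter: str) -> str:
    """Replace newlines with spaces and each occurrence of the
    delimiter string with an equal number of spaces, so the field's
    length is preserved (mirroring the in-place C approach)."""
    value = value.replace("\n", " ")
    return value.replace(delimiter, " " * len(delimiter))

print(repr(sanitize("DS-||Test", "|")))    # 'DS-  Test' (two spaces)
print(repr(sanitize("foo\nbar", "####")))  # 'foo bar'
```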

We could look at trying to escape values with \ or similar, but that gets messy fast and I assume most people are parsing the output with `cut` or `awk` or similar which wouldn't necessarily respect the escaped characters anyways. It looks like XDMoD at least probably wouldn't parse the escaped values any better than unescaped? (I'm guessing from the uncommon --parsable2 flag mentioned that may be what you're feeding this into? That, and the fact that your group develops it from what I recall.)

If you're sitting near those folks (or are those folks?) there's an internal feature request to add a JSON output format to some commands which may be of interest to them. It might show up in 16.05, although it's a low priority as no one is sponsoring that work at present.

For 16.05 the strategy may shift to rejecting jobs with bad characters in the various fields (job name is but one of several that are affected - comment, stdout/stderr and a few other fields could also have junk in them) rather than trying to clean up the output later on.
Comment 5 Martins Innus 2015-12-17 12:40:34 MST
Yeah one of the uses for this is for XDMoD. The other is just some general auditing of jobs. Either escaping or replacing is fine. The output is parsed by either php or Python depending on the code path so we could handle either change.

JSON would be perfect, but understandable that it's low priority for you.

Comment 6 Tim Wickberg 2015-12-21 04:17:14 MST
Created attachment 2530 [details]
strip newlines for user-provided strings, remove delimiter as well

Attached changes the output for sacct/sacctmgr/sreport/sshare/sstat to replace newlines with spaces and remove delimiter character(s) entirely in user-provided strings.

Note this is against 15.08, and does not look like it'll apply back to 14.11 without some adjustment.

I'm still looking into how best to approach this in 16.05, we'll likely add some additional checks up-front and potentially a new flag to enable this, but may not commit anything that changes the output. (This also helps keep the other commands - squeue in particular - free from bad output, as it uses a different set of functions for formatting.)
Comment 7 Martins Innus 2015-12-22 04:46:50 MST
OK, thanks.  We are going to upgrade to 15.08 in a few weeks so I'll not try to backport this to 14.11.

As you look at solutions for 16.05, I'll note that we've also seen what seems like arbitrary binary (non-UTF-8) data in the jobname field. Not sure where that comes from; maybe an errant copy/paste by the user. This is easier for us to deal with, because we know any data like that is bad and we just dump it during processing.

Thanks
Comment 8 Tim Wickberg 2015-12-22 05:06:35 MST

Okay, good to hear.

If it's alright with you I'm going to update this bug to reclassify it 
as a Sev5 enhancement for string sanitization in user submissions, with 
a target release for 16.05.

> As you look at solutions for 16.05, I'll note that we've also seen what seems
> like arbitrary binary (non utf) data in the job name field.  Not sure where
> that comes from, maybe an errant copy/paste by the user.  This is easier for us
> to deal with because we know any data like that is bad and we just dump it
> during processing.

That'll come in as part of the input string sanitization. At present any 
user-provided strings are copied into char[]'s and printed out as-is. 
The sanitization step will reject jobs with any of the ASCII control 
characters (ASCII 0-31 and 127, which aren't used in UTF8) to avoid this 
and force users to fix their mangled submissions appropriately.

Deciding what are legitimate UTF8 strings isn't something we want to 
tackle. You may still get some oddness, especially if people throw the 
text-direction characters in, but I'm guessing that's less likely.
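The control-character rejection described above can be sketched in Python (a hypothetical check for illustration; the real validation would live in Slurm's C submission path):

```python
def has_control_chars(data: bytes) -> bool:
    """True if the byte string contains ASCII control characters
    (0-31 or 127). UTF-8 multi-byte sequences only use bytes >= 128,
    so a byte-level check cannot misfire on valid multi-byte text."""
    return any(b < 32 or b == 127 for b in data)

print(has_control_chars("normal_job_name".encode()))  # False
print(has_control_chars("foo\nbar".encode()))         # True
print(has_control_chars("snowman-\u2603".encode()))   # False
```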
Comment 9 Tim Wickberg 2016-01-06 06:24:06 MST
Marking as Sev5 / enhancement to provide for further sanitization of user-provided strings.
Comment 10 Tim Wickberg 2017-01-04 09:38:42 MST
*** Ticket 3375 has been marked as a duplicate of this ticket. ***
Comment 11 Tim Wickberg 2017-10-01 23:55:30 MDT
*** Ticket 4086 has been marked as a duplicate of this ticket. ***
Comment 12 nathanael.huebbe 2018-08-10 09:52:24 MDT
(In reply to Tim Wickberg from comment #11)
> *** Bug 4086 has been marked as a duplicate of this bug. ***

I would like to ask what the current status of the input sanitization is. This bug has been open for a long time and, as I pointed out in the bug report that I opened myself, it is a significant security risk. It is next to impossible for job-epilog script writers to harden their scripts against the broken input they get from `scontrol`.

I have taken a look at the current code, but could not find any sanitization - did I overlook something?
Comment 13 Jason Booth 2019-06-17 15:04:32 MDT
*** Ticket 7218 has been marked as a duplicate of this ticket. ***
Comment 14 Jason Booth 2019-08-22 10:31:25 MDT
*** Ticket 7218 has been marked as a duplicate of this ticket. ***
Comment 15 Brigitte May 2020-02-21 03:19:20 MST
Hi,

  we have this problem as well. A jobname like st:F6|J causes a shift of the "ResvCPU" and "ResvCPURAW" Slurm fields. The output of, for example, "sacct --allocations --parsable2 --units=M --format=All -j 1698989" shows 105 fields in the header but 106 fields in the output. When can we expect a solution to this problem?

Kind regards
Brigitte May

Karlsruher Institut für Technologie (KIT) 
Steinbuch Centre for Computing (SCC)

Comment 16 Wirawan Purwanto 2020-09-17 08:52:31 MDT
Another place where this causes an issue is the "Constraints" field, which may also contain the "|" character (e.g. "mem16G|mem32G"). My processing tool also choked today due to this issue. This output was produced using SLURM 20.02.

Wirawan
Comment 17 Wirawan Purwanto 2020-09-17 21:00:51 MDT
I consider this issue easy to fix if we're willing to be a little dirty.
Why don't we do the following to avoid this malady:
replace "bad" characters with their C-like escapes, e.g. \xNN, where
NN is the hexadecimal ASCII code of the character?
In this case, we would replace:

    "\"     =>  \x5c
    "|"     =>  \x7c
    newline =>  \x0a

Ugly? Yes.
Works? Yes.
The approach above is amenable to machine parsing and preserves the data.
After all, those characters should not appear in SLURM job names
in the first place.
Other "bad" characters with ASCII codes lower than 32 can also
be similarly encoded.
(Optionally, if wanted, those higher than 127 can be encoded
in the same way.)
I propose that the fix be done in `common/print_fields.c`, in the
functions `print_fields_str` and `print_fields_char_list`.
I believe these are the only vulnerable points.
We would add a function that scans the input string (before assigning the
value to the `print_this` pointer) for the "forbidden" characters, and if
any are found, make a copy of the string with the forbidden characters
substituted by that long escape.
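The proposed \xNN scheme can be sketched in Python (an illustration of the proposal only, not the C implementation; `escape_field` is a made-up name):

```python
def escape_field(s: str, delimiter: str = "|") -> str:
    r"""Replace backslash, the delimiter, and ASCII control characters
    with \xNN escapes, per the proposal above. Escaping '\' itself
    keeps the encoding unambiguous and reversible."""
    out = []
    for ch in s:
        if ch == "\\" or ch == delimiter or ord(ch) < 32:
            out.append("\\x%02x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(escape_field("DS-||Test"))  # DS-\x7c\x7cTest
print(escape_field("foo\nbar"))   # foo\x0abar
print(escape_field("a\\b"))       # a\x5cb
```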


Please comment whether this is a reasonable approach, before it is implemented.
Comment 18 Tim Wickberg 2020-09-18 10:03:22 MDT
(In reply to Wirawan Purwanto from comment #17)
> I consider this issue easy to fix if we're willing to be a little dirty.

To head off the rest of the discussion here: I'm not willing to "be a little dirty", and won't accept any patches based on what you've described.
Comment 19 lance.wilson 2020-10-15 23:30:20 MDT
This is still quite an issue for our centre, as we keep finding new and unique ways in which our friendly researchers can add characters that prevent reporting from being carried out.

I'd like to suggest URL-encoding (percent-encoding, as implemented by Python's urllib, for example) for this, as it encodes all special characters into a sane format that can be parsed. There's a nice industry standard behind it as well.

Thoughts?
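For reference, percent-encoding in Python's standard library round-trips these problem characters losslessly (a quick illustration of the suggestion, not sacct behavior):

```python
from urllib.parse import quote, unquote

name = "DS-||Test\nfoo"
encoded = quote(name, safe="")  # percent-encode every reserved character
print(encoded)                  # DS-%7C%7CTest%0Afoo

# The encoding is reversible, so no data is lost.
assert unquote(encoded) == name
```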
Comment 20 Dave 2020-10-20 13:33:38 MDT
Just giving a +1 here, this is affecting our metrics as well. 

Sanitizing the delimiter by replacing it with spaces sounds fine to me; is there any reason why this hasn't made it into a release yet?
Comment 22 Yair Yarom 2020-11-30 08:54:16 MST
Created attachment 16862 [details]
Add --escape-delimiter option to sacct


Hi,

I only bothered to search for this bug after I wrote a patch... So here's my patch, if anyone wants it. It's based on the current master (2020-11-30), but we currently have 19.05 running here, so it's only been tested on 19.05.

Default sacct behavior is preserved; it only changes when the --escape-delimiter flag is added, in which case a '\' is inserted before any other '\' and before the delimiter (a single '\' is added even if the delimiter is more than one character). This patch doesn't take care of newlines or any other special characters.

I'm not sure if this is still relevant, as an update to the parsing scripts will also be required to take the escapes into account. And I think the future is probably the REST API for these sorts of things (not relevant for me yet, as we're still on 19.05).
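The backslash-escaping scheme described for the patch can be sketched, together with a matching parser, in Python (a hypothetical illustration of the behavior for a single-character delimiter; not the patch code itself):

```python
def escape(value: str, delimiter: str) -> str:
    """Backslash-escape '\\' and the delimiter, mirroring the described
    --escape-delimiter behavior (a sketch, not the patch itself)."""
    value = value.replace("\\", "\\\\")
    return value.replace(delimiter, "\\" + delimiter)

def split_escaped(line: str, delimiter: str) -> list:
    """Split on the delimiter while honoring backslash escapes."""
    fields, current, i = [], [], 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # take the escaped char literally
            i += 2
        elif line.startswith(delimiter, i):
            fields.append("".join(current))
            current = []
            i += len(delimiter)
        else:
            current.append(line[i])
            i += 1
    fields.append("".join(current))
    return fields

row = ["4773753", "DS-||Test", "18:00:00"]
line = "|".join(escape(f, "|") for f in row)
print(line)                      # 4773753|DS-\|\|Test|18:00:00
print(split_escaped(line, "|"))  # ['4773753', 'DS-||Test', '18:00:00']
```

A consumer that understands the escapes recovers the original fields exactly, which is what plain `cut`/`awk` pipelines cannot do today.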
Comment 23 Marcin Stolarek 2021-02-15 01:59:11 MST
*** Ticket 10853 has been marked as a duplicate of this ticket. ***