Bug 5847 - get_user_env fails to capture user env in 17.11.10 and 18.08
Status: CONFIRMED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 18.08.1
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2018-10-12 04:15 MDT by l.r.sudbery
Modified: 2018-10-15 13:59 MDT

Attachments
src/common/env.c patch (565 bytes, patch)
2018-10-12 10:03 MDT, l.r.sudbery

Description l.r.sudbery 2018-10-12 04:15:22 MDT
We recently upgraded our controllers and compute nodes from 17.11.9 to 18.08.0, and found that jobs are not being passed the user's environment before they start (which leads primarily to not being able to find `module`, and therefore to the job failing before it can begin).

I saw the new release of 18.08.1 and tried that and 17.11.10, but the problem remains. I downgraded the compute nodes to 17.11.9-2 and the problem was resolved. The controllers remain at 18.08.0 for now; the problem appears to be in slurmd.

I believe this commit: https://github.com/SchedMD/slurm/commit/41fc6fcf768cc80921487849f82c650fb2e7e409#diff-93abf0cd6414d06aaae4d8aa807296deR2084 introduces a bug which stops --get-user-env from working.

It replaces a print function with a concatenate function, and replaces the empty `name` variable with the pre-populated `stepd_path` variable. Because the formatted text is now appended to a string that already holds the path, `env_loc` ends up containing the path twice (I added some debugging to show it):

`Oct 12 10:14:06 bber0501u09a.bb2.cluster slurmd[24708]: env_array_user_default: env_loc: /usr/sbin/slurmstepd/usr/sbin/slurmstepd getenv`

The solution is to stick with the original `name` variable, or to drop `%s` from the format string. Alternatively, just revert that commit, which fixes both 17.11.10 and 18.08.1.
Comment 1 l.r.sudbery 2018-10-12 10:03:52 MDT
Created attachment 8016
src/common/env.c patch

I'm not a developer, so I'm not sure if this is the best fix, but it certainly seems to be the simplest! Generated against 18.08.1-1; it also works for 17.11.10-1, but the line numbers are different.
Comment 2 l.r.sudbery 2018-10-15 13:59:39 MDT
Looks like this has been picked up and fixed here:

https://github.com/SchedMD/slurm/commit/72b2355ca8d6f4381b4e417f76649712881f45b7

The NEWS file has been updated for 17.11, but the fix has not been released yet. I assume this will be added to 18.08 before the next release as well?


I don't know if this is a duplicate of #5828; that bug is mentioned in the above commit but appears to be private.