We recently upgraded our controllers and compute nodes from 17.11.9 to 18.08.0 and found that jobs are not being passed the user's environment before they start. The main symptom is that `module` cannot be found, so the job fails before it can begin. I saw the new 18.08.1 release and tried that and 17.11.10, but the problem remains. I downgraded the compute nodes to 17.11.9-2 and the problem was resolved. The controllers remain at 18.08.0 for now; the problem appears to be in slurmd.

I believe this commit: https://github.com/SchedMD/slurm/commit/41fc6fcf768cc80921487849f82c650fb2e7e409#diff-93abf0cd6414d06aaae4d8aa807296deR2084 introduces a bug which stops --get-user-env from working. It replaces a print function with a concatenate function, and replaces the empty `name` variable with the pre-populated `stepd_path` variable, which leads to `env_loc` containing the path twice (I added some debugging to show it): `Oct 12 10:14:06 bber0501u09a.bb2.cluster slurmd[24708]: env_array_user_default: env_loc: /usr/sbin/slurmstepd/usr/sbin/slurmstepd getenv`

The solution is to stick with the original `name` variable, or to drop `%s` from the format string. Alternatively, just revert that commit, which fixes both 17.11.10 and 18.08.1.
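To illustrate the failure mode described above, here is a minimal standalone sketch in C. It is not Slurm's code: `fmtcat` is a hypothetical stand-in for an append-style formatter, and the three `env_loc_*` functions are names made up for this example. The sketch only assumes what the report states, namely that a formatted string is appended to a buffer that already holds the slurmstepd path while the same path is also passed for `%s`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical append-style formatter (NOT a Slurm function): formats the
 * arguments and appends the result to the heap string *dst (NULL = empty). */
void fmtcat(char **dst, const char *fmt, ...)
{
    va_list ap, ap2;
    assert(dst != NULL && fmt != NULL);

    va_start(ap, fmt);
    va_copy(ap2, ap);
    int n = vsnprintf(NULL, 0, fmt, ap);      /* measure the formatted length */
    va_end(ap);
    if (n < 0) { va_end(ap2); return; }

    /* Format into a temporary first: an argument may alias *dst, and
     * realloc() below could otherwise invalidate it. */
    char *tmp = malloc((size_t)n + 1);
    if (tmp == NULL) { va_end(ap2); return; }
    vsnprintf(tmp, (size_t)n + 1, fmt, ap2);
    va_end(ap2);

    size_t old = *dst ? strlen(*dst) : 0;
    char *grown = realloc(*dst, old + (size_t)n + 1);
    if (grown == NULL) { free(tmp); return; }
    memcpy(grown + old, tmp, (size_t)n + 1);  /* append, including the NUL */
    free(tmp);
    *dst = grown;
}

/* Buggy pattern: the destination already holds the path, and the path is
 * also passed for %s, so it appears twice. */
char *env_loc_buggy(const char *path)
{
    char *stepd_path = NULL;
    fmtcat(&stepd_path, "%s", path);               /* pre-populated buffer */
    fmtcat(&stepd_path, "%s getenv", stepd_path);  /* appends the path again */
    return stepd_path;
}

/* Fix 1: append into an empty destination (the role of `name` in the report). */
char *env_loc_fixed_empty_dest(const char *path)
{
    char *stepd_path = NULL, *name = NULL;
    fmtcat(&stepd_path, "%s", path);
    fmtcat(&name, "%s getenv", stepd_path);
    free(stepd_path);
    return name;
}

/* Fix 2: keep the pre-populated buffer but drop %s from the format string. */
char *env_loc_fixed_no_percent_s(const char *path)
{
    char *stepd_path = NULL;
    fmtcat(&stepd_path, "%s", path);
    fmtcat(&stepd_path, " getenv");
    return stepd_path;
}
```

Calling `env_loc_buggy("/usr/sbin/slurmstepd")` yields `/usr/sbin/slurmstepd/usr/sbin/slurmstepd getenv`, matching the debug log line, while both fixed variants yield `/usr/sbin/slurmstepd getenv`. The caller frees the returned string.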
Created attachment 8016: src/common/env.c patch

I'm not a developer, so I'm not sure if this is the best fix, but it certainly seems to be the simplest! Generated against 18.08.1-1. It also works for 17.11.10-1, but the line numbers are different.
Looks like this has been picked up and fixed here: https://github.com/SchedMD/slurm/commit/72b2355ca8d6f4381b4e417f76649712881f45b7

The NEWS file has been updated for 17.11, but the fix has not been released yet. I assume this will be added into 18.08 before release as well? I don't know if this is a duplicate of #5828; it's mentioned in the above commit but appears to be a private bug.