Bug 1124 - Frequent recent job environment issues
Summary: Frequent recent job environment issues
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.03.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-09-27 03:47 MDT by Josko Plazonic
Modified: 2014-09-29 04:40 MDT
CC List: 2 users

See Also:
Site: Princeton (PICSciE)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Josko Plazonic 2014-09-27 03:47:25 MDT
In the last few days a couple of users have reported issues with their jobs where the environment-modules "module" command was suddenly not recognized, e.g.:

+ module load openmpi/gcc/1.6.3/64
/var/spool/slurmd/job240445/slurm_script: line 12: module: command not found

This is new; it only started happening in the last few days (as I said, two users affected so far that we know of). I looked through the job*/environment files on the slurmctld server (hence for queued jobs) and they all look good, i.e. they contain something resembling:

module () 
{ 
    eval `/usr/bin/modulecmd bash $*`
}

in them. As you are surely aware, this function is supposed to be inherited through the shell's environment at job submission. It is unlikely that the users' shell environments were broken: they claim not to have done anything unusual, and while some 800+ jobs of one user failed, jobs that same user submitted 45 seconds later ran just fine.
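For what it's worth, my scan of the saved environments amounted to something like the following (the hash.*/job.* layout under StateSaveLocation is a guess on my part; adjust the path to whatever your slurm.conf specifies):

# Flag any queued job whose saved environment lacks the exported
# module function; path assumes StateSaveLocation=/var/spool/slurmctld.
for f in /var/spool/slurmctld/hash.*/job.*/environment; do
    grep -aq 'modulecmd' "$f" || echo "module function missing: $f"
done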

We did recently update to bash-4.1.2-15.el6_5.2.x86_64, i.e. the package with all fixes for the recent bash (Shellshock) vulnerability. The coincidence is rather suspicious, as we had no reports of this problem before that update.

Note that it does not always happen; on the contrary, all the tests I ran myself came out fine.

Any suggestions on how to fix it, or failing that, how to narrow it down? Or workarounds? E.g. if we had a way to ensure that every job's environment had parsed /etc/profile.d/environment.{sh,csh} at run time, that would likely be a good enough workaround, but it is not clear whether, say, TaskProlog can be used for that, or whether there is another way (see the sketch below).
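For illustration, the sort of batch-script-side guard I have in mind (just a sketch, assuming our /etc/profile.d/environment.sh is what defines the module function at login):

#!/bin/bash
#SBATCH --nodes=1

# Redefine "module" defensively in case the exported function did not
# survive the submission environment.
if ! type module >/dev/null 2>&1; then
    . /etc/profile.d/environment.sh
fi

module load openmpi/gcc/1.6.3/64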

Thanks!
Comment 1 Josko Plazonic 2014-09-27 12:59:15 MDT
OK, I think I know what went on, and Slurm looks to be just an unfortunate victim. According to this:
http://fedoramagazine.org/shellshock-how-does-it-actually-work/
exported functions (like module in our case) are now prefixed with BASH_FUNC_ and have () appended at the end.

So if a user submitted a job from an old shell (including the case where they had been logged in since before the shell upgrade), the function is exported under the old encoding and will not be picked up by the (new version of) bash launched by Slurm.
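The mismatch is easy to reproduce by hand (a sketch; the exact BASH_FUNC_ encoding varies with the bash patch level):

# Export a function the way /etc/profile.d/environment.sh does:
module () { eval "$(/usr/bin/modulecmd bash "$@")"; }
export -f module

# An unpatched bash stores it in the environment under its plain name:
#   module=() {  eval `/usr/bin/modulecmd bash $*`...
# A patched bash encodes it instead as something like:
#   BASH_FUNC_module()=() {  eval ...
env | grep -a module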

It is therefore just a temporary glitch that clears once users log out, log back in, and submit new jobs.

So probably nothing for you guys to do. I'd close this, but I'm waiting on user confirmation.
Comment 2 Josko Plazonic 2014-09-29 04:03:44 MDT
OK, it looks like it was entirely due to the bash update. We have instructed users to log in anew and resubmit their jobs, and so far that seems to have fixed the problem.
Comment 3 David Bigagli 2014-09-29 04:40:31 MDT
Thanks Josko, this is very good to know.

David