Bug 4181 - Upgrading from 17.02.1-2 to 17.02.7
Summary: Upgrading from 17.02.1-2 to 17.02.7
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.1
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Brian Christiansen
Reported: 2017-09-22 09:54 MDT by NYU HPC Team
Modified: 2017-09-27 10:47 MDT (History)
Site: NYU


Description NYU HPC Team 2017-09-22 09:54:31 MDT
Hi Slurm Experts:

We are planning an upgrade from 17.02.1-2 to 17.02.7 in one week. Please help to answer some questions. 

1/
Are there MySQL DB schema changes between these versions? If yes, could you advise how to safely upgrade slurmdbd?

2/
Our StateSaveLocation is currently /tmp. We want to move to /opt/slurm/data/statesave. How can we safely do that during the upgrade downtime? 

3/ 
Is there anything else we need to pay special attention to?


Thank you very much!
Comment 1 Tim Wickberg 2017-09-22 17:21:10 MDT
(In reply to NYU HPC Team from comment #0)
> Hi Slurm Experts:
> 
> We are planning an upgrade from 17.02.1-2 to 17.02.7 in one week. Please
> help to answer some questions. 
> 
> 1/
> Are there MySQL DB schema changes between these versions? If yes, could you
> advise how to safely upgrade slurmdbd?

No. Schema changes only happen on a major release, not within the maintenance releases. The next schema change will therefore come with 17.11.0.
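
The rule above can be turned into a quick pre-upgrade check. The following is a minimal sketch, assuming the major release is the first two dot-separated fields of the version string (17.02, 17.11); the function names slurm_major and crosses_major_release are hypothetical helpers, not part of Slurm.

```shell
# Hypothetical helper: print the major release ("17.02") of a full
# Slurm version string such as "17.02.1-2" or "17.02.7".
slurm_major() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Hypothetical helper: succeed (return 0) only when the two versions
# belong to different major releases, i.e. when a schema change is possible.
crosses_major_release() {
    [ "$(slurm_major "$1")" != "$(slurm_major "$2")" ]
}

# Versions taken from this report.
if crosses_major_release 17.02.1-2 17.02.7; then
    echo "major release upgrade: plan for a slurmdbd schema conversion"
else
    echo "maintenance release upgrade: no schema change expected"
fi
```

The comparison is purely textual, so it only tells you whether the two versions share a major release, not which one is newer.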

> 2/
> Our StateSaveLocation is currently /tmp. We want to move to
> /opt/slurm/data/statesave. How can we safely do that during the upgrade
> downtime? 

Relocate the various files into the new location and you should be fine.
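
As a rough illustration of the relocation, here is a minimal sketch that copies named state files from the old directory to the new one. It assumes slurmctld is stopped while it runs and that StateSaveLocation in slurm.conf is updated before the daemon is restarted. The function name relocate_state_files is hypothetical, and the file names in the example call are assumptions; check which files slurmctld actually wrote on your system. Because the old location here is /tmp, the sketch copies only the files you name rather than the whole directory.

```shell
# Hypothetical helper: copy the named entries from the old
# StateSaveLocation into the new one, preserving mode and timestamps.
# Usage: relocate_state_files OLD_DIR NEW_DIR NAME...
relocate_state_files() {
    old_dir=$1
    new_dir=$2
    shift 2
    mkdir -p "$new_dir" || return 1
    for name in "$@"; do
        # Skip entries that do not exist in the old location.
        [ -e "$old_dir/$name" ] || continue
        cp -a "$old_dir/$name" "$new_dir/" || return 1
    done
    return 0
}

# Example call (paths from this report; file names are assumptions):
#   relocate_state_files /tmp /opt/slurm/data/statesave \
#       job_state node_state part_state
```

Copying rather than moving leaves the old files in place as a fallback until the controller has started cleanly from the new location. The new directory also needs to be writable by the user slurmctld runs as.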

> 3/ 
> Is there anything else we need to pay special attention to?

If you have any SPANK plugins, such as the x11 plugin, they'll need to be rebuilt against the new release.

Aside from that minor caveat, upgrading to different maintenance releases should be straightforward and relatively painless.
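
To see which SPANK plugins are configured and therefore need a rebuild, one option is to read them out of plugstack.conf. This is a minimal sketch, assuming the usual line format of "required" or "optional" followed by the plugin path; the function name list_spank_plugins is hypothetical, the config path in the usage comment is an assumption, and "include" lines are not followed.

```shell
# Hypothetical helper: print the plugin path (second field) of every
# "required" or "optional" line in a plugstack.conf-style file.
# Comment lines and "include" lines are ignored.
list_spank_plugins() {
    awk '$1 == "required" || $1 == "optional" { print $2 }' "$1"
}

# Example call (the path is an assumption; adjust to your installation):
#   list_spank_plugins /etc/slurm/plugstack.conf
```

Paths printed without a leading slash are relative, so they would need to be looked up in the Slurm plugin directory before rebuilding.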
Comment 2 NYU HPC Team 2017-09-23 08:21:40 MDT
Thank you, Tim. Yes, I rebuilt the slurm-spank-x11 RPM. With either the new RPM or the previous one built against the older Slurm, I see the message below when running the 'xterm' command, but the command runs fine in both cases.

$ srun --x11 --pty /bin/bash
[deng@c26-04 ~]$ xterm
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded: ignored.

However, when I run an interactive R session and plot a histogram, I do not see the message above.
Comment 3 Brian Christiansen 2017-09-25 21:51:53 MDT
Are you doing checkpoint/restart (BLCR)? 
Do you by chance have LD_PRELOAD defined in your environment?

Does this help:
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#preload

Are you able to ssh (-X) directly to the node and run an xterm? If so, do you get the same error?
Comment 4 NYU HPC Team 2017-09-26 07:34:27 MDT
Yes, we are starting to use BLCR in Slurm.

The library files are in the standard directory:
$ ldconfig -p | grep libcr_run
	libcr_run.so.0 (libc6,x86-64) => /lib64/libcr_run.so.0
	libcr_run.so (libc6,x86-64) => /lib64/libcr_run.so
$ ls -l /lib64/libcr_run.so*
lrwxrwxrwx 1 root root    18 Jun  6 17:02 /lib64/libcr_run.so -> libcr_run.so.0.5.5
lrwxrwxrwx 1 root root    18 Jun  6 17:02 /lib64/libcr_run.so.0 -> libcr_run.so.0.5.5
-rwxr-xr-x 1 root root 10176 Apr  5 15:10 /lib64/libcr_run.so.0.5.5

Before srun, LD_PRELOAD is not defined. Inside an srun job, it is defined, but without a path:
$ echo $LD_PRELOAD
$ srun --x11 --pty /bin/bash
[deng@c26-04 ~]$ echo $LD_PRELOAD
libcr_run.so

This part of the code seems to be related:
https://github.com/SchedMD/slurm/blob/8d596cfc9136c6a3b624b37b8ea1881ae28f5ec1/src/plugins/checkpoint/blcr/checkpoint_blcr.c#L383
Comment 6 Brian Christiansen 2017-09-27 08:29:57 MDT
I get the same error message:
brian@lappy:~/slurm/17.02/lappy$ echo $LD_PRELOAD

brian@lappy:~/slurm/17.02/lappy$ srun -pdebug --pty $SHELL
brian@lappy:~/slurm/17.02/lappy$ echo $LD_PRELOAD
libcr_run.so
brian@lappy:~/slurm/17.02/lappy$ xterm
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

The ld.so man page explains why the error message appears:
        LD_PRELOAD
              A list of additional, user-specified, ELF shared  objects  to  be  loaded
              before  all  others.  The items of the list can be separated by spaces or
              colons.  This can be used to  selectively  override  functions  in  other
              shared objects.  The objects are searched for using the rules given under
              DESCRIPTION.  

***
              In  secure-execution  mode,  preload  pathnames  containing
              slashes  are  ignored,  and  only  shared  objects in the standard search
              directories that have the set-user-ID mode bit enabled are loaded.
***

e.g.
brian@lappy:~/slurm/17.02/lappy$ locate libcr_run.so    
/usr/lib/libcr_run.so
/usr/lib/libcr_run.so.0
/usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so
lrwxrwxrwx 1 root root 18 Aug  4  2016 /usr/lib/libcr_run.so -> libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so.0.5.5
-rw-r--r-- 1 root root 10336 Aug  4  2016 /usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ sudo chmod +s /usr/lib/libcr_run.so
brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so.0.5.5
-rwSr-Sr-- 1 root root 10336 Aug  4  2016 /usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=libcr_run.so xterm
brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=blah xterm
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

brian@lappy:~/slurm/17.02/lappy$ sudo chmod -s /usr/lib/libcr_run.so
brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=libcr_run.so xterm
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.


I'm investigating why the code explicitly sets LD_PRELOAD. We'll get back to you on what we find.
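
The set-user-ID check from the transcript above can be scripted. This is a minimal sketch based only on the man page excerpt (in secure-execution mode, a preloaded object must carry the set-user-ID bit); the function name has_setuid_bit is hypothetical, and the two library paths are the ones that appear in this thread.

```shell
# Hypothetical helper: succeed (return 0) if the file carries the
# set-user-ID bit. "[ -u ]" dereferences symlinks, so passing
# libcr_run.so checks the real libcr_run.so.0.5.5 behind it.
has_setuid_bit() {
    [ -u "$1" ]
}

# Library paths as they appear in this thread.
for lib in /lib64/libcr_run.so /usr/lib/libcr_run.so; do
    if has_setuid_bit "$lib"; then
        echo "$lib: set-user-ID bit present"
    else
        echo "$lib: no set-user-ID bit (or file not found)"
    fi
done
```

This only reports the state of the bit. Whether setting it on the library is an acceptable fix is a separate question that the thread leaves to the follow-up bug.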
Comment 7 Brian Christiansen 2017-09-27 08:44:59 MDT
I've created Bug 4192 to track the BLCR issue. Do you have any other questions regarding upgrading? If not, let's close this one.

Thanks,
Brian
Comment 8 NYU HPC Team 2017-09-27 08:59:14 MDT
Okay, closing it.
Comment 9 Brian Christiansen 2017-09-27 10:47:43 MDT
Info given.