Bug 4192

Summary: With BLCR, xterm gives error in pty shell
Product: Slurm
Reporter: Brian Christiansen <brian>
Component: Scheduling
Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: alex, hpc-staff
Version: 17.02.7   
Hardware: Linux   
OS: Linux   
Site: NYU

Description Brian Christiansen 2017-09-27 08:42:28 MDT
As noted in Bug 4181:

With CheckpointType=blcr configured, running xterm inside an interactive srun session produces an error:

$ srun --x11 --pty /bin/bash
[deng@c26-04 ~]$ xterm
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded: ignored.

This is because the blcr code explicitly sets LD_PRELOAD to libcr_run.so.

The ld.so man page explains why the error message appears:
        LD_PRELOAD
              A list of additional, user-specified, ELF shared  objects  to  be  loaded
              before  all  others.  The items of the list can be separated by spaces or
              colons.  This can be used to  selectively  override  functions  in  other
              shared objects.  The objects are searched for using the rules given under
              DESCRIPTION.  

***
              In  secure-execution  mode,  preload  pathnames  containing
              slashes  are  ignored,  and  only  shared  objects in the standard search
              directories that have the set-user-ID mode bit enabled are loaded.
***

e.g.
brian@lappy:~/slurm/17.02/lappy$ locate libcr_run.so    
/usr/lib/libcr_run.so
/usr/lib/libcr_run.so.0
/usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so
lrwxrwxrwx 1 root root 18 Aug  4  2016 /usr/lib/libcr_run.so -> libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so.0.5.5
-rw-r--r-- 1 root root 10336 Aug  4  2016 /usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ sudo chmod +s /usr/lib/libcr_run.so
brian@lappy:~/slurm/17.02/lappy$ ls -l /usr/lib/libcr_run.so.0.5.5
-rwSr-Sr-- 1 root root 10336 Aug  4  2016 /usr/lib/libcr_run.so.0.5.5

brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=libcr_run.so xterm
brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=blah xterm
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'blah' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

brian@lappy:~/slurm/17.02/lappy$ sudo chmod -s /usr/lib/libcr_run.so
brian@lappy:~/slurm/17.02/lappy$ LD_PRELOAD=libcr_run.so xterm
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libcr_run.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
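
The secure-execution rule quoted from the man page can be sketched as a small shell check. This is only an illustration: the helper names (is_setid, preload_entry_has_slash, explain_preload) are hypothetical, and it assumes the program in question (e.g. xterm) is installed set-user-ID or set-group-ID, which the transcript above suggests but does not show. Other triggers for secure-execution mode, such as file capabilities, and the standard-directory requirement are not checked.

```shell
# Sketch only: helper names are hypothetical, not part of Slurm, BLCR, or glibc.

# Succeeds when FILE has the set-user-ID or set-group-ID bit set, which
# makes the dynamic loader run it in secure-execution mode.
is_setid() {
    [ -u "$1" ] || [ -g "$1" ]
}

# Succeeds when a single LD_PRELOAD entry contains a slash.
preload_entry_has_slash() {
    case "$1" in
        */*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Prints a one-line verdict.
#   $1 = program to be started, $2 = LD_PRELOAD entry, $3 = library file on disk
explain_preload() {
    if ! is_setid "$1"; then
        echo "not secure-execution mode: normal search rules apply"
    elif preload_entry_has_slash "$2"; then
        echo "secure-execution mode: entry ignored (contains a slash)"
    elif [ -u "$3" ]; then
        echo "secure-execution mode: library has the set-user-ID bit"
    else
        echo "secure-execution mode: entry ignored (no set-user-ID bit on library)"
    fi
}
```

For example, `explain_preload "$(command -v xterm)" libcr_run.so /usr/lib/libcr_run.so` would report which of the cases applies on a given node.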
Comment 1 Brian Christiansen 2017-09-27 08:43:30 MDT
I'm investigating why the code explicitly sets LD_PRELOAD. We'll get back to you on what we find.
Comment 4 Brian Christiansen 2017-09-27 10:53:53 MDT
On a side note.

In the past we have encouraged others to consider alternatives to BLCR, such as SCR[1]. BLCR has known limitations. One is that it requires each pid that was running at checkpoint time to be available for its own use at restart. We have also had other concerns in the past, which may have been corrected with time, but maybe not:

BLCR assumes that files do not change over time and does not check for it, it does not support GPUs, it will cause problems if the job has to talk to a license server, and so on.

[1] http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi
Comment 9 Brian Christiansen 2017-10-23 14:07:30 MDT
After further investigation, this appears to be an issue with how xterm handles LD_PRELOAD. We could do some things to prevent the error from happening, but that would just mask the fact that BLCR won't work with xterm.
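
As an illustration of what such masking could look like on the user side, a minimal sketch follows. It is not a Slurm or BLCR feature: run_without_preload is a hypothetical helper that starts one command with LD_PRELOAD removed from its environment. The ld.so message goes away, but the process is then started without libcr_run.so preloaded, so this does not make xterm work with BLCR.

```shell
# Sketch only: run_without_preload is a hypothetical helper name.
# Runs a command with LD_PRELOAD removed from its environment. The
# subshell keeps the calling shell's own LD_PRELOAD untouched.
run_without_preload() {
    (
        unset LD_PRELOAD
        exec "$@"
    )
}

# Example, inside the "srun --x11 --pty /bin/bash" shell:
#   run_without_preload xterm
```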

Further, we've decided to deprecate the BLCR plugin in 17.11 and remove it in 18.08. BLCR can still be used but it will have to be run manually by the user.

We recommend investigating the other alternatives such as SCR and DMTCP.

Let us know if you have any questions.

Thanks,
Brian
Comment 10 Alejandro Sanchez 2017-10-23 14:10:54 MDT
Deprecating BLCR sounds good to me. Project seems to be inactive since ~January 2013.