2013-10-02 16 views
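Torque BLCR checkpointing with statically linked executables

I am trying to checkpoint jobs that run under the Torque job scheduler, using Berkeley Lab Checkpoint/Restart (BLCR). I get an error when I try cr_run "my_exec", and I believe the cause is that the executable was statically linked at compile time. The submission script looks like this (simplified, pseudo version):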

#!/bin/bash 
#PBS -q workq 
#PBS -l nodes=1:ppn=4 
#PBS -l pmem=1gb,pvmem=2gb 
#PBS -l walltime=30:00:00 
#PBS -o out.log 
#PBS -N jobname 
#PBS -j oe 

cd "$PBS_O_WORKDIR" || exit 1

# Count distinct nodes and total allocated processor slots
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
NP=$(wc -l < "$PBS_NODEFILE")
echo "PBS_NODEFILE is $PBS_NODEFILE"
echo "NNODES is $NNODES"
cat "$PBS_NODEFILE"

# Launch the executable under BLCR control, in the background
cr_run 'executable' infile.inp > outfile.out &

## store process ID as variable and sleep 29 hours, then checkpoint
BGPID=$!
sleep 104400

cr_checkpoint -p "$BGPID" -f checkFile.checkpoint --term

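I have already set up checkpointing successfully for jobs that use dynamically linked binaries (mostly executables built from code I wrote myself), so I know how to do that part. The problem is that the executable I am trying to run is precompiled, and I do not have the source code; if I did, this would not be an issue.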

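I found documentation here (see 4.2) that seems to offer some suggestions, but before trying to decipher and test them, I thought it was worth asking whether anyone has experience checkpointing jobs that run an executable that was not dynamically linked at compile time.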

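As a side note, the code has no internal checkpointing. Also, we use a more graceful way of triggering the checkpoint than sleeping for 29 hours; I only included the sleep here to keep the script uncluttered and readable.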

Did you actually see an error when doing this with statically linked code? I don't think that should affect things. – dbeer

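No error occurs with dynamically linked code. See the link above for an explanation of what to do when the code is statically linked at compile time. The problem is that I do not have the source code, so I have no control over how the executable is linked. That is the whole problem. – MaxGraves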

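My mistake - I thought you said you had a problem on restart, but you are talking about the initial build with BLCR. Was the code compiled locally, or was it provided by a vendor? If it was built locally, perhaps you can work with the site administrators to get a BLCR-compatible statically linked copy. If it comes from a vendor, you may need to push the vendor for the same thing, though that will probably be harder. – dbeer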

Answer

The answer is in the BLCR FAQ, at this entry: https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#staticlink

If you can checkpoint and restart a dynamically linked application correctly, but 
cannot do so with the same application linked statically, this FAQ entry is for you. 
There are multiple reasons why BLCR may have problems with statically linked executables. 

** The cr_run utility only supports dynamic executables ** 
If you wish to checkpoint an unmodified executable, the typical recipe is 

$ cr_run my_app my_args 

However, the cr_run utility does its work using the "LD_PRELOAD" environment variable 
to force loading of BLCR's support code into the address space of the application. That 
mechanism is only functional for dynamically linked executables. There is no magic we 
can perform today that will resolve this (though in the future we'd like to replace 
our use of LD_PRELOAD with a kernel-side mechanism). So, you'll need to relink any 
statically linked executables to include BLCR support. 
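
Since cr_run depends on LD_PRELOAD, it can help to check how an executable is linked before submitting the job. The following is a minimal sketch, not part of the FAQ, that assumes the file(1) utility is installed and that its description contains the usual "statically linked" or "dynamically linked" wording. The function names classify_linkage and linkage_of, and the path ./my_app in the commented usage, are hypothetical.

```shell
#!/bin/sh
# Classify a description string as produced by file(1).
# Prints "static", "dynamic", or "unknown".
classify_linkage() {
    case "$1" in
        *"statically linked"*)  echo static ;;
        *"static-pie linked"*)  echo static ;;   # wording used for static PIE binaries
        *"dynamically linked"*) echo dynamic ;;
        *)                      echo unknown ;;
    esac
}

# Run file(1) on an executable and classify the result.
# -L follows symlinks, -b omits the file name from the output.
linkage_of() {
    classify_linkage "$(file -Lb "$1")"
}

# Hypothetical usage inside a job script ("./my_app" is a placeholder):
# exe=./my_app
# if [ "$(linkage_of "$exe")" = dynamic ]; then
#     cr_run "$exe" infile.inp > outfile.out &
# else
#     echo "$exe is not dynamically linked; cr_run cannot preload BLCR" >&2
# fi
```

Keeping the string classification separate from the call to file(1) makes the check easy to try out on a machine that does not have the executable.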

** Linking BLCR's libraries statically takes special care ** 
OK, we've told you why cr_run doesn't work and told you to relink. You tried linking 
with -lcr_run and/or -lcr and still can't get a checkpoint to work. What went wrong? 
You need a -u option in addition to the -l or the static linking will simply ignore 
BLCR's library. 
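
For readers who do have the object files, the -u plus -l combination might look roughly like the sketch below. It only prints a candidate link command and does not run it. The symbol name pattern "<lib>_link_me" is an assumption inferred from the "grep link_me" check in this FAQ entry; the exact symbol should be taken from the "Cautionary linker notes" section of the BLCR Users Guide. The function name blcr_static_link_cmd, the compiler driver cc, and the file names are hypothetical, and a real link may need further libraries.

```shell
#!/bin/sh
# Print (but do not run) a static link command that forces the linker
# to pull in BLCR's library via an undefined-symbol request (-u).
#   $1 = output executable, $2 = BLCR library ("cr_run" or "cr"),
#   remaining arguments = object files (assumed to contain no spaces).
blcr_static_link_cmd() {
    out=$1; lib=$2; shift 2
    # ASSUMPTION: the forced symbol is named "<lib>_link_me".
    echo "cc -static -o $out $* -u ${lib}_link_me -l$lib"
}

# Placeholder object files, for illustration only.
blcr_static_link_cmd my_app cr_run main.o solver.o
# prints: cc -static -o my_app main.o solver.o -u cr_run_link_me -lcr_run
```

The -u option asks the linker to treat the named symbol as undefined, which is what makes it extract the matching member from the static archive instead of skipping the library as unused.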

** BLCR doesn't support LinuxThreads ** 
Ok, what else could go wrong? You've followed the guidance given in the "Cautionary 
linker notes" section of the BLCR Users Guide when you linked your application. You 
even ran 

$ nm my_app | grep link_me 

to be sure the symbol you specified with -u is linked in. However, you are seeing 
weird crashes of your application when you try to checkpoint. 

The culprit might be LinuxThreads. Why? Because at the time this FAQ entry is being 
written, there are many Linux distributions that install the static libs for 
LinuxThreads in the default library search path, and with the NPTL static libs 
elsewhere. The resolution could be as simple as linking your application with 
-L/usr/lib/nptl or -L/usr/lib64/nptl, perhaps by setting an "LDFLAGS" variable (though it is 
possible that your distribution has picked some other location). 
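
One way to set LDFLAGS as suggested above is to probe the candidate directories first. This is a rough sketch under two assumptions that are not from the FAQ: that the presence of a static libpthread.a is a good enough marker for an NPTL static library directory, and that the 64-bit location should be tried first. The function name nptl_ldflag is hypothetical.

```shell
#!/bin/sh
# Print a -L flag for the first candidate directory that contains a
# static libpthread.a; print nothing and return 1 if none is found.
# With no arguments, the two locations named in the FAQ are tried.
nptl_ldflag() {
    [ "$#" -gt 0 ] || set -- /usr/lib64/nptl /usr/lib/nptl
    for dir in "$@"; do
        if [ -f "$dir/libpthread.a" ]; then
            echo "-L$dir"
            return 0
        fi
    done
    return 1
}

# Usage: prepend the flag to LDFLAGS only when a directory was found.
if flag=$(nptl_ldflag); then
    LDFLAGS="$flag ${LDFLAGS:-}"
    export LDFLAGS
fi
echo "LDFLAGS=${LDFLAGS:-}"
```

If neither directory exists, LDFLAGS is left untouched, which matches the FAQ's caveat that a distribution may keep the NPTL static libraries somewhere else.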

While it is not strictly required due to binary compatibility between LinuxThreads and 
NPTL, we'd recommend that you at least consider a recompile with -I/usr/include/nptl 
in CFLAGS. 

Note, of course, that if BLCR's utilities are statically linked to LinuxThreads, then 
they need to be rebuilt too. See the BLCR Admin Guide for more information on that. 