2012-05-09 35 views
2

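PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication was initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location using rsync.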

Everything works as the documentation describes when the restore_command script fails with an exit code other than 255:

At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

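But when the restore_command script fails with exit code 255 (because the script returns the exit code of the failed rsync call), the server process dies with the following error: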

2012-05-09 23:21:30 CEST - @ LOG: database system was interrupted; last known up at  2012-05-09 23:21:25 CEST 
2012-05-09 23:21:30 CEST - @ LOG: entering standby mode 
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver] 
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7] 
2012-05-09 23:21:30 CEST - @ FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280 
2012-05-09 23:21:30 CEST - @ LOG: startup process (PID 8184) exited with exit code 1 
2012-05-09 23:21:30 CEST - @ LOG: aborting startup due to startup process failure 
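The "return code 65280" in the FATAL line above appears to be the raw wait status returned by system(), not the exit code itself. A minimal sketch of the decoding, assuming the conventional Linux encoding (exit code in the high byte, terminating signal in the low 7 bits); the function name decode_wait_status is made up for illustration:

```shell
#!/bin/sh
# decode_wait_status STATUS
# Prints "exit N" or "signal N" for a raw wait status.
# Assumption: conventional Linux layout; stopped/continued
# statuses are not handled in this sketch.
decode_wait_status() {
    status=$1
    sig=$((status & 127))
    if [ "$sig" -eq 0 ]; then
        # Normal exit: the exit code lives in bits 8..15.
        echo "exit $(( (status >> 8) & 255 ))"
    else
        # Killed by a signal: the signal number is in the low 7 bits.
        echo "signal $sig"
    fi
}

decode_wait_status 65280   # prints "exit 255"
```

So 65280 (0xFF00) corresponds to the exit code 255 that rsync reported.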

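So my question is: is this a bug, is there a special meaning of exit code 255 that is missing from the otherwise excellent documentation, or am I missing something else here?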

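I'm not making this an answer right now, because I won't have time to check the source code to confirm until later or tomorrow, but my recollection is that when applying WAL files during recovery, a non-zero exit code less than 255 means "failed, but keep trying", while 255 (or higher) means "failed badly; give up". You may need to adjust your script to return a smaller exit code when rsync fails. – kgrittn

@kgrittn: Thanks, I was thinking of something like that, but I couldn't find any documentation on a special meaning of exit code 255, and I didn't know where to look for it in the source code. – tscho

Well, it took a while, but this resurfaced as an issue I had to deal with, and my comment here was cited, so I looked it up and posted an answer with the details. This time I'll see about getting something into the documentation... – kgrittn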
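The suggestion in the comments above (return a smaller exit code when rsync fails) could be sketched as a small wrapper around the restore command. This is only an illustration under assumptions: the function name run_remapped and the script path used in the note below are made up, and only exit code 255 is remapped, so that 126, 127 and the codes above 128 that a shell reports for signals still reach PostgreSQL unchanged.

```shell
#!/bin/sh
# Hypothetical wrapper for restore_command (name and path are assumptions).
# Runs the command given as arguments and remaps exit code 255 to 1,
# so a dropped rsync/ssh connection counts as an ordinary failure.
# Every other exit code is passed through unchanged.
run_remapped() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 255 ]; then
        echo "restore wrapper: remapping exit code 255 to 1" >&2
        return 1
    fi
    return "$rc"
}

# Act only when a command was passed on the command line.
if [ "$#" -gt 0 ]; then
    run_remapped "$@"
    exit $?
fi
```

If the script were saved as, say, /usr/local/bin/restore_wal.sh (a placeholder path), restore_command would call it with the real rsync command line, including %f and %p, as arguments. Note that the remapping also hides genuine connection problems from PostgreSQL, so the wrapper's stderr message is worth keeping in the log.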

Answers

2

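On the primary server, the WAL files are located in the pg_xlog/ directory. While the WAL files are there, PostgreSQL is able to deliver them to the standby when they are requested.

Typically you also have a local archived-WAL location. Once PostgreSQL has moved files there, they can no longer be delivered online, and the standby is expected to fetch them from the archived-WAL location via restore_command.

If the primary and the standby are configured with different locations for archived WAL, the standby will eventually be unable to find the files it needs, and there will be a gap.

In your case, this might mean that:

  • 00000001000000000000003D has already been archived by the primary PostgreSQL;
  • the standby's restore_command does not see it in the configured source location.

You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It may also be worth reviewing your WAL locations and making sure both servers point to the same place.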


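EDIT: Grepping the sources for restore_command, only access/transam/xlog.c references it. At the end of the RestoreArchivedFile function (around line 3115 in the 9.1.3 sources), there is a check of whether restore_command exited normally or received a signal.

In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and could not handle it properly, I guess), a FATAL error is reported. The same is true for all exit codes greater than 125.

I can't tell you why, though.
I suggest asking on the hackers list.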

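Thanks for your answer, but that's not what I was asking. I know why the restore_command fails and how I could fix it; that's not the point. In fact, the restore_command _must_ fail at some point. My question is: why does an exit code > 0 but <> 255 have a different effect than exit code 255? With exit code 255 the standby server process dies, whereas it should continue the _restore loop_ as it does with exit code 1 or 17. – tscho

Thanks for the edit with the source code location. I'll have a look at it. – tscho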

0

This looks like a problem I ran into temporarily (port 837, rpcbind/rstatd) when using rsync with NFS:

$ rsync -avz /var/backup/* [email protected]:/data/backups 
rsync: connection unexpectedly closed (0 bytes received so far) [sender] 
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6] 

This fixed it for me:

service rpcbind stop

0

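I had the same problem when creating a hot standby (Postgres 9.5). Streaming itself worked fine (I ran pg_basebackup to the standby using the same credentials that would later be used in the standby's recovery.conf).

After taking the base backup, I set up the following recovery.conf: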

standby_mode = 'on' 
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password' 
recovery_target_timeline = 'latest' 
restore_command = 'sftp -q [email protected]:data/master_wal_archive/%f "%p"' 
trigger_file = '/srv/pgsql/9.5/data/trigger' 

Starting the server produced:

2016-03-08 12:34:58.981 UTC (/)LOG: database system was interrupted; last known up at 2016-03-08 12:26:10 UTC 
Couldn't read packet: Connection reset by peer 
2016-03-08 12:34:59.525 UTC (/)FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255 
2016-03-08 12:34:59.526 UTC (/)LOG: startup process (PID 26636) exited with exit code 1 
2016-03-08 12:34:59.526 UTC (/)LOG: aborting startup due to startup process failure 

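If I removed the restore_command line from recovery.conf, the standby started fine and began streaming WAL from the master.

I eventually traced the problem to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I had also forgotten to add the WAL archive host's server fingerprint to the standby postgres user's known_hosts file.

These two mistakes are (I assume) what caused the sftp restore_command to exit with code 255. As tscho said, the Postgres documentation suggests that if restore_command exits with any non-zero value, Postgres will simply go on to try streaming from the master rather than refuse to start. In fact, this does not seem to be the case if the exit code is above a certain number (probably 125, as vyegorov's grepping of the source suggests).

Once I fixed the two SSH problems, the standby started fine with the restore_command present in its recovery.conf.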

0

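Here is the source comment explaining why high exit statuses from the command process are treated specially, followed by the code that currently implements it.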

/* 
    * Remember, we rollforward UNTIL the restore fails so failure here is 
    * just part of the process... that makes it difficult to determine 
    * whether the restore failed because there isn't an archive to restore, 
    * or because the administrator has specified the restore program 
    * incorrectly. We have to assume the former. 
    * 
    * However, if the failure was due to any sort of signal, it's best to 
    * punt and abort recovery. (If we "return false" here, upper levels will 
    * assume that recovery is complete and start up the database!) It's 
    * essential to abort on child SIGINT and SIGQUIT, because per spec 
    * system() ignores SIGINT and SIGQUIT while waiting; if we see one of 
    * those it's a good bet we should have gotten it too. 
    * 
    * On SIGTERM, assume we have received a fast shutdown request, and exit 
    * cleanly. It's pure chance whether we receive the SIGTERM first, or the 
    * child process. If we receive it first, the signal handler will call 
    * proc_exit, otherwise we do it here. If we or the child process received 
    * SIGTERM for any other reason than a fast shutdown request, postmaster 
    * will perform an immediate shutdown when it sees us exiting 
    * unexpectedly. 
    * 
    * Per the Single Unix Spec, shells report exit status > 128 when a called 
    * command died on a signal. Also, 126 and 127 are used to report 
    * problems such as an unfindable command; treat those as fatal errors 
    * too. 
    */ 
    if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM) 
     proc_exit(1); 

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125; 

    ereport(signaled ? FATAL : DEBUG2, 
      (errmsg("could not restore file \"%s\" from archive: %s", 
        xlogfname, wait_result_to_str(rc))));
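
The decision rule in the snippet above can be mimicked in a few lines of shell to check how a given restore_command exit code would be treated. This is a rough sketch, not PostgreSQL code: the function name restore_failure_severity is made up, and it only models the exit-code branch (WEXITSTATUS(rc) > 125), not a child killed directly by a signal.

```shell
#!/bin/sh
# restore_failure_severity CODE
# Prints how the check quoted above would classify an exit code
# of restore_command: "success", "DEBUG2" (fall through and retry
# other WAL sources) or "FATAL" (abort recovery).
restore_failure_severity() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 125 ]; then
        echo "FATAL"
    else
        echo "DEBUG2"
    fi
}

for c in 1 17 125 126 255; do
    echo "exit code $c: $(restore_failure_severity "$c")"
done
```

Exit codes 1 and 17 from the question fall in the DEBUG2 range, while 255 (as returned by rsync or sftp on a connection failure) is classified as FATAL.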