
We have an ejabberd cluster of two hosts, and it runs into trouble while the hosts are being restarted: we see inconsistent-database errors in the log. However, we have not been able to determine from the configuration or from the module_init execution what actually causes the behavior. Deleting the Mnesia database on node1 would probably clear the problem, but that is not acceptable operationally. How can the Mnesia inconsistent_database error be resolved in a clustered ejabberd environment?

We would like to request a review of the data below, together with some of the configuration, and feedback on what may actually be causing this behavior and how to mitigate it.

Thanks in advance.

The environment is configured as follows:

  • Ejabberd version: 16.03
  • Number of hosts: 2
  • odbc_type: mysql

Error log:

** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, other_node} 
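For reference, this event can be watched for directly by subscribing to Mnesia system events from an attached shell. This is a minimal sketch; shell access via ejabberdctl debug is assumed and is not part of the original report:

{ok, _} = mnesia:subscribe(system),  %% deliver Mnesia system events to this shell
receive
    {mnesia_system_event, {inconsistent_database, Context, Node}} ->
        io:format("partition detected: ~p, reported against ~p~n", [Context, Node])
after 60000 ->
    timeout
end.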

Steps to reproduce:

  • Restart node 1
  • Restart node 2

Note: it does not reproduce if the hosts are restarted in the reverse order.
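Whether the two nodes still agree on the running set can be checked around each restart; a one-sided view of running_db_nodes corresponds to the partitioned state. A sketch, run in an attached shell on either node:

mnesia:system_info(db_nodes).          %% every node the local schema knows about
mnesia:system_info(running_db_nodes).  %% nodes currently connected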

Mnesia info:

Two tables appear to have different entry sizes, and possibly different contents, on the two nodes: muc_online_room and our custom table, renamed below to SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME (a quick cross-node size check is sketched after the two dumps):

Node 1:

---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <--- 
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0  records occupying 299  words of mem 
muc_online_room: with 348  records occupying 10757 words of mem 
http_bind  : with 0  records occupying 299  words of mem 
carboncopy  : with 0  records occupying 299  words of mem 
oauth_token : with 0  records occupying 299  words of mem 
session  : with 0  records occupying 299  words of mem 
session_counter: with 0  records occupying 299  words of mem 
sql_pool  : with 10  records occupying 439  words of mem 
route   : with 4  records occupying 405  words of mem 
iq_response : with 0  records occupying 299  words of mem 
temporarily_blocked: with 0  records occupying 299  words of mem 
s2s   : with 0  records occupying 299  words of mem 
route_multicast: with 0  records occupying 299  words of mem 
shaper   : with 2  records occupying 321  words of mem 
access   : with 28  records occupying 861  words of mem 
acl   : with 6  records occupying 459  words of mem 
local_config : with 32  records occupying 1293  words of mem 
schema   : with 19  records occupying 2727  words of mem 
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME  : with 2457  records occupying 49953 words of mem 
===> System info in version "4.12.5", debug level = none <=== 
opt_disc. Directory "SCRUBBED_LOCATION" is used. 
use fallback at restart = false 
running db nodes = [SCRUBBED_NODE2,SCRUBBED_NODE1] 
stopped db nodes = [] 
master node tables = [] 
remote    = [] 
ram_copies   = [access,acl,carboncopy,http_bind,iq_response, 
         local_config,mod_register_ip,muc_online_room,route, 
         route_multicast,s2s,session,session_counter,shaper, 
         sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME] 
disc_copies  = [oauth_token,schema] 
disc_only_copies = [] 
[{'SCRUBBED_NODE1',disc_copies}, 
{'SCRUBBED_NODE2',disc_copies}] = [schema, 
                    oauth_token] 
[{'SCRUBBED_NODE1',ram_copies}] = [local_config, 
                   acl,access, 
                   shaper, 
                   sql_pool, 
                   mod_register_ip] 
[{'SCRUBBED_NODE1',ram_copies}, 
{'SCRUBBED_NODE2',ram_copies}] = [route_multicast, 
                   s2s, 
                   temporarily_blocked, 
                   iq_response, 
                   route, 
                   session_counter, 
                   session, 
                   carboncopy, 
                   http_bind, 
                   muc_online_room, 
                   SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME] 
2623 transactions committed, 35 aborted, 26 restarted, 60 logged to disc 
0 held locks, 0 in queue; 0 local transactions, 0 remote 
0 transactions waits for other nodes: [] 
ok 

Node 2:

mnesia:info(). 
---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <--- 
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0  records occupying 299  words of mem 
muc_online_room: with 348  records occupying 8651  words of mem 
http_bind  : with 0  records occupying 299  words of mem 
carboncopy  : with 0  records occupying 299  words of mem 
oauth_token : with 0  records occupying 299  words of mem 
session  : with 0  records occupying 299  words of mem 
session_counter: with 0  records occupying 299  words of mem 
route   : with 4  records occupying 405  words of mem 
sql_pool  : with 10  records occupying 439  words of mem 
iq_response : with 0  records occupying 299  words of mem 
temporarily_blocked: with 0  records occupying 299  words of mem 
s2s   : with 0  records occupying 299  words of mem 
route_multicast: with 0  records occupying 299  words of mem 
shaper   : with 2  records occupying 321  words of mem 
access   : with 28  records occupying 861  words of mem 
acl   : with 6  records occupying 459  words of mem 
local_config : with 32  records occupying 1293  words of mem 
schema   : with 19  records occupying 2727  words of mem 
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME  : with 2457  records occupying 38232 words of mem 
===> System info in version "4.12.5", debug level = none <=== 
opt_disc. Directory "SCRUBBED_LOCATION" is used. 
use fallback at restart = false 
running db nodes = ['SCRUBBED_NODE1','SCRUBBED_NODE2'] 
stopped db nodes = [] 
master node tables = [] 
remote    = [] 
ram_copies   = [access,acl,carboncopy,http_bind,iq_response, 
         local_config,mod_register_ip,muc_online_room,route, 
         route_multicast,s2s,session,session_counter,shaper, 
         sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME] 
disc_copies  = [oauth_token,schema] 
disc_only_copies = [] 
[{'SCRUBBED_NODE1',disc_copies}, 
{'SCRUBBED_NODE2',disc_copies}] = [schema, 
                    oauth_token] 
[{'SCRUBBED_NODE1',ram_copies}, 
{'SCRUBBED_NODE2',ram_copies}] = [route_multicast, 
                   s2s, 
                   temporarily_blocked, 
                   iq_response, 
                   route, 
                   session_counter, 
                   session, 
                   carboncopy, 
                   http_bind, 
                   muc_online_room, 
                   SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME] 
[{'SCRUBBED_NODE2',ram_copies}] = [local_config, 
                   acl,access, 
                   shaper, 
                   sql_pool, 
                   mod_register_ip] 
2998 transactions committed, 18 aborted, 0 restarted, 99 logged to disc 
0 held locks, 0 in queue; 0 local transactions, 0 remote 
0 transactions waits for other nodes: [] 
ok 
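A quick way to confirm the size difference without reading the full dumps is to query both nodes from a single shell over rpc; a minimal sketch, with the scrubbed node names as placeholders:

Nodes = ['SCRUBBED_NODE1', 'SCRUBBED_NODE2'],
[{N,
  rpc:call(N, mnesia, table_info, [muc_online_room, size]),    %% record count
  rpc:call(N, mnesia, table_info, [muc_online_room, memory])}  %% words of memory
 || N <- Nodes].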

Answer


> Note: it does not reproduce if the hosts are restarted in the reverse order.

The inconsistent_database event exists to protect your data. If you stop the cluster nodes in one order, you have to restart them in the reverse order. Otherwise, the first node that was stopped has recorded that the other node was still active and holds more recent information, and Mnesia refuses to proceed silently in order to prevent data loss.
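One documented mitigation is to declare master nodes, so that a chosen replica wins after a partition regardless of restart order; whether that trade-off is acceptable here is an assumption, not part of the answer. A minimal sketch, run on the node whose copies should be kept:

ok = mnesia:set_master_nodes(['SCRUBBED_NODE1']).
%% or only for the tables that diverge:
ok = mnesia:set_master_nodes(muc_online_room, ['SCRUBBED_NODE1']).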


Thanks for looking into this, Mickaël. –


When I use the term "restart", I mean stopping and starting the same node. In our environment we can restart the second node at any time, but to restart the first node cleanly, the second node has to be shut down first. stop02, stop01, start01, start02 works, and so does stop01, stop02, start01, start02; however, stop01, stop02, start02, start01 does not. I am tempted to conclude that node01 is some kind of cluster master that has to be restarted first. The reason we restart one node at a time is to keep an instance available and avoid downtime. –
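If what blocks the "wrong" order is node01 hanging at startup while waiting for node02's replicas, Mnesia can be told to load a table from the local copy alone. This is a sketch under that assumption; force-loading discards whatever the absent node held, trading consistency for availability:

yes = mnesia:force_load_table(muc_online_room).
mnesia:table_info(schema, master_nodes).  %% [] here, matching "master node tables = []" in the dumps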


The alternative our systems engineer suggested is to remove a node from the cluster, make the change, and rejoin it. That saves overhead, since the restart order needs constant attention, and if a host simply fails and becomes unresponsive, the required shutdown steps will never have been performed anyway. I think a better rephrasing of the question is **"What is the correct way to restart nodes at any time, while keeping the cluster up, for changes that are not system-wide disruptions?"** –
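For the leave-and-rejoin approach, the Mnesia-level step that detaches a stopped node from the shared schema is removing its schema copy. A minimal sketch with the scrubbed node names, run on the surviving node while ejabberd/Mnesia is stopped on the node being removed (recent ejabberd releases also wrap this in ejabberdctl join_cluster / leave_cluster commands, though their availability in 16.03 is an assumption):

{atomic, ok} = mnesia:del_table_copy(schema, 'SCRUBBED_NODE2').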