2013-08-19 79 views
5

我们遇到了群集中的某些节点突然离开群集而没有任何明显原因的问题。Elasicsearch节点断开连接

我们在Elasticsearch v0.20.6,JVM 7u25上运行。我们使用单播发现。

这是一个嵌入式ES实例,集群中有7个节点。一个位置(网络)上的节点47,48,49和50,另一个位置上的节点24,25和26。

每隔一段时间后都会发生同样的事情,索引文件在测试之间被删除。其中一个24,25,26节点突然认为它的主人(这再次导致了脑裂情况 - 这没问题,我明白为什么会发生这种情况,但问题是为什么断开了。

首先,NODE47当选主所有其他节点加入,事情顺利运行了几个小时左右

突然,这里是的东西第一痕迹明显走错了,周围19:10:

Node47: 
2013-08-14 19:09:49,243 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}], channel closed event 
2013-08-14 19:09:54,109 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], channel closed event 
2013-08-14 19:10:06,008 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] disconnected from [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}], channel closed event 
2013-08-14 19:10:34,253 TRACE [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][generic][T#19]) [local] [node ] [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}] transport disconnected (with verified connect) 
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#24]) [local] connected to node [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}] 
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#25]) [local] connected to node [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}] 
2013-08-14 19:10:34,273 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#26]) [local] connected to node [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}] 
2013-08-14 19:10:34,290 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#27]) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}] 


Node24: 
2013-08-14 19:10:35,167 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false} but we do not exists on it, act as if its master failure 
2013-08-14 19:10:35,170 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [master failure, do not exists on master, act as master failure] 
2013-08-14 19:10:35,171 INFO [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#1]) [local] master_left [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [do not exists on master, act as master failure] 
2013-08-14 19:10:35,174 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [possible elected master since master left (reason = do not exists on master, act as master failure)] 
2013-08-14 19:10:35,181 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#1]) [local] disconnected from [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}] 
2013-08-14 19:10:36,233 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false} that is no longer a master 
2013-08-14 19:10:36,235 INFO [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#5]) [local] master_left [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [no longer master] 
2013-08-14 19:10:36,235 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [master failure, no longer master] 
2013-08-14 19:10:36,241 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [possible elected master since master left (reason = no longer master)] 
2013-08-14 19:10:36,245 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#5]) [local] disconnected from [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}] 
2013-08-14 19:10:37,359 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] pinging a master [local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false} that is no longer a master 
2013-08-14 19:10:37,361 INFO [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#10]) [local] master_left [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [no longer master] 
2013-08-14 19:10:37,363 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] stopping fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [master failure, no longer master] 
2013-08-14 19:10:37,393 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#10]) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}] 

据我所能读取的日志;这是发生了什么事情:

19:09:49243 - 一个信道关闭时从NODE24接收到NODE47(主)事件,并将其断开 19:10:3​​4273 - 到NODE24的连接完成后,然后 19:10:3​​4290 - 我们从NODE24中获得一个“断开连接”012:19:10:3​​5,167 - NODE24 pings master(NODE47),但是主节点在其节点列表中没有NODE24,并且威胁到主节点故障。据我所知,所有这一切都发生在第二个 - 唉,没有超时工作在这里工作。此外,在此期间或之前没有大型GC或任何可以测量的放缓。

Im at loss;为什么会发生?如果网络问题;网络端应该测试什么?

回答

2

为了回答这个问题,我自己与行为的实际原因;

2个节点之间的tcp连接(同时保持与其他节点的连接)断开连接。它可以通过使用像tcpkill这样的实用程序来重新创建。

Elasticsearch Zen发现可悲的是不能处理这样的错误非常好,而且各种奇怪的结果都是可能的。连接到主节点的节点将进行选举,并可能会混淆其他节点。

+1

男人,这不是我想听到的。那么避免这些情况的最佳解决方法是什么? :( –

+0

避免麻烦的方法是确保minimum_master_nodes设置正确地设置为节点/ 2-1,但遗憾的是这在我们的案例中是不可行的。我真的希望Elasticsearch能够继续发展Zen更强大 – runarM

+0

你使用的是什么类型的发现?目前我正在使用cloud-aws插件,我不确定为什么我的节点保持断开连接。你认为它会通过单播发现而不是使用cloud-aws插件提高吗? – syllogismos