
We have three services that must run in a cluster, so we use Infinispan to cluster the nodes and share data between these services. After a successful restart, I sometimes get the exception below, and a "view changed" event is received on the other nodes, even though all nodes are actually running. I cannot figure out the cause. org.infinispan.util.concurrent.TimeoutException: Replication timeout for "node name"

I am using Infinispan 8.1.3 (distributed cache) with JGroups 3.4.

org.infinispan.util.concurrent.TimeoutException: Replication timeout for sipproxy-16964 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765) 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:599) 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport$$Lambda$9/1547262581.apply(Unknown Source) 
      at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:717) 
      at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193) 
      at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2345) 
      at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46) 
      at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17) 
      at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) 
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 
    2017-08-22 04:44:52,902 INFO [JGroupsTransport] (ViewHandler,ISPN,transport_manager-48870) ISPN000094: Received new cluster view for channel ISPN: [transport_manager-48870|3] (2) [transport_manager-48870, mediaproxy-47178] 
    2017-08-22 04:44:52,949 WARN [PreferAvailabilityStrategy] (transport-thread-transport_manager-p4-t24) ISPN000313: Cache mediaProxyResponseCache lost data because of abrupt leavers [sipproxy-16964] 
    2017-08-22 04:44:52,951 WARN [ClusterTopologyManagerImpl] (transport-thread-transport_manager-p4-t24) ISPN000197: Error updating cluster member list 
    java.lang.IllegalArgumentException: There must be at least one node with a non-zero capacity factor 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.checkCapacityFactors(DefaultConsistentHashFactory.java:57) 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:74) 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:26) 
      at org.infinispan.topology.ClusterCacheStatus.updateCurrentTopology(ClusterCacheStatus.java:431) 
      at org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy.onClusterViewChange(PreferAvailabilityStrategy.java:56) 
      at org.infinispan.topology.ClusterCacheStatus.doHandleClusterView(ClusterCacheStatus.java:337) 
      at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:397) 
      at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:314) 
      at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:571) 
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
      at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 

jgroups.xml:

<config xmlns="urn:org:jgroups" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd"> 
    <TCP bind_addr="131.10.20.16" 
     bind_port="8010" port_range="10" 
     recv_buf_size="20000000" 
     send_buf_size="640000" 
     loopback="false" 
     max_bundle_size="64k" 
     bundler_type="old" 
     enable_diagnostics="true" 
     thread_naming_pattern="cl" 
     timer_type="new" 
     timer.min_threads="4" 
     timer.max_threads="30" 
     timer.keep_alive_time="3000" 
     timer.queue_max_size="100" 
     timer.wheel_size="200" 
     timer.tick_time="50" 
     thread_pool.enabled="true" 
     thread_pool.min_threads="2" 
     thread_pool.max_threads="30" 
     thread_pool.keep_alive_time="5000" 
     thread_pool.queue_enabled="true" 
     thread_pool.queue_max_size="100" 
     thread_pool.rejection_policy="discard" 

     oob_thread_pool.enabled="true" 
     oob_thread_pool.min_threads="2" 
     oob_thread_pool.max_threads="30" 
     oob_thread_pool.keep_alive_time="5000" 
     oob_thread_pool.queue_enabled="false" 
     oob_thread_pool.queue_max_size="100" 
     oob_thread_pool.rejection_policy="discard"/> 
     <TCPPING initial_hosts="131.10.20.16[8010],131.10.20.17[8010],131.10.20.182[8010]" port_range="2" 
     timeout="3000" num_initial_members="3" /> 

    <MERGE3 max_interval="30000" 
      min_interval="10000"/> 

    <FD_SOCK/> 
    <FD_ALL interval="3000" timeout="10000" /> 
    <VERIFY_SUSPECT timeout="500" /> 
    <BARRIER /> 
    <pbcast.NAKACK use_mcast_xmit="false" 
        retransmit_timeout="100,300,600,1200" 
        discard_delivered_msgs="true" /> 
    <UNICAST3 conn_expiry_timeout="0"/> 

    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" 
        max_bytes="10m"/> 
    <pbcast.GMS print_local_addr="true" join_timeout="5000" 
       max_bundling_time="30" 
       view_bundling="true"/> 
    <UFC max_credits="2M" 
     min_threshold="0.4"/> 
    <MFC max_credits="2M" 
     min_threshold="0.4"/> 
    <FRAG2 frag_size="60000" /> 
    <pbcast.STATE_TRANSFER/> 
</config> 

Answers


The TimeoutException only says that an RPC response was not received within the timeout, nothing more. That can happen when the server is under load, but that is probably not the case here: the log below says the node was 'suspected', which means the node was likely unresponsive for more than 10 seconds (the configured limit, see FD_ALL).
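That 10-second window comes from the FD_ALL line in your jgroups.xml. As a sketch (the values below are purely illustrative, not a recommendation), giving suspicion more headroom would look like:

    <FD_ALL interval="3000" timeout="30000" />
    <VERIFY_SUSPECT timeout="1500" />

Raising these only hides long pauses, though; finding the pause itself is the better fix.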

First check the logs on that server for errors, and the GC logs for any stop-the-world pauses.
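If GC logging is not enabled yet, a minimal set of HotSpot flags for Java 8 (which your stack traces suggest you are on) that captures stop-the-world pauses would be something like this; the jar name and log path are placeholders:

    # -XX:+PrintGCApplicationStoppedTime also reports non-GC safepoint pauses
    java -verbose:gc \
         -Xloggc:gc.log \
         -XX:+PrintGCDetails \
         -XX:+PrintGCDateStamps \
         -XX:+PrintGCApplicationStoppedTime \
         -jar your-app.jar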


OK, thanks. I will check whether a GC ran at that time. –


You were right! A full GC caused this :) –


As @flavius suggested, the main cause is that one of your nodes stopped for some reason and failed to reply to the RPC.

I suggest changing the JGroups logging level so that you can see why the node was suspected (this can be triggered by the FD_SOCK or FD_ALL protocols) and why it was removed from the view (most likely by the VERIFY_SUSPECT protocol).
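How you raise the level depends on your logging backend; with Log4j 1.x, for example (a sketch — adjust to whatever backend Infinispan's logging delegates to in your deployment), a single line in log4j.properties is enough:

    # log4j.properties: trace JGroups failure detection and view changes
    log4j.logger.org.jgroups=TRACE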

You can also check why it happened. In most cases it is caused by long GC pauses, but your VM might also have been paused by the host for other reasons. I suggest running jHiccup on both VMs, attached as a Java agent to your processes. That way you should be able to tell whether it was a JVM stop-the-world pause or the operating system that caused this.
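jHiccup ships as an agent jar, so attaching it is a one-flag change to your launch command (the application jar below is a placeholder); it then records a hiccup log showing exactly when, and for how long, the process was stalled:

    # attach jHiccup to the JVM at startup
    java -javaagent:jHiccup.jar -jar your-app.jar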


OK, thanks. I will try this. –


You were right! A full GC caused this :) –


I am glad you found it! – altanis