Please see the code below. As suggested by the documentation, I have turned the auto-down-unreachable-after feature off. Instead, I implemented custom logic that differs slightly from the usual approach. The key point of the code below is: if a network partition occurs, only the cluster nodes on the majority side will take down the UnreachableMember after a configurable delay (5 seconds here). The minority of cluster nodes, on the other hand, will see the majority group as unreachable but will not take those members down, so the minority never forms an island. The majority idea is borrowed from MongoDB, and I think it is nothing new in the field of computer science.
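For completeness, turning the feature off is a one-line change in application.conf (this is the standard Akka cluster setting of the same name):

# application.conf: never down unreachable nodes automatically;
# the custom ClusterListener below makes the downing decision instead
akka.cluster.auto-down-unreachable-after = off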
import akka.actor.{Actor, ActorLogging}
import akka.cluster.{Cluster, Member}
import akka.cluster.ClusterEvent._
import scala.concurrent.duration._

class ClusterListener extends Actor with ActorLogging {

  val cluster = Cluster(context.system)
  var unreachableMember: Set[Member] = Set()

  // subscribe to cluster changes, re-subscribe when restarting
  override def preStart(): Unit = {
    //#subscribe
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember], classOf[ReachableMember])
    //#subscribe
  }

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      log.info("Member detected as unreachable: {}", member)
      val state = cluster.state
      // only nodes on the majority side schedule a take-down;
      // the minority side does nothing and so never forms an island
      if (isMajority(state.members.size, state.unreachable.size)) {
        scheduleTakeDown(member)
      }
    case ReachableMember(member) =>
      unreachableMember = unreachableMember - member
    case _: MemberEvent => // ignore
    case "die" =>
      unreachableMember.foreach { member =>
        cluster.down(member.address)
      }
  }

  // majority of a group of n: strictly more than half, e.g. 3 of 4, 3 of 5, 4 of 6
  private def majority(n: Int): Int = (n + 1) / 2 + (n + 1) % 2

  private def isMajority(total: Int, dead: Int): Boolean = {
    require(total > 0)
    require(dead >= 0)
    (total - dead) >= majority(total)
  }

  private def scheduleTakeDown(member: Member) = {
    implicit val dispatcher = context.system.dispatcher
    unreachableMember = unreachableMember + member
    // TODO: make the 5-second delay configurable
    context.system.scheduler.scheduleOnce(5.seconds, self, "die")
  }
}
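To get rid of the hard-coded 5 seconds, the delay could be read from configuration. A minimal sketch, meant to replace scheduleTakeDown inside the ClusterListener above, assuming a hypothetical custom key custom-downing.down-removal-delay that you would define yourself in application.conf (e.g. custom-downing.down-removal-delay = 5s):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._

// read the delay once when the actor is created;
// "custom-downing.down-removal-delay" is our own key, not an Akka setting
val downRemovalDelay =
  context.system.settings.config
    .getDuration("custom-downing.down-removal-delay", TimeUnit.MILLISECONDS).millis

private def scheduleTakeDown(member: Member) = {
  implicit val dispatcher = context.system.dispatcher
  unreachableMember = unreachableMember + member
  context.system.scheduler.scheduleOnce(downRemovalDelay, self, "die")
}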
I have the same question as you. It seems there is no way to prevent the two cluster partitions from each starting their own cluster singleton. – mingchuno