首页 > 解决方案 > 设置因果集群失败

问题描述

我正在尝试设置一个具有 3 个核心(仅限核心)的 Neo4J 因果集群。我有三台 Debian 服务器,都是 debian 8.5。我在每台服务器上都安装了 Java 8 和 Neo4J Enterprise 3.4.0(包源 deb https://debian.neo4j.org/repo stable/)。

我的主机是 192.168.20.163、192.168.20.164 和 192.168.20.165。每个主机上的配置都是相同的,IP 地址有明显的变化。以下是针对 .163 主机的

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=192.168.20.163
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=3
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=192.168.20.163:5000,192.168.20.164:5000,192.168.20.165:5000
causal_clustering.discovery_type=LIST
causal_clustering.discovery_listen_address=192.168.20.163:5000
causal_clustering.transaction_listen_address=192.168.20.163:6000
causal_clustering.raft_listen_address=192.168.20.163:7000

服务器通过选举过程,但 LEADER 继续切换回 FOLLOWER 并触发新的选举。

非领导服务器或“成员”每个都会收到以下错误:

错误 [onccssCoreStateDownloader] 由于商店 ID 不匹配,商店复制失败

首先启动的服务器成为 LEADER,但如所示切换回 FOLLOWER:

2018-05-30 14:58:22.808+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election
2018-05-30 14:58:22.825+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.165:7000]
2018-05-30 14:58:22.827+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.164:7000]
2018-05-30 14:58:22.838+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.848+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.861+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x2ee2e930, L:/192.168.20.163:50169 - R:/192.168.20.165:7000]
2018-05-30 14:58:22.862+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:50169 remote /192.168.20.165:7000
2018-05-30 14:58:22.863+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x3d670ef3, L:/192.168.20.163:38239 - R:/192.168.20.164:7000]
2018-05-30 14:58:22.863+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:38239 remote /192.168.20.164:7000
2018-05-30 14:58:22.928+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.929+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.965+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.164:41725
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 111 (I am MemberId{fbdff840}), voted for by [MemberId{4fe121e0}]
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{fbdff840}
2018-05-30 14:58:23.044+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{f202d023}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Idle{}
2018-05-30 14:58:23.046+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{fbdff840} updating leader info for database default and term 111
2018-05-30 14:58:24.105+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.164:58041
2018-05-30 14:58:26.841+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.165:48317
2018-05-30 14:58:30.881+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.165:47015
2018-05-30 14:58:38.462+0000 INFO [o.n.c.c.c.m.MembershipWaiter] Leader commit unknown
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: []
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{fbdff840} to null
2018-05-30 14:58:40.412+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{f202d023}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}
2018-05-30 14:58:40.413+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{fbdff840}, was leader in term 111, now moving to follower.
2018-05-30 14:58:48.342+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered

最终服务器失败:

错误 [oncccmMembershipWaiterLifecycle] 服务器未能在追赶时间限制内加入集群 [600000 毫秒]

标签: neo4j

解决方案


根据您收到的消息,我假设您正在尝试使用来自某处的备份来为集群播种?这是你应该做的:

  1. 检查集群是否在没有种子的情况下正确形成(因此使用空数据库)。这样您就可以验证所有设置是否正确。
  2. 使用备份播种集群时,您需要在启动前在每个实例上取消绑定 neo4j-admin的数据库。检查https://neo4j.com/docs/operations-manual/current/clustering/causal-clustering/seed-cluster/以了解您的案例的具体说明。如果您不取消绑定,您将得到商店ID 不匹配。
  3. 如果 1. 和 2. 不能解决您的问题,请检查 Neo4j 支持(因为您使用的是 EE,我假设您确实有支持)。

希望这可以帮助。

问候,汤姆


推荐阅读