apache-zookeeper - Error in Kafka: unable to reconnect to ZooKeeper, session 0x20000xxxxxxxx

Problem description

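We are running Confluent Kafka ( https://github.com/confluentinc/cp-helm-charts ) on Kubernetes (1.14.6). Our log retention is 30 minutes and our storage is 300 GB. We have 4 brokers with a replication factor of 3, and a throughput of roughly 65 MBps. The Kafka brokers have a 6 GB heap. After running for about an hour, we observe the following error.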

[2019-09-27 12:32:05,278] WARN [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Error when sending leader epoch request for Map(RT-15-7 -> (currentLeaderEpoch=Optional[8], leaderEpoch=6), RT-17-0 -> (currentLeaderEpoch=Optional[5], leaderEpoch=3), RT-19-6 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-02-0 -> (currentLeaderEpoch=Optional[5], leaderEpoch=3), RT-27-4 -> (currentLeaderEpoch=Optional[6], leaderEpoch=4), RT-22-3 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-32-5 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-42-4 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-27-1 -> (currentLeaderEpoch=Optional[9], leaderEpoch=7), _confluent-controlcenter-5-2-0-1-MetricsAggregateStore-repartition-2 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-21-6 -> (currentLeaderEpoch=Optional[8], leaderEpoch=6), _confluent-controlcenter-5-2-0-1-metrics-trigger-measurement-rekey-3 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-30-6 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-06-6 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), _confluent-controlcenter-5-2-0-1-expected-group-consumption-rekey-1 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-17-1 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), _confluent-metrics-10 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-21-0 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), _confluent-monitoring-9 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-17-9 -> (currentLeaderEpoch=Optional[9], leaderEpoch=7), RT-02-9 -> (currentLeaderEpoch=Optional[9], leaderEpoch=7), RT-20-1 -> (currentLeaderEpoch=Optional[8], leaderEpoch=6), RT-30-0 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-12-0 -> (currentLeaderEpoch=Optional[8], leaderEpoch=6), RT-32-9 -> (currentLeaderEpoch=Optional[6], leaderEpoch=4), RT-02-1 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-06-0 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), _confluent-monitoring-3 -> 
(currentLeaderEpoch=Optional[11], leaderEpoch=9), _confluent-metrics-7 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), _confluent-controlcenter-5-2-0-1-MetricsAggregateStore-changelog-0 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-27-5 -> (currentLeaderEpoch=Optional[9], leaderEpoch=7), RT-17-6 -> (currentLeaderEpoch=Optional[6], leaderEpoch=4), RT-32-3 -> (currentLeaderEpoch=Optional[7], leaderEpoch=5), RT-02-6 -> (currentLeaderEpoch=Optional[6], leaderEpoch=4), _confluent-monitoring-0 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9), RT-27-2 -> (currentLeaderEpoch=Optional[6], leaderEpoch=4), _confluent-controlcenter-5-2-0-1-actual-group-consumption-rekey-2 -> (currentLeaderEpoch=Optional[11], leaderEpoch=9)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:107)
    at kafka.server.ReplicaFetcherThread.fetchEpochEndOffsets(ReplicaFetcherThread.scala:310)
    at kafka.server.AbstractFetcherThread.truncateToEpochEndOffsets(AbstractFetcherThread.scala:208)
    at kafka.server.AbstractFetcherThread.maybeTruncate(AbstractFetcherThread.scala:173)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89)

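The rest of the configuration is default. I'm not sure what is causing ZooKeeper to close the socket connection. I can also see that all my pods are healthy. Please let me know if more information is needed. Any debugging pointers are appreciated.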
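
As a rough sanity check on the figures in this question, the data retained on each broker can be estimated with the sketch below. It rests on assumptions the question does not state: that 65 MBps is producer-side throughput before replication, that partitions are spread evenly across the 4 brokers, and that the 300 GB is per broker. The function name is purely illustrative.

```python
# Back-of-envelope estimate of retained data per broker.
# Assumptions (not confirmed by the question): throughput is measured
# before replication, partitions are evenly balanced across brokers,
# and 1 GB = 1000 MB.

def retained_gb_per_broker(throughput_mb_s, retention_min, replication_factor, brokers):
    """Approximate steady-state data held on each broker, in GB."""
    unique_mb = throughput_mb_s * retention_min * 60   # data written within the retention window
    replicated_mb = unique_mb * replication_factor     # every message is stored on RF brokers
    return replicated_mb / brokers / 1000


per_broker = retained_gb_per_broker(65, 30, 3, 4)
print(f"{per_broker:.2f} GB per broker")  # prints "87.75 GB per broker"
```

Under these assumptions the retained data is well below 300 GB per broker. Actual disk usage can be higher, because Kafka deletes whole log segments rather than individual messages.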

Grafana dashboard showing brokers, partitions, and other useful metrics

Tags: apache-zookeeper, confluent-platform
