Kafka broker hangs for an unknown reason

Problem description

I am running a single-node Kafka 2.0.0 setup, and at certain points in its lifetime it hits strange hangs.

I have a consumer that is connected 24/7, and at some point it throws these error messages:

2018/11/27 22:34:33 Consumer error: kafkaa:9094/1001: 1 request(s) timed out: disconnect (<nil>)
%3|1543358073.459|ERROR|rdkafka#consumer-1| [thrd:app]: rdkafka#consumer-1: 1/1 brokers are down: Local: All broker connections are down
runtime stack:
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x7fb3c5fdedb4]

runtime.throw(0x9749e0, 0x2a)
runtime.sigpanic()
goroutine 1 [syscall]:
    /usr/local/go/src/runtime/signal_unix.go:374 +0x2f2

The consumer disconnected unexpectedly, even though nothing was done on the application side.

I traced the Kafka logs back to the time the disconnects started happening:

[2018-11-27 22:30:37,905] INFO [GroupMetadataManager brokerId=1001] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Leader imbalance ratio for broker 1001 is 0.0 (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Topics not in preferred replica for broker 1001 Map() (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Preferred replicas by broker Map(1001 -> Map(__consumer_offsets-22 -> Vector(1001), __consumer_offsets-30 -> Vector(1001), __consumer_offsets-8 -> Vector(1001), __consumer_offsets-21 -> Vector(1001), __consumer_offsets-4 -> Vector(1001), __consumer_offsets-27 -> Vector(1001), __consumer_offsets-7 -> Vector(1001), pcap-input-0 -> Vector(1001), __consumer_offsets-9 -> Vector(1001), __consumer_offsets-46 -> Vector(1001), __consumer_offsets-25 -> Vector(1001), __consumer_offsets-35 -> Vector(1001), __consumer_offsets-41 -> Vector(1001), __consumer_offsets-33 -> Vector(1001), __consumer_offsets-23 -> Vector(1001), __consumer_offsets-49 -> Vector(1001), _schemas-0 -> Vector(1001), pcap-output-0 -> Vector(1001), __consumer_offsets-47 -> Vector(1001), __consumer_offsets-16 -> Vector(1001), __consumer_offsets-28 -> Vector(1001), __consumer_offsets-31 -> Vector(1001), __consumer_offsets-36 -> Vector(1001), __consumer_offsets-42 -> Vector(1001), __consumer_offsets-3 -> Vector(1001), __consumer_offsets-18 -> Vector(1001), __consumer_offsets-37 -> Vector(1001), __consumer_offsets-15 -> Vector(1001), __consumer_offsets-24 -> Vector(1001), pcap-input-error-0 -> Vector(1001), __consumer_offsets-38 -> Vector(1001), __consumer_offsets-17 -> Vector(1001), __consumer_offsets-48 -> Vector(1001), __confluent.support.metrics-0 -> Vector(1001), __consumer_offsets-19 -> Vector(1001), __consumer_offsets-11 -> Vector(1001), __consumer_offsets-13 -> Vector(1001), __consumer_offsets-2 -> Vector(1001), __consumer_offsets-43 -> Vector(1001), __consumer_offsets-6 -> Vector(1001), __consumer_offsets-14 -> Vector(1001), __consumer_offsets-20 -> Vector(1001), __consumer_offsets-0 -> Vector(1001), __consumer_offsets-44 -> Vector(1001), pcaps-output-failures-memsql-0 -> Vector(1001), __consumer_offsets-39 -> Vector(1001), __consumer_offsets-12 -> Vector(1001), __consumer_offsets-45 -> Vector(1001), __consumer_offsets-1 -> Vector(1001), __consumer_offsets-5 -> Vector(1001), __consumer_offsets-26 -> Vector(1001), __consumer_offsets-29 -> Vector(1001), __consumer_offsets-34 -> Vector(1001), __consumer_offsets-10 -> Vector(1001), pcaps-output-failures-elastic-0 -> Vector(1001), __consumer_offsets-32 -> Vector(1001), __consumer_offsets-40 -> Vector(1001))) (kafka.controller.KafkaController)
[2018-11-27 22:48:10,422] WARN Attempting to send response via channel for which there is no open connection, connection id 172.17.0.28:9094-172.17.0.27:40266-160 (kafka.network.Processor)
[2018-11-27 22:48:10,422] WARN Client session timed out, have not heard from server in 849630ms for sessionid 0x10001e9d5c00006 (org.apache.zookeeper.ClientCnxn)

Resource metrics show the broker's CPU spiking to 100% around this time (at 22:34, to be exact).

What is the broker doing here? What would justify 100% CPU consumption?

Tags: apache-kafka

Solution


Your consumer application has not polled for a long time, so the Kafka broker assumes something is wrong with it - perhaps a network partition occurred, or the application simply crashed. The broker does this so it can rebalance the group and assign the affected partitions to a different consumer.
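Judging by the rdkafka#consumer-1 lines in your stack trace, the consumer looks like a librdkafka-based Go client such as confluent-kafka-go. Assuming that, here is a minimal sketch of the liveness contract on the consumer side; the broker address and group id are placeholders, and the topic name is taken from your broker log:

package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// "kafka:9094" and "pcap-consumer" are placeholders for your broker
	// address and consumer group id.
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "kafka:9094",
		"group.id":          "pcap-consumer",
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// Topic name taken from the partition pcap-input-0 in the broker log above.
	if err := c.SubscribeTopics([]string{"pcap-input"}, nil); err != nil {
		panic(err)
	}

	for {
		// The consumer must keep polling: if the application stops calling
		// Poll for longer than max.poll.interval.ms, it is treated as failed
		// and its partitions are reassigned to another group member.
		ev := c.Poll(100)
		switch e := ev.(type) {
		case *kafka.Message:
			// Keep per-message work well under max.poll.interval.ms, or hand
			// it off to a worker goroutine instead of blocking this loop.
			fmt.Printf("message on %s\n", e.TopicPartition)
		case kafka.Error:
			fmt.Printf("consumer error: %v\n", e)
		}
	}
}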

You may need to tune zookeeper.session.timeout.ms, heartbeat.interval.ms, session.timeout.ms and max.poll.interval.ms to resolve it. Take a look at http://kafka.apache.org/20/documentation.html and search for "heartbeat". I see you are using some Go library - check how heartbeats can be tuned in it.
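If the client really is confluent-kafka-go / librdkafka, the consumer-side timeouts are set through its ConfigMap. A sketch with purely illustrative values (not recommendations) follows; note that zookeeper.session.timeout.ms is a broker-side setting in server.properties rather than a client option, and max.poll.interval.ms requires a reasonably recent librdkafka:

package kafkaconsumer

import "github.com/confluentinc/confluent-kafka-go/kafka"

// NewTunedConsumer shows where the heartbeat/session/poll timeouts live on
// the client side. Values are illustrative only - tune them to your workload.
func NewTunedConsumer() (*kafka.Consumer, error) {
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":     "kafka:9094",    // placeholder broker address
		"group.id":              "pcap-consumer", // placeholder group id
		"session.timeout.ms":    30000,           // how long without heartbeats before the group coordinator evicts the consumer
		"heartbeat.interval.ms": 3000,            // keep well below session.timeout.ms
		"max.poll.interval.ms":  300000,          // must exceed your worst-case time between Poll calls
	})
}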

