ambari + ambari-metrics-collector service not starting on HDP cluster

Problem description

We have a problem with the ambari-metrics-collector service on our HDP cluster (version 2.6.4, 8 nodes).

The Ambari Metrics Collector service either fails to start, or starts, runs for a few seconds, and then fails.


Details of the installed metrics package versions:

rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
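
In the listing above, ambari-metrics-collector is at 2.5.0.3-7 while the other three packages are at 2.6.1.0-143. Whether that mismatch is related to the failure is not established here, but it is easy to check for on each node. A minimal sketch, assuming package names follow the name-version-release.arch pattern shown above; the function names ams_versions and ams_version_count are made up for illustration:

```shell
#!/usr/bin/env bash

# Reads package names such as "ambari-metrics-collector-2.5.0.3-7.x86_64"
# on stdin and prints "<version-release> <name>" per package, sorted.
# Lines that do not match the expected pattern are skipped.
ams_versions() {
  sed -nE 's/^(.+)-([0-9][0-9.]*-[0-9]+)\.[A-Za-z0-9_]+$/\2 \1/p' | sort
}

# Prints how many distinct version-release strings appear on stdin.
# A result greater than 1 means the packages are not all at the same version.
ams_version_count() {
  ams_versions | awk '{print $1}' | sort -u | wc -l | tr -d ' '
}

# Possible use on a node:
#   rpm -qa | grep metrics | ams_versions
#   rpm -qa | grep metrics | ams_version_count
```

Run against the four lines above, the first function would group the collector under 2.5.0.3-7 and the remaining packages under 2.6.1.0-143.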

All machines run RHEL 7.2.

We tried the following steps to resolve the problem:

1. Restart the metrics-collector service

su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'

or

ambari-metrics-collector stop 
ambari-metrics-collector start

2. Restart ambari-metrics-monitor on all nodes

ambari-metrics-monitor stop
ambari-metrics-monitor start

3. Clean up the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/

mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/

Then restart the metrics-collector service.
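
The mv in step 3 only works as intended if /tmp/bck/zookeeper/ already exists. A small sketch of a more defensive version of the same move, presumably run while the collector is stopped; backup_dir is a hypothetical helper name, not part of Ambari:

```shell
#!/usr/bin/env bash

# Moves the directory SRC into DEST_ROOT, creating DEST_ROOT if needed.
# Refuses to overwrite an existing backup with the same name.
backup_dir() {
  local src="$1" dest_root="$2" target
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  mkdir -p "$dest_root" || return 1
  target="$dest_root/$(basename "$src")"
  if [ -e "$target" ]; then
    echo "backup already exists: $target" >&2
    return 1
  fi
  mv "$src" "$target"
}

# Possible use with the paths from step 3:
#   backup_dir /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper
```

Keeping the old directory under /tmp/bck rather than deleting it leaves the option of moving it back if the cleanup turns out not to help.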

4. Tune the metrics collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html

We updated the following parameters in Ambari:

metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128

Current status: steps 1-4 did not help.

From the logs we can see the following:

Log file: ambari-metrics-collector.log

2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server

Log file: hbase-ams-master-master02.sys671.com.log

2020-06-25 09:38:18,799 WARN  [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
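
Both excerpts show clients being refused by the ZooKeeper address named in the quorum= field (master02.sys671.com:61181). To confirm that every failure in a longer log points at the same address, the logs can be summarized with standard tools. A rough sketch, assuming the log format shown above; the function names are invented for illustration:

```shell
#!/usr/bin/env bash

# Prints each distinct ZooKeeper quorum address ("host:port") that appears
# as "quorum=..." in the log text read from stdin.
zk_quorum_from_log() {
  grep -oE 'quorum=[^, ]+' | cut -d= -f2 | sort -u
}

# Prints the number of lines on stdin that contain "Connection refused".
refused_count() {
  grep -c 'Connection refused'
}

# Possible use:
#   zk_quorum_from_log < ambari-metrics-collector.log
#   refused_count < hbase-ams-master-master02.sys671.com.log
```

If only one address comes back, the next thing to look at is whether anything is listening on that host and port.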

We also do not see the collector port listening (timeline.metrics.service.webapp.address):

netstat -tulpn  | grep  6188
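
A plain grep for 6188 matches any line containing those digits, so a stricter check on the local-address column can be useful. The sketch below is one way to do that, assuming the usual `netstat -tln` or `ss -tln` output layout; is_listening is a hypothetical helper name:

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) if stdin, expected to be the output of `netstat -tln`
# or `ss -tln`, contains a LISTEN socket whose address ends in ":<port>".
# Fails (exit 1) otherwise.
is_listening() {
  local port="$1"
  awk -v p="$port" '
    /LISTEN/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ (":" p "$")) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Possible use, with the two ports that appear in this post:
#   netstat -tln | is_listening 6188  || echo "collector web port 6188 is not listening"
#   netstat -tln | is_listening 61181 || echo "ZooKeeper port 61181 is not listening"
```

Checking 61181 as well as 6188 ties the port check back to the "Connection refused" errors in the logs, which are all for port 61181.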

Any suggestions on how to proceed from here?

We would appreciate any help with this problem.

Tags: metrics, rhel, ambari, hdp
