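metrics - ambari + ambari-metrics-collector service not starting on an HDP cluster
Problem description
We have a problem with the ambari-metrics-collector service (we have an HDP cluster, version 2.6.4, with 8 nodes).
The Ambari metrics collector service either fails to start, or starts for a few seconds and then fails.
Details about the metrics collector version: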
rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
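The package list above shows the collector at a different version from the other ambari-metrics packages. The following is a minimal sketch, assuming the usual `name-version-release.arch` RPM naming, that groups the listed packages by version so such a difference is easy to spot. The helper name `group_by_version` is hypothetical, and the source does not establish whether the mismatch is related to the startup failure.

```python
import re
from collections import defaultdict

# Output of `rpm -qa | grep metrics`, copied from the question.
RPM_OUTPUT = """\
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
"""

# Assumed layout: name-version-release.arch, where the version starts with a digit.
PKG_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d[^-]*)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)


def group_by_version(rpm_output):
    """Map 'version-release' to the list of package names that have it."""
    groups = defaultdict(list)
    for line in rpm_output.splitlines():
        match = PKG_RE.match(line.strip())
        if match:
            key = f"{match['version']}-{match['release']}"
            groups[key].append(match["name"])
    return dict(groups)


if __name__ == "__main__":
    for version, names in group_by_version(RPM_OUTPUT).items():
        print(version, "->", ", ".join(names))
```

More than one key in the result means the installed ambari-metrics packages are not all at the same version.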
All machines run RHEL 7.2.
We performed the following steps to try to resolve the problem:
1. Restart the metrics-collector service
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'
or
ambari-metrics-collector stop
ambari-metrics-collector start
2. Restart ambari-metrics-monitor on all nodes
ambari-metrics-monitor stop
ambari-metrics-monitor start
3. Clean up the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/
Then restart the metrics-collector service.
4. Tune the metrics collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html
We updated the following parameters in Ambari:
metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128
Current status: steps 1-4 did not help.
From the logs we can see the following:
Log file: ambari-metrics-collector.log
2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
Log file: hbase-ams-master-master02.sys671.com.log
2020-06-25 09:38:18,799 WARN [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
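Both log excerpts point at the same ZooKeeper endpoint refusing connections. The following is a rough sketch, not part of any Ambari tooling, that counts the `Connection refused` lines in a log and extracts the `host:port` pairs the ZooKeeper client is trying to reach. The function name `summarize_zk_errors` and the two regular expressions are assumptions based only on the line formats quoted above.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in the question (assumed formats).
QUORUM_RE = re.compile(r"quorum=([\w.\-]+):(\d+)")
SEND_THREAD_RE = re.compile(r"SendThread\(([\w.\-]+):(\d+)\)")

# Three lines copied from the logs above, used as sample input.
SAMPLE_LOG = """\
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
2020-06-25 09:38:18,799 WARN [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
"""


def summarize_zk_errors(log_text):
    """Return (refused_count, Counter of (host, port) endpoints seen)."""
    endpoints = Counter()
    refused = 0
    for line in log_text.splitlines():
        if "java.net.ConnectException: Connection refused" in line:
            refused += 1
        for regex in (QUORUM_RE, SEND_THREAD_RE):
            for host, port in regex.findall(line):
                endpoints[(host, int(port))] += 1
    return refused, endpoints


if __name__ == "__main__":
    refused, endpoints = summarize_zk_errors(SAMPLE_LOG)
    print("connection refused lines:", refused)
    for (host, port), count in endpoints.items():
        print(f"{host}:{port} mentioned {count} time(s)")
```

In practice the function would be fed the full contents of the two log files. If every endpoint it reports is the same `host:port`, that is the first port to check for a listener.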
We also do not see the port (timeline.metrics.service.webapp.address) listening:
netstat -tulpn | grep 6188
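As a complement to the `netstat` check, the following minimal sketch tests from any node whether a TCP port accepts connections. The helper names `is_port_open` and `check_endpoints` are hypothetical. The ports worth probing here would be 6188 (from the `netstat` command above) and 61181 (the ZooKeeper quorum port in the logs).

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


def check_endpoints(host, ports, timeout=2.0):
    """Return a dict mapping each port to whether it accepted a connection."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

For example, `check_endpoints("master02.sys671.com", [6188, 61181])` would report both ports. A `False` for 61181 would be consistent with the `Connection refused` errors in the logs, although it does not by itself explain why the process is not listening.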
Any suggestions on how to proceed from this point?
We would appreciate any help with this problem.