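hadoop - YARN Timeline Service v2 fails to start
Problem description
I have a test HDP cluster set up on AWS for evaluating a project. The Ambari UI reported a number of errors, and when I restarted the services as prompted, I ran into a problem with YARN. Starting the Timeline Service Reader V2 for YARN produces this error: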
2018-08-10 15:51:06,400 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=15, retries=15, started=129034 ms ago, cancelled=false, msg=Call to HOSTNAME/IPADDRESS:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: HOSTNAME/IPADDRESS:17020, details=row 'prod.timelineservice.entity' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=HOSTNAME,17020,1533827052949, seqNum=-1
which eventually leads to:
stderr:
Traceback (most recent call last):
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 982, in restart
self.status(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 88, in status
check_process_status(pid_file)
File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/check_process_status.py", line 43, in check_process_status
raise ComponentIsNotRunning()
ComponentIsNotRunning
The above exception was the cause of the following exception:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 108, in <module>
ApplicationTimelineReader().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
method(env)
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 993, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 51, in start
hbase(action='start')
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 80, in hbase
createTables()
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 147, in createTables
logoutput=True)
File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
self.env.run()
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 263, in action_run
returns=self.resource.returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
result = function(command, **kwargs)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 308, in _call
raise ExecuteTimeoutException(err_msg)
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
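Which component needs to be restarted to bring YARN back to a healthy state, and what is the right way to debug problems like this in the future?
Solution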
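If you open "Background Operations" (the gear icon in the Ambari UI) and follow the link for the Timeline Service V2 start operation (you may have to click the machine running the Timeline Service first to get there), you should see links labelled "Copy" and "Open" in the top right corner. These should show you the error log in more detail.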
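The retry message in the question reports "Connection refused" on port 17020, which usually means nothing is listening at that address. Alongside reading the detailed log, a quick way to confirm this is to try a plain TCP connection from the node that runs the Timeline Reader. The following is a minimal sketch, not part of Ambari or HDP; `port_is_open` is a hypothetical helper name, and the host you pass in is whatever appears as HOSTNAME in your own log.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration only. A refused or timed-out
    connection raises OSError, which is reported here as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("HOSTNAME", 17020)` with your real host name returning `False` would be consistent with the embedded HBase region server for the timeline service not running, which is what the "Connection refused" message suggests.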
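In my case, Timeline Service V2 failed to start because there was not enough memory on the system. This was a small VM cluster with only 2 GB of RAM on each machine. The more detailed error log showed an out-of-memory error, and once I increased the VM memory to 4 GB the service was able to run. My best guess is that you do not have enough memory on the master NameNode, where you are running the Ambari UI. It seems to need roughly 4 GB or more, depending on how many services you run on the master NameNode.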
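To check whether a node is in the same situation, you can compare its total memory against the rough 4 GB figure above. This is a minimal sketch that assumes a Linux host exposing `/proc/meminfo`; the function names and the 3.5 GiB default threshold are my own assumptions (a VM sized at 4 GB typically reports slightly less than 4 GiB as `MemTotal`), not values taken from Ambari or HDP.

```python
def mem_total_gib(meminfo_text):
    """Parse MemTotal from /proc/meminfo-style text and return it in GiB.

    /proc/meminfo reports MemTotal in kB (1024-byte units).
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def has_enough_memory(meminfo_text, required_gib=3.5):
    """Return True if total memory meets the (assumed) threshold."""
    return mem_total_gib(meminfo_text) >= required_gib
```

On the node itself, you could call `has_enough_memory(open("/proc/meminfo").read())`. A result of `False` only says the machine is small; the detailed Ambari log is still what confirms an actual out-of-memory failure.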