java - Flink - TimeoutException:ID为someId的TaskManager的心跳超时
问题描述
它类似于这个问题尽管我在本地运行可执行 jar(使用 java -jar 命令)。我使用 180G 作为 jvm 堆最大大小,大约 37 分钟后在大约 110G 的使用量时抛出异常
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id someId timed out.
在上面的问题中提到“可能 JobManager 的内存太小”。这如何转化为本地执行?只是像我上面写的那样使用 Xmx 配置的内存(180G)吗?
我在其他地方读到我可以配置任务管理器的心跳间隔?我如何在本地执行中做到这一点?
(编辑)我在哪里可以找到本地执行任务管理器的日志?
顺便说一下我用的是flink1.9
编辑日志:
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.s.SlotManagerImpl - Received slot report from instance 899db08270ab48abff5e3111979c6f07: SlotReport{slotsStatus=[SlotStatus{slotID=0895b99a-27b3-43da-8226-16031db924de_0, resourceProfile=ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=165639}, allocationID=8b473e13eadfa5d9b8e34e9020536b5e, jobID=6109483ea411f1c099d0bb18a5995742}]}.
18:29:44.070 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:29:44.070 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - The heartbeat of JobManager with id 966e2f0da4aa05748d6c7921fb02819e timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.r.j.slotpool.SlotPoolImpl - Slot Pool Status:
status: connected to akka://flink/user/resourcemanager
registered TaskManagers: [0895b99a-27b3-43da-8226-16031db924de]
available slots: []
allocated slots: [[AllocatedSlot 8b473e13eadfa5d9b8e34e9020536b5e @ 0895b99a-27b3-43da-8226-16031db924de @ localhost (dataPort=-1) - 0]]
pending requests: []
}
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - Disconnect job manager afeaa9d79956984a8a830ef4007a4315@akka://flink/user/jobmanager_1 for job 6109483ea411f1c099d0bb18a5995742 from the resource manager.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - Closing TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de because: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.s.SlotManagerImpl - Unregister TaskManager 899db08270ab48abff5e3111979c6f07 from the SlotManager.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Disconnect TaskExecutor 0895b99a-27b3-43da-8226-16031db924de because: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received slot report from TaskManager 0895b99a-27b3-43da-8226-16031db924de which is no longer registered.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Close ResourceManager connection 2ac3bc91e5eb2b7e41baf47d97764652.
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1144)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO o.a.f.r.e.ExecutionGraph - Map (1/1) (f6b46e7b79dc5c196c14072ff765354f) switched from RUNNING to FAILED.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - No open TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de. Ignoring close TaskExecutor connection. Closing reason was: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.072 [flink-akka.actor.default-dispatcher-536] INFO o.a.f.r.taskexecutor.TaskExecutor - Connecting to ResourceManager akka://flink/user/resourcemanager(bcfecd2d9083fd495b6f80b0a30e4507).
18:29:44.072 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.rpc.akka.AkkaRpcService - Try to connect to remote RPC endpoint with address akka://flink/user/resourcemanager. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO o.a.f.r.e.ExecutionGraph - Job Flink Streaming Job (6109483ea411f1c099d0bb18a5995742) switched from state RUNNING to FAILING.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
解决方案
推荐阅读
- java - Java - 如何从套接字附加到 jTextArea?
- html - 带有文本和点的进度条
- ios - iOS 11.4(15g77) 的设备支持文件
- java - Jsoup 去除字符串中的多个空格
- visual-studio-code - 使用 vscode 进行远程调试,console.log 有效,但 stdout 被抑制
- configuration - GitLab:从作业脚本影响作业配置
- python - 为什么赋值给None提示未定义变量
- reactjs - ReactJs Collapse 打开后关闭
- python-3.x - 打开时出现奇怪的语法错误
- pascal - Pascal(免费或 Turbo)读取