apache-flink - Flink 卡在启动 TaskManagers 中
问题描述
我们正在尝试使用以下设置在我们的纱线集群中运行我们的 flink 应用程序:
并行度:60
作业管理器内存:2g
任务管理器内存:2g
任务管理器插槽:1
当我们提交作业时,会发生以下情况
2020-11-26 17:24:40,450 INFO org.apache.flink.yarn.YarnResourceManager [] - Received 1 containers.
2020-11-26 17:24:40,456 INFO org.apache.flink.yarn.YarnResourceManager [] - Received 1 containers with resource <memory:2048, vCores:10>, 1 pending container requests.
2020-11-26 17:24:40,461 INFO org.apache.flink.yarn.YarnResourceManager [] - TaskExecutor container_e352_1592997214849_1528083_01_000005 will be started on dlawsq.sample.com with TaskExecutorProcessSpec {cpuCores=10.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemorySize=634.880mb (665719939 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=204.800mb (214748368 bytes)}.
2020-11-26 17:24:40,467 INFO org.apache.flink.yarn.YarnResourceManager [] - Adding keytab hdfs://storagens1/user/user/.flink/application_1592997214849_1528083/user.keytab to the AM container local resource bucket
2020-11-26 17:24:41,259 INFO org.apache.flink.yarn.YarnResourceManager [] - Creating container launch context for TaskManagers
2020-11-26 17:24:41,261 INFO org.apache.flink.yarn.YarnResourceManager [] - Starting TaskManagers
2020-11-26 17:24:41,280 INFO org.apache.flink.yarn.YarnResourceManager [] - Removing container request Capability[<memory:2048, vCores:10>]Priority[1].
2020-11-26 17:24:41,280 INFO org.apache.flink.yarn.YarnResourceManager [] - Accepted 1 requested containers, returned 0 excess containers, 0 pending container requests of resource <memory:2048, vCores:10>.
2020-11-26 17:24:54,015 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source: radcom_gi_v3 -> Flink Kafka Consumer -> (Column Reduction -> Timestamps/Watermarks, Map -> Sink: Dropped Messages HDFS Parquet Writer) (1/5) of job 2e127b83d14dcaa44baae146360169ed is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-11-26 17:24:56,476 INFO org.apache.flink.yarn.YarnResourceManager [] - Requesting new TaskExecutor container with resource WorkerResourceSpec {cpuCores=10.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes)}. Number pending workers of this resource is 1.
2020-11-26 17:25:01,978 INFO org.apache.flink.yarn.YarnResourceManager [] - Received 1 containers.
2020-11-26 17:25:01,979 INFO org.apache.flink.yarn.YarnResourceManager [] - Received 1 containers with resource <memory:2048, vCores:10>, 1 pending container requests.
2020-11-26 17:25:01,979 INFO org.apache.flink.yarn.YarnResourceManager [] - TaskExecutor container_e352_1592997214849_1528083_01_000010 will be started on dlawsq.sample.com with TaskExecutorProcessSpec {cpuCores=10.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemorySize=634.880mb (665719939 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=204.800mb (214748368 bytes)}.
2020-11-26 17:25:01,980 INFO org.apache.flink.yarn.YarnResourceManager [] - Adding keytab hdfs://storagens1/user/user/.flink/application_1592997214849_1528083/user.keytab to the AM container local resource bucket
2020-11-26 17:25:02,024 INFO org.apache.flink.yarn.YarnResourceManager [] - Creating container launch context for TaskManagers
2020-11-26 17:25:02,026 INFO org.apache.flink.yarn.YarnResourceManager [] - Starting TaskManagers
2020-11-26 17:25:02,027 INFO org.apache.flink.yarn.YarnResourceManager [] - Removing container request Capability[<memory:2048, vCores:10>]Priority[1].
2020-11-26 17:25:02,027 INFO org.apache.flink.yarn.YarnResourceManager [] - Accepted 1 requested containers, returned 0 excess containers, 0 pending container requests of resource <memory:2048, vCores:10>.
2020-11-26 17:25:17,486 INFO org.apache.flink.yarn.YarnResourceManager [] - Requesting new TaskExecutor container with resource WorkerResourceSpec {cpuCores=10.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes)}. Number pending workers of this resource is 1.
上述行为一直持续到达到槽请求超时,然后遇到 NoResourceAvailable 异常。
2020-11-26 17:29:34,419 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: source_gi_v3 -> Flink Kafka Consumer -> (Column Reduction -> Timestamps/Watermarks, Map -> Sink: Dropped Messages HDFS Parquet Writer) (1/5) (79c554d2a3ff14499262a6a5e852f846) switched from SCHEDULED to FAILED on not deployed.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-streaming-framework-1.0.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-streaming-framework-1.0.jar:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-streaming-framework-1.0.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-streaming-framework-1.0.jar:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-streaming-framework-1.0.jar:?]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-streaming-framework-1.0.jar:?]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-streaming-framework-1.0.jar:?]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-streaming-framework-1.0.jar:?]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-streaming-framework-1.0.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-streaming-framework-1.0.jar:?]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-streaming-framework-1.0.jar:?]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-streaming-framework-1.0.jar:?]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-streaming-framework-1.0.jar:?]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-streaming-framework-1.0.jar:?]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-streaming-framework-1.0.jar:?]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-streaming-framework-1.0.jar:?]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-streaming-framework-1.0.jar:?]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-streaming-framework-1.0.jar:?]
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) ~[?:1.8.0_181]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_181]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
2020-11-26 17:29:34,440 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-11-26 17:29:34,441 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 12 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-11-26 17:29:34,443 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Aggregation (2e127b83d14dcaa44baae146360169ed) switched from state RUNNING to FAILING.
到目前为止,我们已经尝试了这些事情:
将并行度从 60 降低到 7
将任务管理器插槽增加到 10
在这些之后,我们仍然遇到同样的错误。有人可以建议如何解决我们遇到的错误。
解决方案
推荐阅读
- laravel - Laravel Horizon 没有显示工作
- java - 在 Tomcat 上部署失败
- ios - 当嵌入到 SwiftUI 中的 ScrollView 中时,VStack 内的内容会缩小
- github - 我无法删除 github 上的拉取请求
- c++ - 我的程序在句子中查找最长单词有什么问题?
- python - Celery 任务在 Docker 容器中始终处于 PENDING 状态(Flask + Celery + RabbitMQ + Docker)
- google-cloud-platform - 如何在 Terraform 脚本中指定公司的自定义启动映像?
- html - 单独对齐导航栏中的元素
- reactjs - 使用路由 api 使用 Next.js 从客户端从公共文件夹中删除文件
- javascript - 如何动态加载名称中包含内容哈希的 CSS 和 JS 文件?