首页 > 解决方案 > Flink 卡在启动 TaskManagers 中

问题描述

我们正在尝试使用以下设置在我们的纱线集群中运行我们的 flink 应用程序:

并行度:60

作业管理器内存:2g

任务管理器内存:2g

任务管理器插槽:1

当我们提交作业时,会发生以下情况

2020-11-26 17:24:40,450 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Received 1 containers.
2020-11-26 17:24:40,456 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Received 1 containers with resource <memory:2048, vCores:10>, 1 pending container requests.
2020-11-26 17:24:40,461 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - TaskExecutor container_e352_1592997214849_1528083_01_000005 will be started on dlawsq.sample.com with TaskExecutorProcessSpec {cpuCores=10.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemorySize=634.880mb (665719939 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=204.800mb (214748368 bytes)}.
2020-11-26 17:24:40,467 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Adding keytab hdfs://storagens1/user/user/.flink/application_1592997214849_1528083/user.keytab to the AM container local resource bucket
2020-11-26 17:24:41,259 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Creating container launch context for TaskManagers
2020-11-26 17:24:41,261 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Starting TaskManagers
2020-11-26 17:24:41,280 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Removing container request Capability[<memory:2048, vCores:10>]Priority[1].
2020-11-26 17:24:41,280 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Accepted 1 requested containers, returned 0 excess containers, 0 pending container requests of resource <memory:2048, vCores:10>.
2020-11-26 17:24:54,015 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint triggering task Source: radcom_gi_v3 -> Flink Kafka Consumer -> (Column Reduction -> Timestamps/Watermarks, Map -> Sink: Dropped Messages HDFS Parquet Writer) (1/5) of job 2e127b83d14dcaa44baae146360169ed is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-11-26 17:24:56,476 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Requesting new TaskExecutor container with resource WorkerResourceSpec {cpuCores=10.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes)}. Number pending workers of this resource is 1.
2020-11-26 17:25:01,978 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Received 1 containers.
2020-11-26 17:25:01,979 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Received 1 containers with resource <memory:2048, vCores:10>, 1 pending container requests.
2020-11-26 17:25:01,979 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - TaskExecutor container_e352_1592997214849_1528083_01_000010 will be started on dlawsq.sample.com with TaskExecutorProcessSpec {cpuCores=10.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemorySize=634.880mb (665719939 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=204.800mb (214748368 bytes)}.
2020-11-26 17:25:01,980 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Adding keytab hdfs://storagens1/user/user/.flink/application_1592997214849_1528083/user.keytab to the AM container local resource bucket
2020-11-26 17:25:02,024 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Creating container launch context for TaskManagers
2020-11-26 17:25:02,026 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Starting TaskManagers
2020-11-26 17:25:02,027 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Removing container request Capability[<memory:2048, vCores:10>]Priority[1].
2020-11-26 17:25:02,027 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Accepted 1 requested containers, returned 0 excess containers, 0 pending container requests of resource <memory:2048, vCores:10>.
2020-11-26 17:25:17,486 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Requesting new TaskExecutor container with resource WorkerResourceSpec {cpuCores=10.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes)}. Number pending workers of this resource is 1.

上述行为一直持续到达到槽请求超时,然后遇到 NoResourceAvailable 异常。

2020-11-26 17:29:34,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: source_gi_v3 -> Flink Kafka Consumer -> (Column Reduction -> Timestamps/Watermarks, Map -> Sink: Dropped Messages HDFS Parquet Writer) (1/5) (79c554d2a3ff14499262a6a5e852f846) switched from SCHEDULED to FAILED on not deployed.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-streaming-framework-1.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
    at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-streaming-framework-1.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-streaming-framework-1.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-streaming-framework-1.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-streaming-framework-1.0.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-streaming-framework-1.0.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-streaming-framework-1.0.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-streaming-framework-1.0.jar:?]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-streaming-framework-1.0.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-streaming-framework-1.0.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-streaming-framework-1.0.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-streaming-framework-1.0.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-streaming-framework-1.0.jar:?]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-streaming-framework-1.0.jar:?]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-streaming-framework-1.0.jar:?]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-streaming-framework-1.0.jar:?]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-streaming-framework-1.0.jar:?]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-streaming-framework-1.0.jar:?]
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) ~[?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_181]
    ... 25 more
Caused by: java.util.concurrent.TimeoutException
    ... 23 more
2020-11-26 17:29:34,440 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-11-26 17:29:34,441 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 12 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0. 
2020-11-26 17:29:34,443 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job  Aggregation (2e127b83d14dcaa44baae146360169ed) switched from state RUNNING to FAILING.

到目前为止,我们已经尝试了这些事情:

将并行度从 60 降低到 7

将任务管理器插槽增加到 10

在这些之后,我们仍然遇到同样的错误。有人可以建议如何解决我们遇到的错误。

标签: apache-flinkflink-streaming

解决方案


推荐阅读