amazon-s3 - Apache Flink 错误检查点到 S3
问题描述
我们在 EMR 集群上运行 Apache Flink (1.4.2)。我们正在检查点到 S3 存储桶,并且每秒通过流推送大约 5,000 条记录。我们最近在日志中看到以下错误:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
紧接着,我们在日志中得到以下信息:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
此时,流崩溃并且无法自动恢复,但是我们可以手动重新启动流,而无需更改 s3 存储桶的位置。在推送到 S3 时发生崩溃的事实让我认为这是问题的症结所在。
有任何想法吗?
解决方案
仅供参考,这是由于节点之间的串扰过多导致每台服务器上的 NIC 泛滥。解决方案是更智能的分区。
推荐阅读
- r - 为什么在 ArcGIS 和 R 中 Eckert IV 投影的处理方式不同?
- javascript - 如何使用箭头函数组件(rafce)定义 defaultProps?
- python - 如何检查频道是否已存在?
- typescript - TypeScript 如何实现可以将泛型函数类型作为参数的泛型函数?
- r - 如何在R中执行kmean聚类除以组变量
- javascript - 如何从 chrome 扩展更改 google chrome 主页/启动页面?
- mongodb - 无法在 GitHub Actions Maven 构建中运行嵌入式 mongo fladdoodle
- javascript - 找不到变量:setState
- linux - REDHAWK IDL / jtrs-interfaces 到目标 SDR
- html - 无法提供指向徽标的锚链接