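java - Connecting to remote task manager has failed. This might indicate that the remote task manager has been lost
Question
I created a Flink standalone cluster with 1 JobManager and 2 TaskManagers.
When I submit a batch job, one of the TaskManagers throws the error below. The Flink dashboard shows both TaskManagers as alive. The example WordCount program works.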
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'hostname/127.0.0.1:46537' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:86)
at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:42)
at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'hostname/127.0.0.1:46537' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Caused by: java.net.ConnectException: Connection refused: ekablr-ca-s010/127.0.0.1:46537
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 6 more
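Answer
Things to check:
- Is something in your job slowing the TaskManager down so much that it cannot reply to requests before they time out? Watch the CPU usage on the machines running the TaskManagers.
- Is the TaskManager actually still alive (that is, still sending heartbeats, with live updates visible in the TaskManager section of the dashboard), or has the JobManager simply not yet given up on it and marked it as dead? A job can kill the TaskManager it is running on, but I have generally only seen that happen with native code run through JNI or with other code that manipulates memory directly.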
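
One way to test the second point from outside Flink is to try opening a plain TCP connection to the address in the exception. The following is a minimal sketch, not part of Flink: the class `TaskManagerProbe` and its methods are hypothetical names, and the default port 46537 is simply the one from the trace above. Because the trace shows the host resolving to 127.0.0.1, the sketch also reports whether a hostname resolves to a loopback address, which may be worth ruling out when the TaskManagers run on different machines.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical helper for manual diagnosis; it does not use any Flink API.
public class TaskManagerProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, and unresolvable host all end up here.
            return false;
        }
    }

    /** Returns true if the hostname resolves to a loopback address such as 127.0.0.1. */
    public static boolean resolvesToLoopback(String host) throws UnknownHostException {
        return InetAddress.getByName(host).isLoopbackAddress();
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults for illustration; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 46537;

        System.out.println(host + " resolves to loopback: " + resolvesToLoopback(host));
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

Run it from the machine hosting the failing TaskManager, passing the other TaskManager's hostname and the port from the exception. If the connection is refused while the dashboard still lists that TaskManager as alive, the problem is more likely the address or port being advertised than a dead process.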