Ignite TcpDiscoverySpi fails with a SocketTimeout critical system error alongside "accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0.." warnings

Problem description

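Using Ignite 2.7.6, I am trying to start an embedded Ignite server node (inside a Spring Boot application) on a Docker bridge network with a simple configuration. The server fails to start with the following error: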

[10:16:16] Ignite node started OK (id=e7276b83)
[10:16:16] >>> Ignite cluster is not active (limited functionality available). Use control.(sh|bat) script or IgniteCluster interface to activate.
[10:16:16] Topology snapshot [ver=1, locNode=e7276b83, servers=1, clients=0, state=INACTIVE, CPUs=1, offheap=0.1GB, heap=0.4GB]
mediation-service - [INFO ] 10:16:16.981 [main] com.**.**.perfmon.common.spring.EmbeddedIgnite    - ====>>> Activating Ignite Cluster
mediation-service - [WARN ] 10:16:17.383 [exchange-worker-#49] org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager     - Started write-ahead log manager in NONE mode, persisted data may be lost in a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.
[10:16:17] Started write-ahead log manager in NONE mode, persisted data may be lost in a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.
mediation-service - [ERROR] 10:16:21.982 [tcp-disco-srvr-#3] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi        - Failed to accept TCP connection.
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5845)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:5763)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
mediation-service - [WARN ] 10:16:21.982 [RMI TCP Accept-19887] sun.rmi.transport.tcp   - RMI TCP Accept-19887: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=19887] throws
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
        at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [WARN ] 10:16:21.982 [RMI TCP Accept-0] sun.rmi.transport.tcp       - RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=33254] throws
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
        at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [ERROR] 10:16:21.984 [tcp-disco-srvr-#3]    - Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.net.SocketTimeoutException: Accept timed out]]

The relevant configuration is below.

Ignite configuration XML snippet:

....
....
<property name="discoverySpi">
            <bean
                class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder"/>
                </property>
            </bean>
</property>
....
....

Docker Compose snippet:

services:
  ***-mediation-service:
    image: ***/mediation-service:latest
    build: .
    environment:
    - PERCENTAGE_OF_RAM_FOR_HEAP=80.0
    - SERVICE_NAME=mediation-service
    - SERVICE_PORT=9887
    - IGNITE_TCP_DISCOVERY_ADDRESSES=localhost
    - JAVA_TOOL_OPTIONS=-Dcom.sun.management.jmxremote=true
      -Dcom.sun.management.jmxremote.rmi.port=19887
      -Dcom.sun.management.jmxremote.port=19887
      -Dcom.sun.management.jmxremote.local.only=false
      -Dcom.sun.management.jmxremote.authenticate=false
      -Dcom.sun.management.jmxremote.ssl=false
      -Djava.rmi.server.hostname=$HOST_IP
      -Djava.net.preferIPv4Stack=true
      -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=29887
    ...
    ...
    networks:
      - something-mediation-network

networks:
  something-mediation-network:
    driver: bridge
    ipam:
      driver: default
      config:
      - subnet: 186.30.240.0/24

Does anyone know what is going on here?

Thanks, Muthu

Update (13 November 2020): I tried the same thing with 2.9.0, as @alamar suggested, but got the same result. See below.

mediation-service - [ERROR] 01:03:16.871 [tcp-disco-srvr-[:47500]-#3-#50] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi   - Failed to accept TCP connection.
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:6620)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:6543)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
mediation-service - [WARN ] 01:03:16.871 [RMI TCP Accept-19887] sun.rmi.transport.tcp   - RMI TCP Accept-19887: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=19887] throws
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
    at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [WARN ] 01:03:16.871 [RMI TCP Accept-0] sun.rmi.transport.tcp   - RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=33351] throws
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
    at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [ERROR] 01:03:16.876 [tcp-disco-srvr-[:47500]-#3-#50]   - Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.net.SocketTimeoutException: Accept timed out]]
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:6620)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:6543)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
mediation-service - [WARN ] 01:03:17.271 [tcp-disco-srvr-[:47500]-#3-#50] org.apache.ignite.internal.processors.cache.CacheDiagnosticManager    - Page locks dump:

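Update (18 November 2020):

I have another update: if I use Java 8 instead of Java 11, I do not see this problem during cluster activation and everything works fine.

So I suspect this is related to the underlying Java library usage or dependencies.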

Tags: docker, ignite, gridgain

Solution


The error means that the socket has a timeout set and that no incoming connection arrived within that timeout.

Interestingly, the socket that Ignite creates has no timeout! That points to a bug somewhere...
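To make the distinction concrete, here is a minimal sketch using only the standard java.net API. It is not Ignite code, and the class and method names are made up for illustration. It shows that a freshly created ServerSocket reports a timeout of 0 (accept blocks indefinitely), and that accept is only expected to throw SocketTimeoutException once a timeout has been set explicitly.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;

// Illustrative only: names are hypothetical, not taken from Ignite.
final class AcceptTimeoutDemo {
    /** Returns SO_TIMEOUT of a new ServerSocket; 0 means accept() blocks indefinitely. */
    static int defaultTimeoutMillis() throws IOException {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: pick any free port
            return server.getSoTimeout();
        }
    }

    /** Returns true if accept() times out when a timeout is set and nobody connects. */
    static boolean acceptTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            server.setSoTimeout(timeoutMillis);
            try {
                server.accept().close(); // no client connects, so this should not return
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the documented behaviour when a timeout is configured
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default SO_TIMEOUT (ms): " + defaultTimeoutMillis());
        System.out.println("accept timed out with 200 ms timeout: " + acceptTimesOut(200));
    }
}
```

With a timeout of 0, a SocketTimeoutException from accept should never happen, which is why the stack traces in the question look like a defect below Ignite.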

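...this time in Java: JDK-8237858. The bug description says that accept can be interrupted by a signal (which is expected), and that this causes Java to throw an error (which is the bug).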

According to the OpenJDK Jira, this does not affect Java 8. It is fixed in Java 16, and Java 13 with default settings is not affected either.

However, I do not see a fix mentioned for the Java 11 maintenance releases.

Update: this was fixed in 2.12. Basically, Ignite had to embed a workaround for the bug in its own code.
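For readers stuck on an older Ignite or writing their own accept loop, the general shape of such a workaround can be sketched as follows. This is an assumption about the approach, not Ignite's actual patch: if accept throws SocketTimeoutException while no timeout is configured, treat it as spurious and retry. The names acceptRetrying and FlakyServerSocket are hypothetical, and FlakyServerSocket only simulates the JDK bug so the sketch can run anywhere.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative sketch only; not the code that shipped in Ignite 2.12.
final class SpuriousTimeoutWorkaround {
    /** Accepts a connection, retrying on timeouts that cannot be genuine. */
    static Socket acceptRetrying(ServerSocket server) throws IOException {
        while (true) {
            try {
                return server.accept();
            } catch (SocketTimeoutException e) {
                if (server.getSoTimeout() > 0)
                    throw e; // a timeout was requested, so this one is real
                // No timeout configured: assume a spurious one (JDK-8237858) and retry.
            }
        }
    }

    /** Hypothetical test double: the first N accept() calls fail with a fake timeout. */
    static final class FlakyServerSocket extends ServerSocket {
        private int remainingFailures;

        FlakyServerSocket(int failures) throws IOException {
            super(0, 50, InetAddress.getLoopbackAddress()); // any free loopback port
            this.remainingFailures = failures;
        }

        @Override
        public Socket accept() throws IOException {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new SocketTimeoutException("Accept timed out");
            }
            return super.accept();
        }
    }

    /** Returns true if a client is accepted despite two simulated spurious timeouts. */
    static boolean demo() throws IOException {
        try (FlakyServerSocket server = new FlakyServerSocket(2);
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket accepted = acceptRetrying(server)) {
            return accepted.isConnected();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("accepted after spurious timeouts: " + demo());
    }
}
```

Checking getSoTimeout before retrying keeps genuine timeouts visible to callers that did ask for one; only the impossible case is swallowed.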

