Job stealing by new nodes with Ignite compute

Problem description

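I am trying to compute a batch of tasks on an Ignite cluster whose nodes use the job stealing strategy.

Everything works fine unless a new node joins the cluster after the batch has already started: that node does not seem able to steal any tasks from the batch that is already running. I get the following message: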

'SEVERE: Failed to send job stealing message to node: TcpDiscoveryNode [...]'

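I think this is an existing issue, described here: https://issues.apache.org/jira/browse/IGNITE-1267

According to that thread the issue appears to have been fixed, but the problem still occurs in Ignite 2.6.0.

Here is my compute configuration: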

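    // cfg is not declared in the original snippet; assumed to be an IgniteConfiguration.
    IgniteConfiguration cfg = new IgniteConfiguration();

    JobStealingCollisionSpi spi = new JobStealingCollisionSpi();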
    spi.setWaitJobsThreshold(1);
    spi.setMessageExpireTime(1000);
    spi.setMaximumStealingAttempts(10);
    spi.setActiveJobsThreshold(1);
    spi.setStealingEnabled(true);

    JobStealingFailoverSpi failoverSpi = new JobStealingFailoverSpi();
    cfg.setCollisionSpi(spi);
    cfg.setFailoverSpi(failoverSpi);

    Ignite ignite = Ignition.start(cfg);

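Am I doing something wrong?

EDIT: I tried to reproduce it, but now it seems to work as expected. This is very strange behavior!

EDIT2: I managed to reproduce the problem, although only intermittently. Here is the stack trace: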

class org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[0:0:0:0:0:0:0:1%lo, 10.36.3.4, 127.0.0.1], sockAddrs=[/10.36.3.4:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1543917557221, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2718)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2651)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToCustomTopic(GridIoManager.java:1703)
    at org.apache.ignite.internal.managers.GridManagerAdapter$1.send(GridManagerAdapter.java:422)
    at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.checkIdle(JobStealingCollisionSpi.java:1074)
    at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.onCollision(JobStealingCollisionSpi.java:722)
    at org.apache.ignite.internal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:119)
    at org.apache.ignite.internal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:712)
    at org.apache.ignite.internal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:111)
    at org.apache.ignite.internal.processors.job.GridJobProcessor$JobDiscoveryListener.onEvent(GridJobProcessor.java:2008)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1384)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:873)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:858)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:341)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:307)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2703)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:2920)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2732)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[/10.36.3.4:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3422)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2958)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2841)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2692)
    ... 20 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
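
The exception message above recommends setting a timeout on each ComputeTask so that callers do not wait forever when a node is unreachable. The following is a minimal sketch of one way to do that with `IgniteCompute.withTimeout`, assuming a node started with the default configuration. The class name `TimeoutSketch`, the 30-second value, and the trivial callable are placeholders that do not come from the question. This only bounds how long the caller waits; it is not expected to fix the job stealing failure itself.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCompute;
import org.apache.ignite.Ignition;
import org.apache.ignite.compute.ComputeTaskTimeoutException;
import org.apache.ignite.lang.IgniteCallable;

public class TimeoutSketch {
    public static void main(String[] args) {
        // Starts a node with the default configuration (assumption: the real
        // application would pass its own IgniteConfiguration instead).
        try (Ignite ignite = Ignition.start()) {
            // The timeout applies to the next task executed from this thread.
            // 30 seconds is an arbitrary illustrative value.
            IgniteCompute compute = ignite.compute().withTimeout(30_000L);

            try {
                // Placeholder job standing in for one unit of the batch.
                Integer result = compute.call((IgniteCallable<Integer>) () -> 42);
                System.out.println("Result: " + result);
            } catch (ComputeTaskTimeoutException e) {
                // Raised when the task does not finish within the timeout.
                System.err.println("Task timed out: " + e.getMessage());
            }
        }
    }
}
```

Because the timeout is consumed by the next task started from the calling thread, `withTimeout` would have to be called again before each task in a batch.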

Tags: java, ignite, grid-computing

Solution

