SparkAppHandle state is LOST after submission, but the driver runs fine

Problem description

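I am using the Spark Java API to submit a driver to a local Spark cluster (1 master + 1 worker). After calling startApplication with a listener attached, the first call to stateChanged reports the LOST state.

The driver is submitted successfully and runs fine on the worker.

I have tried using a wait loop instead of a listener.

I have tried Spark versions 2.3.1 and 2.4.3.

I have tried on both OSX and Ubuntu.

I have tried changing the Spark master host to the machine's IP address instead of its hostname.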

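import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// env is a Map<String, String> of environment variables for the child spark-submit process.
SparkLauncher launcher = new SparkLauncher(env)
    .setAppResource(path)
    .setMainClass("full.package.name.RTADriver")
    .setMaster("spark://" + sparkMasterHost + ":" + sparkMasterPort)
    .setAppName("rta_scala_app_")
    .setDeployMode("cluster")
    .setConf("spark.ui.enabled", "true")
    .addAppArgs(runnerStr)
    .setVerbose(true);

SparkAppHandle handle = launcher.startApplication();

// This loop exits only on FINISHED, so it keeps printing once the state becomes LOST.
while (!handle.getState().equals(SparkAppHandle.State.FINISHED)) {
    System.out.println("Wait Loop: App_ID: " + handle.getAppId() + " state: " + handle.getState());
    Thread.sleep(10000);
}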

Output of the System.out calls in my code:

First State App_ID: null state: UNKNOWN
Wait Loop: App_ID: null state: UNKNOWN
Wait Loop: App_ID: null state: LOST
Wait Loop: App_ID: null state: LOST
...

Relevant spark-submit log output:

INFO: 19/06/04 11:27:54 INFO Utils: Successfully started service 'driverClient' on port 52077.
INFO: 19/06/04 11:27:54 INFO TransportClientFactory: Successfully created connection to /10.10.0.179:7077 after 34 ms (0 ms spent in bootstraps)
INFO: 19/06/04 11:27:54 INFO ClientEndpoint: Driver successfully submitted as driver-20190604112754-0030
INFO: 19/06/04 11:27:54 INFO ClientEndpoint: ... waiting before polling master for driver state
INFO: 19/06/04 11:27:59 INFO ClientEndpoint: ... polling master for driver state
INFO: 19/06/04 11:27:59 INFO ClientEndpoint: State of driver-20190604112754-0030 is RUNNING
INFO: 19/06/04 11:27:59 INFO ClientEndpoint: Driver running on 10.10.0.179:49705 (worker-20190603154544-10.10.0.179-49705)
INFO: 19/06/04 11:27:59 INFO ShutdownHookManager: Shutdown hook called
INFO: 19/06/04 11:27:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/90/pgndgkk11lj0qb4q5qw_f03c0000gn/T/spark-8d8d92b9-8d0c-43a1-8bb9-3d08f1519c53
Wait Loop: App_ID: null state: LOST
...

Tags: java, apache-spark

Solution


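I just ran into the same situation. My guess is that, because the deploy mode is "cluster", the Spark driver process runs on a different host from the Spark launcher process, so the launcher process "loses" its connection to the Spark application.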
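
A separate issue in the question's wait loop: it exits only on FINISHED, but LOST is also one of the final states of SparkAppHandle.State (the ones for which isFinal() returns true), so the loop never ends once the handle reports LOST. The following is a minimal sketch of a wait loop that stops on any final state or after a timeout. FinalStateWaiter and awaitFinalState are hypothetical names, not part of Spark, and the helper takes the state as a plain string supplier so it compiles without Spark on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Spark): polls a state name until it is final. */
final class FinalStateWaiter {

    // Names of the SparkAppHandle.State values for which isFinal() returns true.
    static final Set<String> FINAL_STATES =
            new HashSet<>(Arrays.asList("FINISHED", "FAILED", "KILLED", "LOST"));

    private FinalStateWaiter() {}

    /**
     * Polls stateName every pollMillis until it yields a final state,
     * the timeout elapses, or the thread is interrupted.
     * Returns the last state name observed, whether final or not.
     */
    static String awaitFinalState(Supplier<String> stateName,
                                  long pollMillis,
                                  long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        String state = stateName.get();
        while (!FINAL_STATES.contains(state) && System.nanoTime() < deadline) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
            state = stateName.get();
        }
        return state;
    }
}
```

With a real handle, the supplier would be () -> handle.getState().name(), for example FinalStateWaiter.awaitFinalState(() -> handle.getState().name(), 10000L, 600000L). When Spark is on the classpath, handle.getState().isFinal() is the more direct check. A returned LOST here means only that the launcher lost track of the application, not that the driver failed: in the log from the question, the master reports the driver as RUNNING just before the spark-submit client runs its shutdown hook.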

