Databricks job timed out with error: Lost executor 0 on [IP]. Remote RPC client disassociated

Problem description

Full error: Databricks job timed out with error: Lost executor 0 on [IP]. Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

We run jobs on an Azure Databricks subscription using the Jobs API 2.0, use the Pools interface to reduce cluster spin-up time, and use Standard_DS12_v2 for both the workers and the driver.

We have a job (JAR main) that makes a single SQL procedure call. This call takes a little over 1.2 hours to complete. Exactly 1 hour in, the worker node is killed and the job status becomes Timed Out. We suspected the node was being treated as idle over that 1-hour span, so we added a sniffer thread that logs a message every 10 minutes. That did not fix the issue. The logs are below:
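For context, a job submitted through the Jobs API 2.0 carries its settings as a JSON payload, and the `timeout_seconds` field in that payload bounds the run's wall-clock time. A minimal sketch of such settings (the job name, pool ID, and main class below are hypothetical, not from the real job):

```python
import json

# Hypothetical Jobs API 2.0 job settings for a JAR job on a pool-backed
# cluster. All names and IDs are illustrative.
job_settings = {
    "name": "sql-procedure-job",
    "new_cluster": {
        # Pool of Standard_DS12_v2 nodes; pools cut cluster spin-up time.
        "instance_pool_id": "pool-0123-456789-example",
        "num_workers": 1,
    },
    "spark_jar_task": {
        "main_class_name": "com.example.Main",  # hypothetical main class
    },
    # If a run exceeds this many seconds, Databricks kills it and the
    # run status becomes "Timed Out".
    "timeout_seconds": 3600,
}

print(json.dumps(job_settings, indent=2))
```

The significance of `timeout_seconds` here becomes clear in the accepted answer below the logs.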

20/01/16 10:49:43 INFO StaticConf$: DB_HOME: /databricks
20/01/16 10:49:43 INFO DriverDaemon$: ========== driver starting up ==========
20/01/16 10:49:43 INFO DriverDaemon$: Java: Private Build 1.8.0_232
20/01/16 10:49:43 INFO DriverDaemon$: OS: Linux/amd64 4.15.0-1050-azure
20/01/16 10:49:43 INFO DriverDaemon$: CWD: /databricks/driver
20/01/16 10:49:43 INFO DriverDaemon$: Mem: Max: 17.5G loaded GCs: PS Scavenge, PS MarkSweep
20/01/16 10:49:43 INFO DriverDaemon$: Logging multibyte characters: ✓
20/01/16 10:49:43 INFO DriverDaemon$: 'publicFile' appender in root logger: class com.databricks.logging.RedactionRollingFileAppender
20/01/16 10:49:43 INFO DriverDaemon$: 'org.apache.log4j.Appender' appender in root logger: class com.codahale.metrics.log4j.InstrumentedAppender
20/01/16 10:49:43 INFO DriverDaemon$: 'null' appender in root logger: class com.databricks.logging.RequestTracker
20/01/16 10:49:43 INFO DriverDaemon$: == Modules:
20/01/16 10:49:44 INFO DriverDaemon$: Starting prometheus metrics log export timer
20/01/16 10:49:44 INFO DriverDaemon$: Universe Git Hash: 422793c171cb2855a8f424d226006093e5349873
20/01/16 10:49:44 INFO DriverDaemon$: Spark Git Hash: 0c5791fc51d5c2b434155df16049c9f78e12e8fb
20/01/16 10:49:44 WARN RunHelpers$: Missing tag isolation client: java.util.NoSuchElementException: key not found: TagDefinition(clientType,The client type for a request, used for isolating resources for the request.)
20/01/16 10:49:44 INFO DatabricksILoop$: Creating throwaway interpreter
20/01/16 10:49:44 INFO SparkConfUtils$: Customize spark config according to file /tmp/custom-spark.conf
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.delta.preview.enabled -> true
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.network.timeout -> 4000
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.driver.host -> 10.30.2.205
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.executor.tempDirectory -> /local_disk0/tmp
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.secret.envVar.keys.toRedact -> 
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.driver.tempDirectory -> /local_disk0/tmp
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.secret.sparkConf.keys.toRedact -> 


20/01/16 11:49:32 INFO DriverCorral$: Cleaning the wrapper ReplId-20fe8-56e17-17323-1 (currently in status Running(ReplId-20fe8-56e17-17323-1,ExecutionId(job-368-run-1-action-368),RunnableCommandId(6227870104230535817)))
20/01/16 11:49:32 INFO DAGScheduler: Asked to cancel job group 2377484361178493489_6227870104230535817_job-368-run-1-action-368
20/01/16 11:49:32 INFO ScalaDriverLocal: cancelled jobGroup:2377484361178493489_6227870104230535817_job-368-run-1-action-368 
20/01/16 11:49:32 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2377484361178493489,None,Some(job-368-run-1-action-368)).
20/01/16 11:49:35 ERROR TaskSchedulerImpl: Lost executor 0 on 10.30.2.208: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
20/01/16 11:49:35 INFO DAGScheduler: Executor lost: 0 (epoch 1)
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, 10.30.2.208, 41985, None)
20/01/16 11:49:35 INFO DBCEventLoggingListener: Rolling event log; numTimesRolledOver = 1
20/01/16 11:49:35 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
20/01/16 11:49:35 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200116104947-0000/0 is now LOST (worker lost)
20/01/16 11:49:35 INFO StandaloneSchedulerBackend: Executor app-20200116104947-0000/0 removed: worker lost
20/01/16 11:49:35 INFO BlockManagerMaster: Removal of executor 0 requested
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
20/01/16 11:49:35 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
20/01/16 11:49:35 INFO DBCEventLoggingListener: Rolled active log file /databricks/driver/eventlogs/5656882020603523684/eventlog to /databricks/driver/eventlogs/5656882020603523684/eventlog-2020-01-16--11-00
20/01/16 11:49:35 INFO StandaloneAppClient$ClientEndpoint: Master removed worker worker-20200116104954-10.30.2.208-38261: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO DBCEventLoggingListener: Logging events to eventlogs/5656882020603523684/eventlog
20/01/16 11:49:35 INFO StandaloneSchedulerBackend: Worker worker-20200116104954-10.30.2.208-38261 removed: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO TaskSchedulerImpl: Handle removed worker worker-20200116104954-10.30.2.208-38261: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO DAGScheduler: Shuffle files lost for worker worker-20200116104954-10.30.2.208-38261 on host 10.30.2.208

On the cluster's Jobs page we can see the event log:

[screenshot: cluster event log]

On the job run page we can see the status is Timed Out:

[screenshot: run status "Timed Out"]

As we can see in the logs:

  1. The error occurs exactly 1 hour after the job starts.
  2. The job is configured as follows:

========================= Edit 1

The problem is reproducible and occurs every time we run the simple JAR file.


Tags: apache-spark, driver, databricks, azure-databricks, executor

Solution


Just to surface the answer that was provided in the comments:

Quoting @ankur:

Sorry guys, we did a very, very silly thing in the API call. We were passing timeout_seconds as 3600. Really embarrassing.
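In other words, the run was killed by the job's own `timeout_seconds` setting (3600 s = exactly 1 hour), not by idle detection or a network fault. A sketch of the fix, assuming the job is patched through the Jobs API 2.0 `jobs/update` endpoint (the workspace URL, token, and job ID below are placeholders; the request is built but deliberately not sent):

```python
import json
import urllib.request

DATABRICKS_HOST = "https://example.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi-placeholder-token"                          # placeholder PAT
JOB_ID = 368                                              # illustrative job ID

# Partial update: raise timeout_seconds so the ~1.2-hour SQL call can finish.
payload = {
    "job_id": JOB_ID,
    "new_settings": {
        "timeout_seconds": 7200,  # 2 hours; any bound above the real runtime works
    },
}

request = urllib.request.Request(
    url=f"{DATABRICKS_HOST}/api/2.0/jobs/update",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(request) would send it; omitted so the sketch
# stays runnable without a live workspace.
print(request.full_url)
```

Omitting `timeout_seconds` from the job settings entirely also avoids the kill, since the default is to run without a timeout.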

