Exit status: -100. Diagnostics: Container released on a *lost* node

Problem description

I have two input files (one JSON and the other Parquet), and I am trying to join these two large DataFrames and write the joined DataFrame to S3 (as JSON). The job gets stuck forever (while writing the joined JSON to S3). I am using 70 r3.4xlarge instances (worker/slave nodes).

df1.rdd.partitions.size = 34234 (size ~4 TB)

df2.rdd.partitions.size = 1200 (size ~58 GB)
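
For context, a minimal sketch of the pipeline described above; the S3 paths and the join key "id" are placeholders, not taken from the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-and-write").getOrCreate()

// Read the two inputs: one JSON, one Parquet (paths are placeholders)
val df1 = spark.read.json("s3://bucket/input/json/")        // ~4 TB, 34234 partitions
val df2 = spark.read.parquet("s3://bucket/input/parquet/")  // ~58 GB, 1200 partitions

// Partition counts as reported above
println(df1.rdd.partitions.size)
println(df2.rdd.partitions.size)

// Join on a hypothetical key and write the result back to S3 as JSON
val joined = df1.join(df2, Seq("id"))
joined.write.json("s3://bucket/output/json/")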

Things I have tried so far, with no improvement (a combined sketch of these settings follows the list):

Dynamic allocation with maximizeResourceAllocation set to true.

Static allocation:

spark.executor.cores = 5

spark.executor.memory = 40G

spark.executor.instances = 209

Changing the partitioning: I set spark.default.parallelism and spark.sql.shuffle.partitions to 2000, 4000, 8000, 10000, 20000 and 35000, but it did not help.

Intermediate persistence: persisting the joined DataFrame (with MEMORY_AND_DISK and DISK_ONLY storage levels), and also persisting both inputs (before the join), running some actions on both DataFrames, then joining and writing to S3.

Tuning mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize (to 750000000).

I also tried 30 r3.8xlarge instances. No improvement ☹
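
For reference, a minimal sketch of how the settings listed above could be applied together, assuming a fresh session (in practice they would typically be passed via spark-submit or the Zeppelin interpreter settings); the S3 paths and the join key "id" are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Static-allocation values from the question. maximizeResourceAllocation (used for the
// dynamic-allocation attempt) is an EMR cluster-level setting, not a Spark property here.
val spark = SparkSession.builder()
  .appName("join-and-write")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "40g")
  .config("spark.executor.instances", "209")
  // One of the shuffle-partition values that were tried (2000 up to 35000)
  .config("spark.default.parallelism", "8000")
  .config("spark.sql.shuffle.partitions", "8000")
  .getOrCreate()

// Input split tuning from the question
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.minsize", "750000000")
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.maxsize", "750000000")

// Intermediate persistence attempt: persist both inputs, materialize them,
// then join and write (DISK_ONLY was also tried instead of MEMORY_AND_DISK).
val df1 = spark.read.json("s3://bucket/input/json/").persist(StorageLevel.MEMORY_AND_DISK)
val df2 = spark.read.parquet("s3://bucket/input/parquet/").persist(StorageLevel.MEMORY_AND_DISK)
df1.count(); df2.count()   // force materialization before the join

df1.join(df2, Seq("id")).write.json("s3://bucket/output/json/")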

I keep getting one of these two errors:

zeppelin-interpreter-spark-zeppelin-ip-10-0-1-213.log: WARN [2019-02-12 04:54:43,437] ({dispatcher-event-loop-8} Logging.scala[logWarning]:66) - Lost task 24117.0 in stage 3.0 (TID 32666, ip-10-0-1-242.ec2.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1549914591854_0018_01_000010 on host: ip-10-0-1-242.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)
  ... 48 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2234 in stage 15.0 failed 4 times, most recent failure: Lost task 2234.3 in stage 15.0 (TID 136390, ip-10-0-1-56.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1708)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1696)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1695)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1695)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1923)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1878)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1867)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
  ... 82 more

Can someone tell me what I am doing wrong here?

Tags: scala, apache-spark, hadoop, apache-spark-sql

Solution


It looks like the executors are being lost because of memory pressure. Try tuning the Spark memory settings in spark-defaults.conf, or increase the cluster's compute resources.
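
Purely as an illustration (not part of the original answer), these are the kinds of memory-related properties commonly adjusted for lost-executor failures on Spark 2.x/YARN, shown here as session configuration; the values are examples, not recommendations:

import org.apache.spark.sql.SparkSession

// Example values only; the right numbers depend on instance size and workload.
// In spark-defaults.conf the same keys appear as whitespace-separated "key value" lines.
val spark = SparkSession.builder()
  .appName("join-and-write")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "32g")
  // Off-heap headroom per executor (value in MB for Spark 2.x on YARN;
  // Spark 2.3+ also accepts spark.executor.memoryOverhead). Too little overhead
  // is a common reason YARN kills containers and executors are reported lost.
  .config("spark.yarn.executor.memoryOverhead", "6144")
  // Smaller shuffle tasks reduce per-task memory pressure during the join.
  .config("spark.sql.shuffle.partitions", "8000")
  .getOrCreate()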

