Spark: Failed to save a DataFrame to Google Cloud Storage

Problem description

The final stage of the Spark job saves 37 GB of data in Avro format to a GCS bucket. The Spark application runs on Dataproc.

My cluster consists of 15 workers with 4 cores and 15 GB RAM each, and 1 master with 4 cores and 15 GB RAM.

I use the following code:

df.write.option("forceSchema", schema_str) \
            .format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")

Final executor statistics: [executor summary screenshot]

Across the 4 attempts Spark made on one of the failing tasks, these are the errors I got:

1/4. java.lang.StackOverflowError

2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50

[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)

3/4. java.lang.StackOverflowError
4/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
    Container id: container_1607696154227_0002_01_000028
    Exit code: 50
    
    [2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    Last 4096 bytes of stderr :
    readOrdinaryObject(ObjectInputStream.java:2187)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)

The Spark UI gives me this:

[Spark UI screenshot]

From the UI it is clear that something is going on with the data distribution, but repartitioning produces the same StackOverflow error.
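
For reference, the repartition attempt was roughly of this form (a sketch, not the exact code from the post; the partition count of 200 and the choice of columns are assumptions):

df.repartition(200, 'platform', 'cluster') \
            .write.option("forceSchema", schema_str) \
            .format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")  # still fails with java.lang.StackOverflowError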

So the two questions I want to ask:

  1. How should I interpret the "container prelaunch error" message in the context of the StackOverflowError?

  2. Why do the other operations in the job run fine even though the data distribution is the same?

Tags: apache-spark, google-cloud-platform, pyspark, apache-spark-sql, distributed-computing

Solution


The problem is not your cluster capacity. It is that you are using the Avro format while forcing Spark to write out a new schema at save time. Try saving without the forced schema and it should work. If you want to apply the schema before saving, for example via withColumn, also check the number of shuffles.

df.write.format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")
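
If you still want to control the output types, one alternative (a sketch for illustration; the column name and cast are assumptions, not from the original post) is to cast columns with withColumn before writing and let spark-avro derive the schema from the DataFrame instead of forcing one at write time:

from pyspark.sql import functions as F

# Hypothetical illustration: cast a column to the desired type up front,
# then write without forceSchema so the Avro schema is derived from the DataFrame.
df_typed = df.withColumn('platform', F.col('platform').cast('string'))

df_typed.write.format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")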
