Converting a DataFrame to CSV throws an error in PySpark

Problem Description

I have a huge DataFrame of roughly 7 GB of records. I am trying to get the count of the DataFrame and to download it as a single CSV, and both operations fail with the error below. Is there another way to download the DataFrame without writing multiple partitions?

# Count the rows in the DataFrame
print(df.count())
# Collapse to a single partition so the output is one CSV file
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')



Error:
java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======>                                                (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Tags: dataframe, apache-spark, pyspark, pyspark-dataframes, cdsw

Solution
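
The `java.io.IOException: Stream is corrupted` raised from `LZ4BlockInputStream` via `ReadAheadInputStream` is a known Spark problem with LZ4-compressed shuffle/spill streams (see SPARK-18105). A common workaround is to switch the I/O compression codec away from lz4 and/or disable read-ahead on spill files. A minimal sketch, applied before the SparkSession is created (both config keys are standard Spark options, but verify their effect on your Spark version; the app name is hypothetical):

from pyspark.sql import SparkSession

# Workaround sketch for the LZ4 "Stream is corrupted" error (SPARK-18105)
spark = (
    SparkSession.builder
    .appName("csv-export")  # hypothetical app name
    # Compress shuffle/spill data with snappy instead of the default lz4
    .config("spark.io.compression.codec", "snappy")
    # Disable read-ahead on spill files; ReadAheadInputStream appears in the trace
    .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
    .getOrCreate()
)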


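If the job still fails, avoid forcing all 7 GB through a single task: `coalesce(1)` removes the parallelism of the entire final stage. A more robust pattern, sketched below, is to let Spark write its usual part files in parallel and merge them into one CSV afterwards (note that with the `header` option every part file would get its own header row, so write without headers and add one while merging):

# Write part files in parallel to a directory, without per-file headers
df.write.mode("overwrite").csv('/user/ABC/Output')

The part files can then be concatenated into a single local CSV with `hdfs dfs -getmerge /user/ABC/Output Output.csv`, prepending a header line by hand if needed.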