Pyspark write.csv() killed by YARN for exceeding memory limits

Problem description

A premise up front: I don't control my cluster, and I'm working on the assumption that the problem is in my code, not in the setup my school uses. Maybe that's wrong, but it's the assumption this question rests on.

Why does write.csv() push my pyspark/slurm job over its memory limit, when many earlier operations on larger versions of the data succeeded, and what can I do about it?

The error I get (many iterations of it...) is:

18/06/02 16:13:41 ERROR YarnScheduler: Lost executor 21 on server.name.edu: Container killed by YARN for exceeding memory limits. 7.0 GB of 7 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

I know I can raise the memory limit, but I've already increased it several times with no change in the outcome, and I'm fairly confident I shouldn't be using anywhere near this much memory in any case. For reference, my slurm call is:

spark-submit \
    --master yarn \
    --num-executors 100 \
    --executor-memory 6g \
    3main.py

So what exactly am I trying to write? Well, I've read 39G of .bz2 json into an RDD,

allposts = ss.read.json(filename)

filtered it down a fair bit, counted words, grouped the RDD, ran some calculations, filtered some more, and finally I have these two print statements to get a sense of what's left...

abscounts = calculatePosts2(postRDD, sc, spark)
abscounts.printSchema()
print(abscounts.count())

These print statements work (output below). The resulting RDD is roughly 60 columns by 2000 rows, give or take. Those 60 columns are one string the length of a subreddit name and 59 doubles.

root
 |-- subreddit: string (nullable = true)
 |-- count(1): long (nullable = false)
 |-- sum(wordcount): long (nullable = true)
 |-- ingestfreq: double (nullable = true)
 |-- causefreq: double (nullable = true)
 |-- insightfreq: double (nullable = true)
 |-- cogmechfreq: double (nullable = true)
 |-- sadfreq: double (nullable = true)
 |-- inhibfreq: double (nullable = true)
 |-- certainfreq: double (nullable = true)
 |-- tentatfreq: double (nullable = true)
 |-- discrepfreq: double (nullable = true)
 |-- spacefreq: double (nullable = true)
 |-- timefreq: double (nullable = true)
 |-- exclfreq: double (nullable = true)
 |-- inclfreq: double (nullable = true)
 |-- relativfreq: double (nullable = true)
 |-- motionfreq: double (nullable = true)
 |-- quantfreq: double (nullable = true)
 |-- numberfreq: double (nullable = true)
 |-- swearfreq: double (nullable = true)
 |-- functfreq: double (nullable = true)
 |-- absolutistfreq: double (nullable = true)
 |-- ppronfreq: double (nullable = true)
 |-- pronounfreq: double (nullable = true)
 |-- wefreq: double (nullable = true)
 |-- ifreq: double (nullable = true)
 |-- shehefreq: double (nullable = true)
 |-- youfreq: double (nullable = true)
 |-- ipronfreq: double (nullable = true)
 |-- theyfreq: double (nullable = true)
 |-- deathfreq: double (nullable = true)
 |-- biofreq: double (nullable = true)
 |-- bodyfreq: double (nullable = true)
 |-- hearfreq: double (nullable = true)
 |-- feelfreq: double (nullable = true)
 |-- perceptfreq: double (nullable = true)
 |-- seefreq: double (nullable = true)
 |-- fillerfreq: double (nullable = true)
 |-- healthfreq: double (nullable = true)
 |-- sexualfreq: double (nullable = true)
 |-- socialfreq: double (nullable = true)
 |-- familyfreq: double (nullable = true)
 |-- friendfreq: double (nullable = true)
 |-- humansfreq: double (nullable = true)
 |-- affectfreq: double (nullable = true)
 |-- posemofreq: double (nullable = true)
 |-- negemofreq: double (nullable = true)
 |-- anxfreq: double (nullable = true)
 |-- angerfreq: double (nullable = true)
 |-- assentfreq: double (nullable = true)
 |-- nonflfreq: double (nullable = true)
 |-- verbfreq: double (nullable = true)
 |-- articlefreq: double (nullable = true)
 |-- pastfreq: double (nullable = true)
 |-- auxverbfreq: double (nullable = true)
 |-- futurefreq: double (nullable = true)
 |-- presentfreq: double (nullable = true)
 |-- prepsfreq: double (nullable = true)
 |-- adverbfreq: double (nullable = true)
 |-- negatefreq: double (nullable = true)
 |-- conjfreq: double (nullable = true)
 |-- homefreq: double (nullable = true)
 |-- leisurefreq: double (nullable = true)
 |-- achievefreq: double (nullable = true)
 |-- workfreq: double (nullable = true)
 |-- religfreq: double (nullable = true)
 |-- moneyfreq: double (nullable = true)

...

2026

After that, the only line left in my code is:

abscounts.write.csv('bigoutput.csv', header=True)

This crashes with the memory error above. That output absolutely should not take this kind of space (2000 rows of roughly 60 numeric columns is on the order of a megabyte)... What am I doing wrong here?

Thanks for your help.

In case you're curious / it helps, my full code is up on GitHub.

Tags: apache-spark, pyspark, apache-spark-sql

Solution

First of all, executor.memoryOverhead is not the same thing as executor-memory, as you can see here.

With Pyspark, memoryOverhead matters because it controls the extra memory Python may need for certain operations (see here); in your case, that is collecting and saving a CSV file for each partition.
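
One way to raise it is a --conf flag on the same slurm call from the question. A minimal sketch, keeping the question's script name and executor settings; the 2048 figure (interpreted as MB) is only an illustrative starting point, not a tuned value:

# same submission as in the question, with an explicit YARN overhead (MB)
spark-submit \
    --master yarn \
    --num-executors 100 \
    --executor-memory 6g \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    3main.py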

To help Python out, you could also consider using coalesce before the write.
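
Something along these lines (a sketch reusing abscounts and the output path from the question; since the result is only about 2000 rows, collapsing it to one partition is cheap and also yields a single output file):

# merge the small (~2000 row) result into one partition before writing,
# so a single Python worker materializes the CSV instead of one per partition
abscounts.coalesce(1).write.csv('bigoutput.csv', header=True)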

