python - How do I avoid this Spark error when converting a DataFrame to a list of dictionaries?

Problem description

I want to convert my Spark DataFrame to a list of dictionaries:

 new_df = list(map(lambda row: row.asDict(), df_base.collect()))

But whenever I run the above, I keep getting the following error.

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 5 tasks (4.3 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

How can I fix this? Is it even possible to do what I want?

Tags: python apache-spark pyspark

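Solution

The short answer is to use df_base.toLocalIterator() instead of collect(). toLocalIterator() brings the rows to the driver one partition at a time, so no single transfer has to carry the whole result, whereas collect() returns everything at once and hits the spark.driver.maxResultSize limit. But do you really need to load more than 4 GB of data into a local Python list? Have you considered using df_base.toPandas(), or running all of your code in Spark?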

 new_df = list(map(lambda row: row.asDict(), df_base.toLocalIterator()))
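If the complete list still does not fit comfortably in driver memory, one option is to work through the rows in batches instead of materializing them all at once. The following is only a minimal sketch under some assumptions: dict_batches and FakeRow are hypothetical names introduced here, and FakeRow merely stands in for pyspark.sql.Row so that the example runs without a Spark session.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def dict_batches(rows: Iterable[Any], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size dicts, each built with row.asDict()."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        # islice pulls at most batch_size rows from the shared iterator.
        batch = [row.asDict() for row in islice(it, batch_size)]
        if not batch:
            return
        yield batch


class FakeRow:
    """Hypothetical stand-in for pyspark.sql.Row (only asDict() is mimicked)."""

    def __init__(self, **fields: Any) -> None:
        self._fields = fields

    def asDict(self) -> Dict[str, Any]:
        return dict(self._fields)


# Demo input; with a real DataFrame you would pass df_base.toLocalIterator() instead.
demo_rows = (FakeRow(id=i, value=i * i) for i in range(5))
batches = list(dict_batches(demo_rows, batch_size=2))
```

With a real DataFrame the call would look like dict_batches(df_base.toLocalIterator(), 10000), handling each batch inside a for loop so that only one batch of dictionaries is held at a time.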
