When converting an RDD to a DataFrame, I get an EOFError. What causes this, and how can I prevent it?

Problem description

I get an "EOFError" when trying to convert an RDD to a DataFrame. What can I do to prevent this?

I have tried creating the DataFrame another way, but that has complications of its own. I believe the approach below is the simplest way to create the DataFrame.

data = data.zip(bool_converted).map(lambda x: (x[0][1], x[0][2], x[0][3], x[1][1], x[0][5], x[0][6], x[0][7], x[0][8], x[0][9], x[0][10], x[0][11]))

data = data.toDF()
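To make the zip + map step concrete, here is a plain-Python sketch of what it produces. Lists stand in for RDDs, and all the field values are made up for illustration (the original post does not show the data):

```python
# Plain-Python sketch of the zip + map step; lists stand in for RDDs,
# and the field values are purely illustrative.
data = [(0, "a", "b", "c", "raw_flag", "e", "f", "g", "h", "i", "j", "k")]
bool_converted = [(0, True)]  # e.g. (id, converted boolean)

# zip pairs the records positionally; each x is a (data_record, bool_record)
# tuple, and the selector picks fields from each side, substituting x[1][1]
# for the original fifth field.
paired = list(zip(data, bool_converted))
rows = [
    (x[0][1], x[0][2], x[0][3], x[1][1],
     x[0][5], x[0][6], x[0][7], x[0][8], x[0][9], x[0][10], x[0][11])
    for x in paired
]
print(rows[0])  # ('a', 'b', 'c', True, 'e', 'f', 'g', 'h', 'i', 'j', 'k')
```

The result is an RDD (here, a list) of plain positional tuples with no field names, which is what toDF() is then asked to infer a schema from.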

The actual error message is:

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 402, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 717, in read_int
    raise EOFError
EOFError

Tags: apache-spark, pyspark, apache-spark-sql

Solution


To make this work, store the x[0][i] values in a dictionary and unpack it into a Row. Two details matter: the dictionary must be built inside the mapped function, because x only exists per record, and the keys must be strings, since Row(**dc) raises a TypeError on integer keys like {1: ...}.

from pyspark.sql import Row

# build the dictionary per record (you can write your own function to do this)
def to_row(x):
    dc = {'c1': x[0][1], 'c2': x[0][2], 'c3': x[0][3]}  # ...and so on for the remaining fields
    return Row(**dc)

df = data.map(to_row).toDF()
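The Row(**dc) idiom can be sketched without Spark: a Row is essentially a tuple with named fields built from keyword arguments. Below, collections.namedtuple stands in for pyspark.sql.Row, and the field names ("c1".."c4") are illustrative, not from the original post:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a named tuple built from keyword arguments.
def make_row(**kwargs):
    RowType = namedtuple("Row", sorted(kwargs))
    return RowType(**kwargs)

# One zipped record: (data_record, bool_record), values made up for illustration.
x = ((0, "a", "b", "c"), (0, True))

# Keys must be strings: unpacking {1: ...} with ** raises a TypeError.
dc = {"c1": x[0][1], "c2": x[0][2], "c3": x[0][3], "c4": x[1][1]}
row = make_row(**dc)
print(row)  # Row(c1='a', c2='b', c3='c', c4=True)
```

Because each field now carries a name, a DataFrame built from such rows gets proper column names instead of the `_1`, `_2`, ... defaults that positional tuples produce.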
