
Problem Description

I have a Spark DataFrame df2 and I am iterating over it with for row in df2.rdd.collect():

df2 = spark.createDataFrame([
          ["PROG1","ACTION1","11","IN PROGRESS"],
          ["PROG2","ACTION2","12","NEW"],
          ["PROG3","ACTION1","20","FINISHED"],
          ["PROG4","ACTION4","14","IN PROGRESS"],
          ["PROG5","ACTION1","20","NEW"]
],["PROGRAM_NAME", "ACTION", "VALUE1", "STATUS"])

for row in df2.rdd.collect():
   # Update SharePoint using PATCH and get the response from SharePoint (already have the code for this)
   pass

What I need help with:

Take all rows from df2, add a new column RESPONSE, and create a new DataFrame df3.

This is what the two DataFrames look like. (Screenshots omitted.)

Tags: apache-spark, pyspark, apache-spark-sql, pyspark-dataframes

Solution


You can simply update each row inside the for loop by adding the new field RESPONSE, collecting the results into a list rdd3 and then creating the new DataFrame df3 from it:

from pyspark.sql import Row

rdd3 = []
for row in df2.rdd.collect():
    # other stuff here
    api_response = 200  # set the one from SharePoint
    rdd3.append(Row(**row.asDict(), RESPONSE=api_response))

df3 = spark.createDataFrame(rdd3, df2.columns + ["RESPONSE"])

df3.show()

#+------------+-------+------+-----------+--------+
#|PROGRAM_NAME| ACTION|VALUE1|     STATUS|RESPONSE|
#+------------+-------+------+-----------+--------+
#|       PROG1|ACTION1|    11|IN PROGRESS|     200|
#|       PROG2|ACTION2|    12|        NEW|     200|
#|       PROG3|ACTION1|    20|   FINISHED|     200|
#|       PROG4|ACTION4|    14|IN PROGRESS|     200|
#|       PROG5|ACTION1|    20|        NEW|     200|
#+------------+-------+------+-----------+--------+
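The core of the answer is the merge pattern Row(**row.asDict(), RESPONSE=...): the collected row's existing fields are unpacked into a dict and the new field is appended. A minimal sketch of the same idea with plain Python dicts (the sample values are illustrative, taken from the first row above):

```python
# Each collected Row behaves like a named record; row.asDict() turns it
# into a dict, and ** unpacking merges its fields with the new RESPONSE key.
row = {"PROGRAM_NAME": "PROG1", "ACTION": "ACTION1",
       "VALUE1": "11", "STATUS": "IN PROGRESS"}
api_response = 200  # illustrative value; in practice the SharePoint status code
new_row = {**row, "RESPONSE": api_response}
print(new_row)
# {'PROGRAM_NAME': 'PROG1', 'ACTION': 'ACTION1', 'VALUE1': '11', 'STATUS': 'IN PROGRESS', 'RESPONSE': 200}
```

Applying this per row and passing the resulting list to spark.createDataFrame with df2.columns + ["RESPONSE"] yields df3 as shown above.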
