PySpark: comparing two columns that are lists

Problem description

I have a DataFrame like the one below. Both columns are lists.

# Assumes `sc` is an existing SparkContext (for example, in the pyspark shell)
df = sc.parallelize([
    {"subject_1": ['A', 'B'], "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'C'], "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'B', 'D'], "subject_2": ['A', 'B', 'E']},
]).toDF()
df.show()


I need to transform the DataFrame by adding three new columns derived from the first two. This requires comparing the items in the two list columns.

What is the best way to do this?

Tags: python, pyspark

Solution


For Spark 2.4+, use array_intersect and array_except:

from pyspark.sql import functions as F

df.withColumn("both", F.array_intersect("subject_1","subject_2"))\
  .withColumn("only_1", F.array_except("subject_1","subject_2"))\
  .withColumn("only_2", F.array_except("subject_2","subject_1")).show()

#+---------+---------+------+------+------+
#|subject_1|subject_2|  both|only_1|only_2|
#+---------+---------+------+------+------+
#|   [A, B]|[A, B, C]|[A, B]|    []|   [C]|
#|   [A, C]|[A, B, C]|[A, C]|    []|   [B]|
#|[A, B, D]|[A, B, E]|[A, B]|   [D]|   [E]|
#+---------+---------+------+------+------+
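
To make the per-row logic concrete, here is a minimal plain-Python sketch of what the two functions compute for a single row. The helper names `intersect_lists` and `except_lists` are hypothetical and are not part of PySpark. The sketch assumes that results keep the order of the first list and drop duplicates, which matches the output shown above but is only a rough model of Spark's behavior, not a reference implementation.

```python
def intersect_lists(a, b):
    """Items of `a` that also appear in `b`, in `a`'s order, without duplicates."""
    b_items = set(b)
    seen = set()
    result = []
    for item in a:
        if item in b_items and item not in seen:
            seen.add(item)
            result.append(item)
    return result


def except_lists(a, b):
    """Items of `a` that do not appear in `b`, in `a`'s order, without duplicates."""
    b_items = set(b)
    seen = set()
    result = []
    for item in a:
        if item not in b_items and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# The same three rows as in the question, as plain Python tuples
rows = [
    (['A', 'B'], ['A', 'B', 'C']),
    (['A', 'C'], ['A', 'B', 'C']),
    (['A', 'B', 'D'], ['A', 'B', 'E']),
]

for subject_1, subject_2 in rows:
    both = intersect_lists(subject_1, subject_2)
    only_1 = except_lists(subject_1, subject_2)
    only_2 = except_lists(subject_2, subject_1)
    print(subject_1, subject_2, both, only_1, only_2)
```

Note that the argument order matters for the "except" case: swapping the two lists is what distinguishes only_1 from only_2.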
