PySpark DataFrame: remove duplicates in an array column

Problem description

I would like to remove duplicated words in a column of a PySpark DataFrame.

My attempt is based on "Remove duplicates from PySpark array column".

My Spark:

  2.4.5

Py3 code:

  from pyspark.sql import functions as F  # F is used below but was not imported

  test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
  t3 = test_df.withColumn("text", F.array("text")) # have to convert it to array because the original large df is array type.

  t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
  t5 = t4.withColumn('text', F.array_distinct("text"))
  t5.show(1, 120)

but I got

 +------------------------------------------------------+
 |                                                  text|
 +------------------------------------------------------+
 |[i like this book and this book be downloaded on line]|
 +------------------------------------------------------+

I need to remove

 book and this

It seems that array_distinct cannot filter them out?

thanks

Tags: python, dataframe, apache-spark, pyspark

Solution


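You can combine the Spark SQL functions lcase, split, array_distinct and array_join, called through F.expr from pyspark.sql.functions. In the code above, F.array("text") wraps the whole sentence as a single array element, so array_distinct has only one element to compare and nothing to remove; the sentence has to be split into words first.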

For example: F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

Here is working code:

import pyspark.sql.functions as F

# df is assumed to have a string column named "text"
(df
 .withColumn("text_new",
    F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')"))
 .show(truncate=False))
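
The question mentions that the real DataFrame column is already an array. In that case the same expression can probably be applied to each element with transform, the higher-order function the question already uses on Spark 2.4. A minimal sketch, assuming every array element is a sentence string; dedupe_expr is a hypothetical helper that only builds the SQL text, and t3 is the DataFrame from the question:

```python
def dedupe_expr(col: str) -> str:
    """Build a Spark SQL expression that removes repeated words
    inside every string element of the array column `col`."""
    # Per element: lowercase, split on spaces, drop repeats, join back.
    per_element = "array_join(array_distinct(split(lower(x), ' ')), ' ')"
    return f"transform({col}, x -> {per_element})"


# Assumed usage with the question's DataFrame (needs a Spark session):
#   from pyspark.sql import functions as F
#   t5 = t3.withColumn("text", F.expr(dedupe_expr("text")))
#   t5.show(1, 120)
print(dedupe_expr("text"))
```

The column stays an array this way, so downstream code that expects the array type does not have to change.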

Explanation:

Here you first convert everything to lowercase with lcase(text), then split the string into an array on spaces with split(text, ' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

You then pass it to array_distinct, which produces

[i, like, this, book, and, be, downloaded, on, line]

Finally, join the elements back together with spaces using array_join:

i like this book and be downloaded on line
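
The three steps can be mirrored in plain Python to check the expected result without a Spark session. This is only an illustration of the logic, not Spark code: dedupe_words is a hypothetical name, and dict.fromkeys is used on the assumption that the first occurrence of each word should be kept in order, as in the array_distinct output shown above.

```python
def dedupe_words(text: str) -> str:
    """Lowercase, split on spaces, drop repeated words, join with spaces."""
    words = text.lower().split(" ")          # like split(lcase(text), ' ')
    unique = list(dict.fromkeys(words))      # keeps first occurrences in order
    return " ".join(unique)                  # like array_join(..., ' ')


print(dedupe_words("I like this Book and this book be DOWNLOADED on line"))
# prints: i like this book and be downloaded on line
```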
