python - pyspark dataframe: remove duplicates in an array column
Problem description
I would like to remove duplicated words in a column of a PySpark DataFrame.
This is based on "Remove duplicates from PySpark array column".
My Spark version:
2.4.5
Python 3 code:
import pyspark.sql.functions as F

# `spark` is an existing SparkSession
test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
t3 = test_df.withColumn("text", F.array("text"))  # have to convert it to array because the original large df is array type.
t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
t5 = t4.withColumn('text', F.array_distinct("text"))
t5.show(1, 120)
but got
+--------------------------------------------------------+
| text|
+--------------------------------------------------------+
|[i like this book and this book be downloaded on line]|
+--------------------------------------------------------+
I need to remove the duplicated words
"book" and "this".
It seems that array_distinct cannot filter them out?
Thanks.
Solution
You can use the Spark SQL functions lcase, split, array_distinct, and array_join, called through F.expr from pyspark.sql.functions.
For example: F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")
Here is the working code:
import pyspark.sql.functions as F

# Lowercase, split on spaces, drop repeated words, then join back into one string.
df.withColumn(
    "text_new",
    F.expr("array_join(array_distinct(split(lcase(text), ' ')), ' ')"),
).show(truncate=False)
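Explanation:
Here, you first convert everything to lowercase with lcase(text),
then split the string on spaces with split(text, ' '),
which produces
[i, like, this, book, and, this, book, be, downloaded, on, line]
Then you pass that to array_distinct,
which produces
[i, like, this, book, and, be, downloaded, on, line]
Finally, you join it back together with spaces using array_join:
i like this book and be downloaded on line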
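The steps above can also be traced without a Spark session. The following is only a plain-Python analogue, not Spark code: the helper name dedupe_words is hypothetical, and dict.fromkeys stands in for array_distinct on the assumption that first occurrences are kept in order, as in the output shown above. It also illustrates why the attempt in the question changed nothing: F.array("text") wraps the whole sentence as a single array element, so there are no repeated elements to drop.

```python
# Plain-Python analogue of the Spark SQL expression (illustration only, no Spark needed).

def dedupe_words(text):
    """Hypothetical helper mirroring lcase -> split -> array_distinct -> array_join."""
    words = text.lower().split(" ")        # lcase(text), then split(..., ' ')
    distinct = list(dict.fromkeys(words))  # stand-in for array_distinct: keeps first occurrences
    return " ".join(distinct)              # array_join(..., ' ')


sentence = "I like this Book and this book be DOWNLOADED on line"

# The question's approach: an array holding the whole sentence as one element.
single_element_array = [sentence.lower()]
print(list(dict.fromkeys(single_element_array)))  # still one element, nothing removed

# The answer's approach: split into words first, then drop repeats.
print(dedupe_words(sentence))
```

If the real column is already an array of sentences, the same word-level steps would have to be applied to each element rather than to the array as a whole.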