How to transform a DataFrame to add a column containing the list of strings found in another column

Problem description

Suppose I have a list of keywords in Scala

val keywords = List("pineapple", "lemon")

and a DataFrame like this

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+

How can I transform this DataFrame to add a new column containing the keywords that appear in Body? The result I am looking for is

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+

Tags: scala, apache-spark

Solution


Check the code below.

Create the DataFrame with the required sample data.

scala> val df = Seq(
      (123,"I contain both keywords pineapple and lemon"),
      (456,"I sadly don't contain anything"),
      (789,"Pineapple's are delicious")).toDF("id","body")

df: org.apache.spark.sql.DataFrame = [id: int, body: string]
scala> val keywords = List("pineapple", "lemon")
keywords: List[String] = List(pineapple, lemon)

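Use typedLit to add keywords to the DataFrame as an array column, then use the filter higher-order function to keep only the keywords that occur in the lower-cased body column.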
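The per-row check can be sketched in plain Scala, without Spark, to make the predicate explicit. This is only an illustration: the object name KeywordFilterSketch and the helper containedKeywords are hypothetical and not part of the original answer.

```scala
// Plain-Scala sketch of the per-row logic (no Spark required).
// KeywordFilterSketch and containedKeywords are hypothetical names.
object KeywordFilterSketch {

  // Keep only the keywords that occur in the lower-cased body, mirroring
  // filter(keywords, keyword -> instr(lower(body), keyword) > 0).
  def containedKeywords(body: String, keywords: List[String]): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter(keyword => lowered.contains(keyword))
  }

  def main(args: Array[String]): Unit = {
    val keywords = List("pineapple", "lemon")
    val bodies = List(
      "I contain both keywords pineapple and lemon",
      "I sadly don't contain anything",
      "Pineapple's are delicious"
    )
    // Print each body next to the keywords found in it.
    bodies.foreach(b => println(s"$b -> ${containedKeywords(b, keywords)}"))
  }
}
```

Only the body is lower-cased, so a keyword that itself contains upper-case letters would never match; the keywords in the list are assumed to be lower-case already. The Spark snippet that follows applies the same predicate to every row.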

scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)

Final output

+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
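The output above still carries the helper keywords column, whereas the question asks only for the ID, Body and Contains_Keywords columns. A minimal sketch of one way to get there is to drop the helper column after filtering. It assumes Spark 2.4 or later (where the SQL filter higher-order function is available) and a local session; the object name ContainsKeywordsSketch and the helper addContainsKeywords are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, typedLit}

// Hypothetical wrapper object; not part of the original answer.
object ContainsKeywordsSketch {

  // Adds Contains_Keywords, then drops the temporary helper column.
  // Assumes df has a string column named "body", no existing column named
  // "keywords", and that the keywords are already lower-case.
  def addContainsKeywords(df: DataFrame, keywords: List[String]): DataFrame =
    df.withColumn("keywords", typedLit(keywords))
      .withColumn(
        "Contains_Keywords",
        expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)"))
      .drop("keywords")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-keywords-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (123, "I contain both keywords pineapple and lemon"),
      (456, "I sadly don't contain anything"),
      (789, "Pineapple's are delicious")
    ).toDF("id", "body")

    addContainsKeywords(df, List("pineapple", "lemon")).show(false)
    spark.stop()
  }
}
```

The result has the columns id, body and Contains_Keywords, matching the shape requested in the question.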
