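scala - How to transform a DataFrame to add a column listing the strings contained in another column
Problem description
Suppose I have a list of keywords in Scala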
val keywords = List("pineapple", "lemon")
and a DataFrame like this
+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+
How can I transform this DataFrame to add a new column holding the keywords that appear in Body? The result I'm looking for is
+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+
Solution
Check the code below.
Create the DataFrame with the required sample data.
scala> val df = Seq(
(123,"I contain both keywords pineapple and lemon"),
(456,"I sadly don't contain anything"),
(789,"Pineapple's are delicious")).toDF("id","body")
df: org.apache.spark.sql.DataFrame = [id: int, body: string]
scala> val keywords = List("pineapple", "lemon")
keywords: List[String] = List(pineapple, lemon)
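Use typedLit to add keywords to the DataFrame as an array column, then use the filter higher-order function (available in Spark SQL since 2.4) to keep each keyword that occurs in the body column. Lowercasing body makes the match case-insensitive, which is why "Pineapple's" matches pineapple.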
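The per-row logic can be sketched with plain Scala collections, without Spark. This is only an illustration of what the filter expression computes for each body value; the object name KeywordFilterSketch and the helper containedKeywords are hypothetical, and the keywords are assumed to be lowercase already, as in the question.

```scala
object KeywordFilterSketch {
  // Assumption: keywords are already lowercase, as in the question.
  val keywords: List[String] = List("pineapple", "lemon")

  // Keep every keyword that occurs as a substring of the lowercased body.
  // This mirrors instr(lower(body), keyword) > 0 in the Spark expression.
  def containedKeywords(body: String): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter(keyword => lowered.contains(keyword))
  }

  def main(args: Array[String]): Unit = {
    println(containedKeywords("I contain both keywords pineapple and lemon")) // List(pineapple, lemon)
    println(containedKeywords("I sadly don't contain anything"))              // List()
    println(containedKeywords("Pineapple's are delicious"))                   // List(pineapple)
  }
}
```

Note that this is a substring match, not a whole-word match: a body containing "pineapples" would also yield pineapple, and the same holds for instr in the Spark version.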
scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)
Final output
+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
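The result requested in the question has no keywords column, so the helper column can be dropped once Contains_Keywords has been computed. Below is a minimal sketch as a standalone program, assuming Spark 2.4 or later on the classpath. The object name ContainsKeywordsSketch and the helper addContainedKeywords are hypothetical, and the helper assumes the input has a string column named body and no existing column named keywords.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, typedLit}

object ContainsKeywordsSketch {
  // Hypothetical helper: adds Contains_Keywords, then drops the temporary keywords column.
  // Assumes df has a string column "body" and no existing column named "keywords".
  def addContainedKeywords(df: DataFrame, keywords: Seq[String]): DataFrame =
    df
      // Lowercase the keywords because they are compared against lower(body).
      .withColumn("keywords", typedLit(keywords.map(_.toLowerCase)))
      .withColumn(
        "Contains_Keywords",
        expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)"))
      .drop("keywords")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-keywords")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (123, "I contain both keywords pineapple and lemon"),
      (456, "I sadly don't contain anything"),
      (789, "Pineapple's are delicious")
    ).toDF("id", "body")

    addContainedKeywords(df, Seq("pineapple", "lemon")).show(false)
    spark.stop()
  }
}
```

Lowercasing the keywords inside the helper is a design choice rather than part of the original answer: without it, a keyword containing uppercase letters could never match the lowercased body.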