Filtering a DataFrame with a list of tuples in Spark Scala

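Problem description

I am trying to filter a DataFrame in Scala by comparing two of its columns (subject and stream in this case) against a list of tuples. A row should be kept when its (subject, stream) values equal one of the tuples.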

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

Sample input:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
|  3| Katie|  Maths|Commerce|
|  4| Linda|History| Science|
+---+------+-------+--------+

The list of tuples to filter the above df with:

val listOfTuples = List[(String, String)](
  ("Maths", "Science"),
  ("History", "Commerce")
)

Expected result:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
+---+------+-------+--------+

Tags: scala, apache-spark

Solution


You can do this using isin on a struct of the two columns (requires Spark 2.2+):

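import org.apache.spark.sql.functions.{struct, typedLit}
// Keep rows whose (subject, stream) pair equals one of the tuples
val df_filtered = df
  .where(struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))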

Or with a leftsemi join:

val df_filtered = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
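For anyone who wants to try both approaches outside the spark-shell, here is a minimal self-contained sketch. The object name TupleFilterSketch, the helper names filterWithIsin and filterWithSemiJoin, and the local SparkSession settings are my own assumptions for illustration and are not part of the original answer. The final select in the join variant is a precaution: a join on a column-name list may put the join keys first in the output, so it restores the original column order. Row order after either filter is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, typedLit}

object TupleFilterSketch {
  // Local session for illustration only; app name and master are assumptions.
  val spark: SparkSession = SparkSession.builder()
    .appName("tuple-filter-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Same sample data as in the question.
  val df: DataFrame = Seq(
    (0, "Mark", "Maths", "Science"),
    (1, "Tyson", "History", "Commerce"),
    (2, "Gerald", "Maths", "Science"),
    (3, "Katie", "Maths", "Commerce"),
    (4, "Linda", "History", "Science")
  ).toDF("id", "name", "subject", "stream")

  val listOfTuples: List[(String, String)] = List(
    ("Maths", "Science"),
    ("History", "Commerce")
  )

  // Approach 1: compare a struct of the two columns against tuple literals.
  def filterWithIsin(data: DataFrame, pairs: Seq[(String, String)]): DataFrame =
    data.where(
      struct(col("subject"), col("stream")).isin(pairs.map(typedLit(_)): _*)
    )

  // Approach 2: leftsemi join against a small DataFrame built from the tuples.
  def filterWithSemiJoin(data: DataFrame, pairs: Seq[(String, String)]): DataFrame =
    data
      .join(pairs.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
      // Restore the original column order in case the join keys were moved first.
      .select(data.columns.map(col): _*)

  def main(args: Array[String]): Unit = {
    filterWithIsin(df, listOfTuples).show()
    filterWithSemiJoin(df, listOfTuples).show()
    spark.stop()
  }
}
```

The isin variant keeps everything in one expression, which is convenient for a short list. The join variant may be the better fit when the list of tuples is large, since it is handled as a regular join rather than as a long list of literals.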
