Removing duplicates from a Spark SQL join of two DataFrames

Problem Description

I have two Spark DataFrames that I need to join. Only the values from df2 that are present in df1 should be selected, and no rows should be repeated.

For example:

df1:

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
| 202003101730|   202003101700|1930860788|
| 202003101730|   202003101600|    101713|
| 202003101720|   202003101700|1261542412|
| 202003101720|   202003101600|   1824155|
| 202003101710|   202003101700| 912601761|
+-------------+---------------+----------+

df2:

+-------------+---------------+
|a            |b              |
+-------------+---------------+
| 202003101800|   202003101700|
| 202003101800|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101740|   202003101700|
| 202003101740|   202003101700|
+-------------+---------------+
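For reference, a minimal sketch that rebuilds these two DataFrames (the column names and values come from the tables above; a spark-shell style session with spark.implicits._ in scope is assumed):

// Rebuild the sample data; the L suffix makes val a long, matching the schema below.
import spark.implicits._

val df1 = Seq(
  ("202003101750", "202003101700", 1712384842L),
  ("202003101740", "202003101700", 1590554927L),
  ("202003101730", "202003101700", 1930860788L),
  ("202003101730", "202003101600", 101713L),
  ("202003101720", "202003101700", 1261542412L),
  ("202003101720", "202003101600", 1824155L),
  ("202003101710", "202003101700", 912601761L)
).toDF("a", "b", "val")

val df2 = Seq(
  ("202003101800", "202003101700"),
  ("202003101800", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101740", "202003101700"),
  ("202003101740", "202003101700")
).toDF("a", "b")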

I am doing the following:

df1.join(df2, Seq("a", "b"), "leftouter").where(col("val").isNotNull)

However, my output contains several duplicated rows:

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
+-------------+---------------+----------+

If val is dropped from df1, I am trying to achieve something like the following, but except does not seem to work:

df1.drop(col("val")).except(df2)

The schema of df1 is as follows:

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- val: long (nullable = true)

Also, what exactly is the difference between a left outer join and except? Expected output:

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
+-------------+---------------+----------+

Tags: scala, apache-spark, apache-spark-sql

Solution


A left outer join takes all rows from the left table and the matching rows from the right table.

except returns the rows of the first DataFrame that are not present in the second DataFrame, with duplicates removed.
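To make that concrete, a small sketch on the question's data (row order in the output may vary):

// except needs matching schemas, so val must be dropped first.
// The result is the set of df1 keys with NO match in df2 (the opposite
// of the expected output), and the val column is lost along the way.
df1.drop("val").except(df2).show()
//+------------+------------+
//|           a|           b|
//+------------+------------+
//|202003101730|202003101700|
//|202003101730|202003101600|
//|202003101720|202003101700|
//|202003101720|202003101600|
//|202003101710|202003101700|
//+------------+------------+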

For your case, you can use an inner (or) outer join together with dropDuplicates. Since df2 contains each matching (a, b) pair several times, the join emits one output row per match; dropDuplicates then collapses those copies.

df1.join(df2, Seq("a", "b"), "inner").dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+

df1.join(df2, Seq("a", "b"), "rightouter").where(col("val").isNotNull).dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
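A left semi join is another option worth knowing here: it keeps the df1 rows whose (a, b) pair exists in df2 and returns each matching df1 row at most once, so the duplicates in df2 never multiply the output and no dropDuplicates call is needed. A sketch on the same data:

// left_semi filters df1 by key existence in df2; df1's columns (including
// val) are kept, and df2 contributes nothing but the match test.
df1.join(df2, Seq("a", "b"), "left_semi").show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101750|202003101700|1712384842|
//|202003101740|202003101700|1590554927|
//+------------+------------+----------+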
