How to retrieve the records in DataFrame 1 that are not present in DataFrame 2, in Scala

Problem Description

I have two DataFrames, as shown below.

DF1 Contents

+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185392677asdasd12321312321 |999999000185392677|
|999999000185392677asdasd12321312321 |999999000185392677|
|999999000185392677asdasd12321312321 |999999000185392677|
|be1e63ce-cdf6-407d-abf3-f818e0872e92|999999000185254510|
|048022cc-9c26-4c0d-a9a8-551f4a364510|999999000185298297|
|cd66629d-14db-42df-a558-49e78c3ae320|999999000185320831|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|d2824085-65d3-432f-a4dd-73e31453733a|999999000185266094|
|ebfde7dc-9352-42d4-816b-d2f01653c1c9|999999000185266027|
|dc8b5731-8d1a-4394-ae9d-f74098462be4|999999000185250909|
|9c642932-7a95-4bfe-ae75-687af9151fc8|990000000061356494|
|6469d0dd-0d1d-454b-96f3-87ea9de6db29|999999000185048192|
+------------------------------------+------------------+

DF2 Contents

+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|6b65a7c7-1c88-4aa8-9a22-ae8d17d4b276|990000000061357568|
|d713ed24-cbc0-4880-89ad-cabbd65e57f2|999999000184600448|
|7c8996fc-84a4-4cf0-a429-7c809281a7cc|999999000184649344|
|fdf784ee-ba8f-4efb-ab6e-41aa483b6b70|999999000184709120|
|6469d0dd-0d1d-454b-96f3-87ea9de6db29|999999000185048192|
|5b240d5a-c76e-4a27-aaaf-781250e2beda|999999000185064192|
|0cee0936-b0e7-4331-abdb-6ab388402d0b|999999000185200256|
|33d89b0f-2ad2-43aa-82f3-730d44e03b36|999999000185200384|
|9934f51e-fc31-4f2c-915b-fd47eab029a2|999999000185206656|
|75a94671-7baf-4237-927c-b713efe10412|999999000185216128|
|29d362df-bae8-41f0-b9bd-bbd4a386b480|999999000185216256|
|95a909c5-3d9d-4c95-a567-0e296761a8e2|999999000185217920|
|cd07591c-cda2-4900-914f-8b06d39f9357|999999000185252992|
|2f2eb612-484b-4b5b-9d6f-068a689a4738|999999000185258368|
|3bef0390-6540-4105-be5d-e8978d4414b8|999999000185271168|
|09d16ad0-50db-4f32-b98b-45c848804073|999999000185274880|
|037dbce6-bb13-4404-88af-a855216e2946|999999000185306112|
|efe3e3fd-1d3f-4d41-9c9c-863c04b7d94d|999999000185307136|
|1e18f1d8-cf34-49f4-aeb9-42c00baddd90|999999000185417856|
|b999ef86-6118-4560-8d5f-157882dc1bfc|999999000185456512|
|ebfde7dc-9352-42d4-816b-d2f01653c1c9|999999000185266027|
|999999000185386838asdasd12321312321 |999999000185386838|
+------------------------------------+------------------+

I need only the records from DF1 that do not exist in DF2 — in other words, keep only the DF1-only rows and ignore the common records as well as the DF2-only records.

The final output should look like this:

[REQ_ID                              ,PRS_ID]
[048022cc-9c26-4c0d-a9a8-551f4a364510,999999000185298297]
[999999000185392677asdasd12321312321,999999000185392677]
[999999000185425636asdasd12321312321,999999000185425636]
[9c642932-7a95-4bfe-ae75-687af9151fc8,990000000061356494]
[be1e63ce-cdf6-407d-abf3-f818e0872e92,999999000185254510]
[cd66629d-14db-42df-a558-49e78c3ae320,999999000185320831]
[d2824085-65d3-432f-a4dd-73e31453733a,999999000185266094]
[dc8b5731-8d1a-4394-ae9d-f74098462be4,999999000185250909]

Please help as soon as possible; thanks in advance for your help.
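For anyone who wants to reproduce this locally, here is a minimal sketch that builds two small DataFrames with the same REQ_ID / PRS_ID columns (illustrative rows only; it assumes the Spark 2.x SparkSession API — on Spark 1.6 the implicits would come from a SQLContext instead):

import org.apache.spark.sql.SparkSession

// Hypothetical local session, just for experimenting with the two DataFrames.
val spark = SparkSession.builder().appName("df1-minus-df2").master("local[*]").getOrCreate()
import spark.implicits._

// A few illustrative rows taken from the tables above.
val df1 = Seq(
  ("999999000185425636asdasd12321312321", "999999000185425636"),
  ("be1e63ce-cdf6-407d-abf3-f818e0872e92", "999999000185254510"),
  ("6469d0dd-0d1d-454b-96f3-87ea9de6db29", "999999000185048192")
).toDF("REQ_ID", "PRS_ID")

val df2 = Seq(
  ("6469d0dd-0d1d-454b-96f3-87ea9de6db29", "999999000185048192"),
  ("6b65a7c7-1c88-4aa8-9a22-ae8d17d4b276", "990000000061357568")
).toDF("REQ_ID", "PRS_ID")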

Tags: scala, spark-dataframe

Solution


There is an except function that should meet your requirement. Just do:

df1.except(df2)

and you will get:

+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|048022cc-9c26-4c0d-a9a8-551f4a364510|999999000185298297|
|d2824085-65d3-432f-a4dd-73e31453733a|999999000185266094|
|9c642932-7a95-4bfe-ae75-687af9151fc8|990000000061356494|
|999999000185425636asdasd12321312321 |999999000185425636|
|cd66629d-14db-42df-a558-49e78c3ae320|999999000185320831|
|dc8b5731-8d1a-4394-ae9d-f74098462be4|999999000185250909|
|be1e63ce-cdf6-407d-abf3-f818e0872e92|999999000185254510|
|999999000185392677asdasd12321312321 |999999000185392677|
+------------------------------------+------------------+
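An equivalent way to express "rows of df1 that have no match in df2" is a join on the two key columns. The sketch below is only an illustration: the "leftanti" join type needs Spark 2.0+, while the left-outer-join-plus-null-filter variant also works on 1.6. Note that, unlike except, a join does not deduplicate the surviving df1 rows.

// Anti-join variant (Spark 2.0+): keep df1 rows with no matching (REQ_ID, PRS_ID) in df2.
val onlyInDf1 = df1.join(df2, Seq("REQ_ID", "PRS_ID"), "leftanti")

// Spark 1.6 variant: left outer join against renamed df2 columns, then keep the unmatched rows.
val df2Renamed = df2.withColumnRenamed("REQ_ID", "REQ_ID_2").withColumnRenamed("PRS_ID", "PRS_ID_2")
val onlyInDf1_16 = df1
  .join(df2Renamed,
        df1("REQ_ID") === df2Renamed("REQ_ID_2") && df1("PRS_ID") === df2Renamed("PRS_ID_2"),
        "left_outer")
  .filter(df2Renamed("REQ_ID_2").isNull)
  .select(df1("REQ_ID"), df1("PRS_ID"))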

dropDuplicates is an expensive operation because it triggers a shuffle. But if you want to guard against duplicates, you can do:

df1.except(df2).dropDuplicates("REQ_ID", "PRS_ID")

The OP is on Spark 1.6.2, so the dropDuplicates call above does not work there; it fails with the following error:

val df3 = df1.except(df2).dropDuplicates("REQ_ID","PRS_ID")
:35: error: overloaded method value dropDuplicates with alternatives:
  (colNames: Array[String])org.apache.spark.sql.DataFrame
  (colNames: Seq[String])org.apache.spark.sql.DataFrame
  ()org.apache.spark.sql.DataFrame
 cannot be applied to (String, String)
       val df3 = df1.except(df2).dropDuplicates("REQ_ID","PRS_ID")

Here is the related improvement ticket: SPARK-15807.

So on that version you should use:

df1.except(df2).dropDuplicates(Seq("REQ_ID", "PRS_ID"))
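Putting it together for the OP's Spark 1.6.2 setup (variable names df1, df2, df3 as above):

// Rows of df1 that are not present in df2; except already returns distinct rows
// (SQL EXCEPT semantics), so dropDuplicates here is mainly a safeguard.
val df3 = df1.except(df2).dropDuplicates(Seq("REQ_ID", "PRS_ID"))
df3.show(false)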
