首页 > 解决方案 > Spark Dataframe Union 提供重复项

问题描述

我有一个基本数据集,其中一列具有空值而不是空值。所以我这样做:

val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)

当我打印出来时,我清楚地分开了行。但当我这样做时,

val combined_ds = nonTrained_ds.union(trained_ds)

我从 得到重复的行记录,nonTrained_ds奇怪的是,来自trained_ds的行不再在组合 ds 中。

为什么会这样?

的值为trained_ds

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700001|16              |
|0456700004|16              |
|0456700007|16              |
|0456700010|16              |
|0456700013|16              |
|0456700016|16              |
|0456700019|16              |
|0456700022|16              |
|0456700025|16              |
|0456700028|16              |
|0456700031|16              |
|0456700034|16              |
|0456700037|16              |
|0456700040|16              |
|0456700043|16              |
|0456700046|16              |
|0456700049|16              |
|0456700052|16              |
|0456700055|16              |
|0456700058|16              |
|0456700061|16              |
|0456700064|16              |
|0456700067|16              |
|0456700070|16              |
+----------+----------------+

的值为nonTrained_ds

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700002|null            |
|0456700003|null            |
|0456700005|null            |
|0456700006|null            |
|0456700008|null            |
|0456700009|null            |
|0456700011|null            |
|0456700012|null            |
|0456700014|null            |
|0456700015|null            |
|0456700017|null            |
|0456700018|null            |
|0456700020|null            |
|0456700021|null            |
|0456700023|null            |
|0456700024|null            |
|0456700026|null            |
|0456700027|null            |
|0456700029|null            |
|0456700030|null            |
|0456700032|null            |
|0456700033|null            |
|0456700035|null            |
|0456700036|null            |
|0456700038|null            |
|0456700039|null            |
|0456700041|null            |
|0456700042|null            |
|0456700044|null            |
|0456700045|null            |
|0456700047|null            |
|0456700048|null            |
|0456700050|null            |
|0456700051|null            |
|0456700053|null            |
|0456700054|null            |
|0456700056|null            |
|0456700057|null            |
|0456700059|null            |
|0456700060|null            |
|0456700062|null            |
|0456700063|null            |
|0456700065|null            |
|0456700066|null            |
|0456700068|null            |
|0456700069|null            |
|0456700071|null            |
|0456700072|null            |
+----------+----------------+

组合 ds 的值为:

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700002|null            |
|0456700003|null            |
|0456700005|null            |
|0456700006|null            |
|0456700008|null            |
|0456700009|null            |
|0456700011|null            |
|0456700012|null            |
|0456700014|null            |
|0456700015|null            |
|0456700017|null            |
|0456700018|null            |
|0456700020|null            |
|0456700021|null            |
|0456700023|null            |
|0456700024|null            |
|0456700026|null            |
|0456700027|null            |
|0456700029|null            |
|0456700030|null            |
|0456700032|null            |
|0456700033|null            |
|0456700035|null            |
|0456700036|null            |
|0456700038|null            |
|0456700039|null            |
|0456700041|null            |
|0456700042|null            |
|0456700044|null            |
|0456700045|null            |
|0456700047|null            |
|0456700048|null            |
|0456700050|null            |
|0456700051|null            |
|0456700053|null            |
|0456700054|null            |
|0456700056|null            |
|0456700057|null            |
|0456700059|null            |
|0456700060|null            |
|0456700062|null            |
|0456700063|null            |
|0456700065|null            |
|0456700066|null            |
|0456700068|null            |
|0456700069|null            |
|0456700071|null            |
|0456700072|null            |
|0456700002|16              |
|0456700005|16              |
|0456700008|16              |
|0456700011|16              |
|0456700014|16              |
|0456700017|16              |
|0456700020|16              |
|0456700023|16              |
|0456700026|16              |
|0456700029|16              |
|0456700032|16              |
|0456700035|16              |
|0456700038|16              |
|0456700041|16              |
|0456700044|16              |
|0456700047|16              |
|0456700050|16              |
|0456700053|16              |
|0456700056|16              |
|0456700059|16              |
|0456700062|16              |
|0456700065|16              |
|0456700068|16              |
|0456700071|16              |
+----------+----------------+

标签: scalaapache-spark

解决方案


这成功了,

val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
    val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()

推荐阅读