scala - Spark Dataframe Union 提供重复项
问题描述
我有一个基本数据集,其中一列具有空值而不是空值。所以我这样做:
val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)
当我打印出来时,我清楚地分开了行。但当我这样做时,
val combined_ds = nonTrained_ds.union(trained_ds)
我从 得到重复的行记录,nonTrained_ds
奇怪的是,来自trained_ds
的行不再在组合 ds 中。
为什么会这样?
的值为trained_ds
:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700001|16 |
|0456700004|16 |
|0456700007|16 |
|0456700010|16 |
|0456700013|16 |
|0456700016|16 |
|0456700019|16 |
|0456700022|16 |
|0456700025|16 |
|0456700028|16 |
|0456700031|16 |
|0456700034|16 |
|0456700037|16 |
|0456700040|16 |
|0456700043|16 |
|0456700046|16 |
|0456700049|16 |
|0456700052|16 |
|0456700055|16 |
|0456700058|16 |
|0456700061|16 |
|0456700064|16 |
|0456700067|16 |
|0456700070|16 |
+----------+----------------+
的值为nonTrained_ds
:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
+----------+----------------+
组合 ds 的值为:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
|0456700002|16 |
|0456700005|16 |
|0456700008|16 |
|0456700011|16 |
|0456700014|16 |
|0456700017|16 |
|0456700020|16 |
|0456700023|16 |
|0456700026|16 |
|0456700029|16 |
|0456700032|16 |
|0456700035|16 |
|0456700038|16 |
|0456700041|16 |
|0456700044|16 |
|0456700047|16 |
|0456700050|16 |
|0456700053|16 |
|0456700056|16 |
|0456700059|16 |
|0456700062|16 |
|0456700065|16 |
|0456700068|16 |
|0456700071|16 |
+----------+----------------+
解决方案
这成功了,
val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()
推荐阅读
- google-cloud-platform - GCP:到 VM 公共 IP 的流量是否在单个区域内加密?
- python - 对每列不同行数的张量求和
- c# - 如何使用 moq 在单元测试中模拟 Nhibernate .ToListAsync()?
- javascript - asp.net MVC 从文本框 OnChange 事件更新 TextBoxFor 值
- c++ - 多线程 Mandelbrot 集代码输出不正确的图像
- javascript - 将数组值插入另一个数组Javascript的随机位置
- firebase - Firebase 部署错误。服务器错误。客户端网络套接字在建立安全 TLS 连接之前断开
- multithreading - 在为 Runnable 提供 lambda 时,为什么我不必重写 run 方法?
- reactjs - 基于反应原生中TextInput的变化渲染视图
- c++ - 在构建编译器时出现错误“int 之前的预期初始化程序”一直在尝试学习 c++,我只是一直卡住