scala - Spark DataFrame join shows unexpected result: 0 rows
Question
I am using spark-1.6.0 and want to join 2 DataFrames. They appear in the YARN logs as shown below.
df_train_raw
df_user_clicks_info
I tried to inner-join them with this code:
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))
df_tmp_tmp_0.show()
But the result I get is completely empty!
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
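I don't know why. Nothing here looks wrong to me. I hope someone can help. Thanks!
After two friends suggested checking for whitespace in the key, I tried again: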
df_train_raw
————————————
+------------+-----------+-----+
|subscriberid| objectid|label|
+------------+-----------+-----+
| 104752237|11029932485| 0|
| 105246837|11029932485| 0|
| 105517237|11029932485| 0|
| 108917037|11030797988| 0|
| 108917037|11029648595| 0|
| 109901037|11029648595| 0|
| 105517237|11030720502| 0|
| 105246837|11029986502| 0|
| 104752237|11029191717| 0|
| 105246837|11029191717| 0|
| 105517237|11029191717| 0|
| 109901037|11030138623| 0|
| 105517237|11014105538| 0|
| 105517237|11014105543| 0|
| 105517237|11016478156| 0|
| 105517237|11023285357| 0|
| 105246837|11026067980| 0|
| 105246837|11030797988| 0|
| 108917037|11029932485| 0|
| 109901037|11029932485| 0|
+------------+-----------+-----+
only showing top 20 rows
————————————
root
|-- subscriberid: long (nullable = true)
|-- objectid: long (nullable = true)
|-- label: integer (nullable = true)
I also printed the "subscriberid" column, which shows no whitespace:
df_train_raw.select("subscriberid").take(20).foreach(println)
Result:
[104752237]
[105246837]
[105517237]
[108917037]
[108917037]
[109901037]
[105517237]
[105246837]
[104752237]
[105246837]
[105517237]
[109901037]
[105517237]
[105517237]
[105517237]
[105517237]
[105246837]
[105246837]
[108917037]
[109901037]
And here is df_user_clicks_info:
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 104752237| 1.71| 0| 0| 0| 4| 4| 4| 0.8| 0| 0| 0| 0| 4| 0| 4.0| 0| 0| 0| 4| 0| 4|
| 105517237| 17.14| 12| 36| 12| 0| 60| 0| 9.6| 0| 0| 0| 0| 48| 0| 36.0| 12| 36| 12| 0| 12| 0|
| 109901037| 2.14| 0| 3| 3| 6| 3| 0| 2.4| 0| 0| 3| 6| 3| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 105246837| 8.0| 8| 0| 0| 16| 32| 0| 8.0| 8| 0| 0| 8| 24| 0| 8.0| 0| 0| 0| 8| 8| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
————————————
root
|-- subscriberid: string (nullable = true)
|-- user_clicks_avg_everyday_a_week: double (nullable = false)
|-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
|-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
|-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
|-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
|-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
|-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
|-- user_clicks_avg_everyday_weekday: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
|-- user_clicks_avg_everyday_weekdend: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)
df_user_clicks_info.select("subscriberid").take(20).foreach(println)
[104752237]
[105517237]
[109901037]
[105246837]
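The join still did not work :(
Solution
Thanks to the friends who helped me. I think this is a bug in Spark 1.6.0, and I worked around it by changing my data processing rather than upgrading Spark. Originally I wanted to derive df_3 from df_1 and df_2, but because of the problem described in the question I did not get the result I wanted. So I tried a different route: I built df_tmp_1 and df_tmp_2 first, then joined them, and got the expected result. I still don't know why this works, but if you are on Spark 1.6.0 and hit a join problem like mine, it seems worth trying.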
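The answer does not say how df_tmp_1 and df_tmp_2 were built, so the following is only a sketch of one way to build intermediate frames before joining, not the author's actual change. It relies on one thing visible in the printed schemas: subscriberid is long in df_train_raw but string in df_user_clicks_info. The helper normalizeKey is hypothetical, and whether this avoids the behaviour seen on Spark 1.6.0 is not confirmed by the post.

```scala
// Intended for spark-shell on Spark 1.6.x, where df_train_raw and
// df_user_clicks_info from the question already exist.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical helper (not from the original post): returns a new frame whose
// join key is trimmed of surrounding whitespace and cast to long, so that both
// sides of the join carry the key with the same type.
def normalizeKey(df: DataFrame, key: String): DataFrame =
  df.withColumn(key, trim(col(key).cast("string")).cast("long"))

// Intermediate frames, named after the df_tmp_1 / df_tmp_2 in the answer.
val df_tmp_1 = normalizeKey(df_train_raw, "subscriberid")
val df_tmp_2 = normalizeKey(df_user_clicks_info, "subscriberid")

// Check that both keys now report the same type before joining.
df_tmp_1.printSchema()
df_tmp_2.printSchema()

// Inner join on the shared key column.
val df_joined = df_tmp_1.join(df_tmp_2, Seq("subscriberid"))
df_joined.show()
```

If the join is still empty after this, comparing `df_tmp_1.select("subscriberid").distinct()` against the same selection on df_tmp_2 is a simple way to see whether any key values actually overlap.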