Spark DataFrame join shows unexpected results: 0 rows

Question

I am using spark-1.6.0 and I want to join two DataFrames. They appear in the YARN logs as shown below.

df_train_raw

df_user_clicks_info

I tried to inner join them with this code:

val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))

df_tmp_tmp_0.show()

But the result contains no rows at all:

+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+

I have no idea why. Nothing looks wrong here. I hope someone can help. Thanks!


After two people suggested that whitespace in the keys might be the cause, I tried again:

df_train_raw
————————————

+------------+-----------+-----+
|subscriberid|   objectid|label|
+------------+-----------+-----+
|   104752237|11029932485|    0|
|   105246837|11029932485|    0|
|   105517237|11029932485|    0|
|   108917037|11030797988|    0|
|   108917037|11029648595|    0|
|   109901037|11029648595|    0|
|   105517237|11030720502|    0|
|   105246837|11029986502|    0|
|   104752237|11029191717|    0|
|   105246837|11029191717|    0|
|   105517237|11029191717|    0|
|   109901037|11030138623|    0|
|   105517237|11014105538|    0|
|   105517237|11014105543|    0|
|   105517237|11016478156|    0|
|   105517237|11023285357|    0|
|   105246837|11026067980|    0|
|   105246837|11030797988|    0|
|   108917037|11029932485|    0|
|   109901037|11029932485|    0|
+------------+-----------+-----+
only showing top 20 rows

————————————

root
 |-- subscriberid: long (nullable = true)
 |-- objectid: long (nullable = true)
 |-- label: integer (nullable = true)

I also printed the "subscriberid" column, which shows that the values contain no spaces.

df_train_raw.select("subscriberid").take(20).foreach(println)

Result:

[104752237]
[105246837]
[105517237]
[108917037]
[108917037]
[109901037]
[105517237]
[105246837]
[104752237]
[105246837]
[105517237]
[109901037]
[105517237]
[105517237]
[105517237]
[105517237]
[105246837]
[105246837]
[108917037]
[109901037]

And the same for df_user_clicks_info:

+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|   104752237|                           1.71|                              0|                               0|                                0|                                4|                                4|                               4|                             0.8|                               0|                                0|                                 0|                                 0|                                 4|                                0|                              4.0|                                0|                                 0|                                  0|                                  4|                                  0|                                 4|
|   105517237|                          17.14|                             12|                              36|                               12|                                0|                               60|                               0|                             9.6|                               0|                                0|                                 0|                                 0|                                48|                                0|                             36.0|                               12|                                36|                                 12|                                  0|                                 12|                                 0|
|   109901037|                           2.14|                              0|                               3|                                3|                                6|                                3|                               0|                             2.4|                               0|                                0|                                 3|                                 6|                                 3|                                0|                              1.5|                                0|                                 3|                                  0|                                  0|                                  0|                                 0|
|   105246837|                            8.0|                              8|                               0|                                0|                               16|                               32|                               0|                             8.0|                               8|                                0|                                 0|                                 8|                                24|                                0|                              8.0|                                0|                                 0|                                  0|                                  8|                                  8|                                 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+

————————————

root
 |-- subscriberid: string (nullable = true)
 |-- user_clicks_avg_everyday_a_week: double (nullable = false)
 |-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
 |-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
 |-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
 |-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
 |-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
 |-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
 |-- user_clicks_avg_everyday_weekday: double (nullable = false)
 |-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
 |-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
 |-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
 |-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
 |-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
 |-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
 |-- user_clicks_avg_everyday_weekdend: double (nullable = false)
 |-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
 |-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
 |-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
 |-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
 |-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
 |-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)


df_user_clicks_info.select("subscriberid").take(20).foreach(println)


[104752237]
[105517237]
[109901037]
[105246837]

The join still did not work :(

Tags: scala, apache-spark, join, apache-spark-sql

Answer


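Thanks to everyone who helped. I think this is a bug in Spark 1.6.0, and I solved it by changing my data processing rather than upgrading Spark. At first I wanted to build df_3 from df_1 and df_2, but because of the error described in the question I did not get the result I wanted. So I took a different route to produce df_tmp_1 and df_tmp_2, then joined them and got the expected result. I do not know why this works either, but if you are on Spark 1.6.0 and hit a join problem like mine, it seems worth trying.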
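One detail in the question is cheap to rule out before rebuilding the pipeline: the printed schemas show `subscriberid` as `long` in `df_train_raw` and as `string` in `df_user_clicks_info`. Whether that mismatch caused the empty join is not established in this thread. The sketch below is a diagnostic under assumptions, not the method the answer used. The object name `JoinKeyCheck`, the local master, and the two-row stand-in DataFrames (values copied from the question's output) are all hypothetical. It counts string keys with surrounding whitespace, casts the key to `long`, and joins again.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, length, trim}

// Hypothetical standalone check; names and sample rows are illustrative only.
object JoinKeyCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally; on a cluster the master would come from spark-submit.
    val conf = new SparkConf().setAppName("JoinKeyCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for df_train_raw: subscriberid is a long, as in the question's schema.
    val dfTrainRaw = Seq(
      (104752237L, 11029932485L, 0),
      (105246837L, 11029932485L, 0)
    ).toDF("subscriberid", "objectid", "label")

    // Stand-in for df_user_clicks_info: subscriberid is a string, as in the question's schema.
    val dfUserClicksInfo = Seq(
      ("104752237", 1.71),
      ("105246837", 8.0)
    ).toDF("subscriberid", "user_clicks_avg_everyday_a_week")

    // Count string keys that carry leading or trailing whitespace.
    val padded = dfUserClicksInfo
      .filter(length(col("subscriberid")) > length(trim(col("subscriberid"))))
      .count()
    println(s"keys with surrounding whitespace: $padded")

    // Trim the string key and cast it to long so both sides share one type.
    val clicksTyped = dfUserClicksInfo
      .withColumn("subscriberid", trim(col("subscriberid")).cast("long"))

    // Same join form as in the question, now on keys of the same type.
    val joined = dfTrainRaw.join(clicksTyped, Seq("subscriberid"))
    joined.printSchema()
    joined.show()

    sc.stop()
  }
}
```

To try it against the real data, the body of `main` from the whitespace count onward can be pasted into `spark-shell` with the stand-ins replaced by the actual DataFrames. If the join is still empty after the keys are aligned, that would point back toward the behaviour described in the answer.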

