Create a new dataframe with info from different dataframes (Scala Spark)

Problem description

I have a dataframe with games and, for each game, three ratings coming from different reviewers. As you can see, each rating is referenced in its own dataframe:

Df_reviews
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |XX+    | K2    | L1     |
|FT      |Z-     | K1+   | L3     |

Df_rev1
+----------+-------------+ 
| review_1 | Equivalence | 
+----------+-------------+
|XX+       | 9           |      
|Y         | 6           |
|Z-        | 3           |   

Df_rev2
+----------+-------------+
| review_2 | Equivalence | 
+----------+-------------+
|K2        | 7           |      
|K1+       | 6           |
|K3        | 10          | 

Df_rev3
+----------+-------------+
| review_3 | Equivalence | 
+----------+-------------+
|L3        | 10          |      
|L2        | 9           |
|L1        | 8           | 

I need to translate each review into its rating (Equivalence) in a new dataframe, and add a column with the second-best rating. For this example the result would be:

Df_output
+--------+---------+---------+----------+-------------+
|Game    | rev_1_t | rev_2_t | rev_3_t  | second_best |
+--------+---------+---------+----------+-------------+
|CA      |  9      |  7      | 8        |     8       |
|FT      |  3      |  6      | 10       |     6       |

To do this I'm trying left joins, but I'm lost. How should I approach it?
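For reference, the sample dataframes above can be built with something like the following (a sketch: a SparkSession named spark is assumed, review codes are strings and equivalences are integers):

import spark.implicits._

val df_reviews = Seq(("CA", "XX+", "K2", "L1"), ("FT", "Z-", "K1+", "L3"))
    .toDF("Game", "rev_1", "rev_2", "rev_3")

val df_rev1 = Seq(("XX+", 9), ("Y", 6), ("Z-", 3)).toDF("review_1", "Equivalence")
val df_rev2 = Seq(("K2", 7), ("K1+", 6), ("K3", 10)).toDF("review_2", "Equivalence")
val df_rev3 = Seq(("L3", 10), ("L2", 9), ("L1", 8)).toDF("review_3", "Equivalence")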

###### Part two ###### How can I translate the columns of one dataframe into values from another dataframe, joining on several columns instead of just one? For example:

Df_reviews
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |XX+    | K2    | L1     |
|FT      |Z-     | K1+   | L3     |


Df_equiv
+--------+-------+
|Valorat | num   |
+--------+-------+
|X       |3      |
|XX+     |5      |
|Z       |7      |
|Z-      |6      |
|K1+     |6      |
|K2      |4      |
|L1      |5      |
|L2      |6      |
|L3      |7      |

Output
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |5      | 4     | 5      |
|FT      |6      | 6     | 7      |

As you can see, I'm trying something like this:

val joined = df_reviews
.join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
&& df_reviews("rev_3") === df_equiv("num"), "left")
.select(df_reviews("Game"), 
        df_equiv("num").as("rev_1_t"), 
        df_equiv("num").as("rev_2_t"), 
        df_equiv("num").as("rev_3_t")
       )

Thanks in advance!

Tags: scala, apache-spark, apache-spark-sql

Solution


You can do a few left joins and then get the second-highest value by sorting the translated ratings in descending order (a sort_array-style sort, implemented here with a small UDF):

val joined = df_reviews
    .join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
    .join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
    .join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
    .select(df_reviews("Game"), 
            df_rev1("Equivalence").as("rev_1_t"), 
            df_rev2("Equivalence").as("rev_2_t"), 
            df_rev3("Equivalence").as("rev_3_t")
           )

// UDF that sorts the translated ratings in descending order
// (no null ratings are expected after the joins for this sample data)
val array_sort_udf = udf((x: Seq[Int]) => x.sortBy(-_))

val result = joined.withColumn(
    "second_best",
    coalesce(
        array_sort_udf(
            array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
        )(1),
        greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
    )
)

result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
|  CA|      9|      7|      8|          8|
|  FT|      3|      6|     10|          6|
+----+-------+-------+-------+-----------+
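Equivalently, if you prefer to avoid the UDF, Spark's built-in sort_array should give the same result (a sketch reusing the joined dataframe from above; with descending order, sort_array places null ratings at the end, so element 1 is the second-highest):

import org.apache.spark.sql.functions._

val result2 = joined.withColumn(
    "second_best",
    // sort the translated ratings in descending order and take the second element
    sort_array(
        array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int")),
        false
    )(1)
)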

For your second question:

val joined = df_reviews.as("r1")
    .join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
    .selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
    .as("r2")
    .join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
    .selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
    .as("r3")
    .join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
    .selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")

joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
|  CA|    5|    4|    5|
|  FT|    6|    6|    7|
+----+-----+-----+-----+
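If there are many review columns, the same idea can be written generically by folding over the column names (a sketch under the same df_reviews / df_equiv schemas; reviewCols and the temporary "_t" suffix are just names chosen for this example):

import org.apache.spark.sql.DataFrame

val reviewCols = Seq("rev_1", "rev_2", "rev_3")

// Translate every review column by joining df_equiv once per column.
val translated: DataFrame = reviewCols.foldLeft(df_reviews) { (df, c) =>
    val equiv = df_equiv
        .withColumnRenamed("Valorat", c)      // join key takes the review column's name
        .withColumnRenamed("num", c + "_t")   // translated value under a temporary name
    df.join(equiv, Seq(c), "left")
        .drop(c)                              // drop the original review code
        .withColumnRenamed(c + "_t", c)       // keep the translated value under the original name
}.select("Game", reviewCols: _*)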
