Merging two Datasets with different columns in Spark Structured Streaming (Java)

Problem description

Trying to find a way to merge two different Datasets into one combined Dataset containing all columns.

Dataset<Row> dataActual = rowExtracted.selectExpr(
                "split(value,\"[|]\")[3] as sub_date",
                "split(value,\"[|]\")[7] as status",
                "split(value,\"[|]\")[14] as email_add",
                "split(value,\"[|]\")[15] as source_currency",
                "split(value,\"[|]\")[19] as processing_date"
        );


Dataset<Row> dataStatus = dataActual.select("status").map(
                (MapFunction<Row, String>) row -> mapStatus(row.toString()),
                Encoders.STRING()).selectExpr("value as txn_latest_status").toDF();


Tried union, join, etc., but nothing worked:

    Dataset<Row> data = dataActual.union(dataStatus);

Actual:

Dataset 1 :
root
 |-- sub_date: string (nullable = true)
 |-- status: string (nullable = true)
 |-- email_add: string (nullable = true)
 |-- source_currency: string (nullable = true)
 |-- processing_date: string (nullable = true)

Dataset 2 :
root
 |-- txn_latest_status: string (nullable = true)

Expected result: combined Dataset

root
 |-- sub_date: string (nullable = true)
 |-- status: string (nullable = true)
 |-- email_add: string (nullable = true)
 |-- source_currency: string (nullable = true)
 |-- processing_date: string (nullable = true)
 |-- txn_latest_status: string (nullable = true)
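The expected schema simply appends one derived column to each existing row, so the two sides must stay row-aligned. Conceptually (outside Spark), the transformation keeps every original field and adds the mapped status. A plain-Java sketch of that idea, using a hypothetical `mapStatus` stand-in for the asker's helper:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CombineExample {
    // Hypothetical stand-in for the asker's mapStatus helper.
    static String mapStatus(String status) {
        return status.equals("P") ? "PENDING" : "COMPLETED";
    }

    public static void main(String[] args) {
        // Each row: sub_date, status, email_add, source_currency, processing_date
        List<String[]> rows = List.of(
                new String[]{"2020-01-01", "P", "a@x.com", "USD", "2020-01-02"},
                new String[]{"2020-01-03", "C", "b@x.com", "EUR", "2020-01-04"});

        // Append the derived txn_latest_status to every row, preserving alignment.
        List<String[]> combined = rows.stream()
                .map(r -> {
                    String[] out = Arrays.copyOf(r, r.length + 1);
                    out[r.length] = mapStatus(r[1]); // status is column index 1
                    return out;
                })
                .collect(Collectors.toList());

        System.out.println(combined.get(0)[5]); // derived status of row 1
        System.out.println(combined.get(1)[5]); // derived status of row 2
    }
}
```

This suggests that in Spark itself it is likely simpler to derive `txn_latest_status` as an extra column on `dataActual` (for example via a UDF or a `when`/`otherwise` expression) rather than building a separate single-column Dataset, since that preserves row alignment automatically.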

Tags: spark-structured-streaming

Solution


Please find an example below:

scala> res18.show
+-----+
|names|
+-----+
|    A|
|    B|
+-----+


scala> res19.show
+-------+
|numbers|
+-------+
|      1|
|      2|
+-------+

scala> res18.join(res19).show
+-----+-------+
|names|numbers|
+-----+-------+
|    A|      1|
|    A|      2|
|    B|      1|
|    B|      2|
+-----+-------+
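A `join` with no condition, as above, is a Cartesian (cross) join: every row on the left is paired with every row on the right, which is exactly the 2×2 output shown. In Spark's Java API the equivalent call would presumably be `res18.crossJoin(res19)` (older Spark versions may also require `spark.sql.crossJoin.enabled`). A plain-Java sketch of the pairing semantics:

```java
import java.util.ArrayList;
import java.util.List;

public class CrossJoinExample {
    public static void main(String[] args) {
        List<String> names = List.of("A", "B");
        List<Integer> numbers = List.of(1, 2);

        // Cartesian product: every name is paired with every number,
        // which is what join() with no condition produces above.
        List<String> pairs = new ArrayList<>();
        for (String name : names)
            for (Integer number : numbers)
                pairs.add(name + "," + number);

        System.out.println(pairs);
    }
}
```

Note that a cross join multiplies the row counts, so it matches the question's expected output (one combined row per input row) only when one side has a single row; for the row-aligned schema the asker wants, deriving the status column directly on the original Dataset is likely the safer route.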
