Joining a table with incremental data of the same table

Problem Description

I am trying to implement some logic in Redshift Spectrum where my original table looks like this:

Records in the student table:

1 || student1 || Boston   || 2019-01-01  
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago  || 2019-03-01 
1 || student1 || Dallas   || 2019-03-01

Records in the incremental table studentinc look like this:

1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

By joining the student and studentinc tables, I am trying to get the latest set of records, like this:

2 || student2 || New York  || 2019-02-01
3 || student3 || Chicago   || 2019-03-01
1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

I arrived at this result by taking a UNION of student and studentinc and then querying the unioned result based on max(modified_ts). However, this solution does not scale to large tables; is there a better solution that works by joining the two tables?
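
For reference, a minimal sketch of that union + max(modified_ts) approach in Spark SQL; the column names id, name, city and modified_ts are assumptions based on the sample rows, and student and studentinc are assumed to be queryable as tables:

scala> sql("""
     |   select u.id, u.name, u.city, u.modified_ts
     |   from (select * from student union all select * from studentinc) u
     |   join (select id, max(modified_ts) as max_ts
     |         from (select * from student union all select * from studentinc) t
     |         group by id) m
     |   on u.id = m.id and u.modified_ts = m.max_ts
     | """).show

Every row of the union is scanned twice and then joined back, which is why this becomes expensive on large tables.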

Tags: mysql, sql, apache-spark-sql, amazon-redshift-spectrum

Solution


1. Using Spark-SQL, this can be achieved with not in and union:

scala> var df1 = Seq((1 ,"student1","Boston  " , "2019-01-01"  ),(2 ,"student2","New York" , "2019-02-01"),(3 ,"student3","Chicago " , "2019-03-01" ),(1 ,"student1","Dallas  " , "2019-03-01")).toDF("id","name","country","_date")

Register it as a temporary table:

scala> df1.registerTempTable("temp1")
scala> sql("select * from temp1") .show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  1|student1|Boston  |2019-01-01|
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|Dallas  |2019-03-01|
+---+--------+--------+----------+

The second DataFrame:

scala> var df3 = Seq((1 , "student1", "SFO", "2019-04-01"),(4 , "student4", "Detroit", "2019-04-01")).toDF("id","name","country","_date")
scala> df3.registerTempTable("temp2")

scala> df3.show
+---+--------+-------+----------+
| id|    name|country|     _date|
+---+--------+-------+----------+
|  1|student1|    SFO|2019-04-01|
|  4|student4|Detroit|2019-04-01|
+---+--------+-------+----------+

Not In with a union clause:

scala> sql("select * from (select * from temp1 where id not in (select id from temp2 ) )tt") .union(df3).show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|     SFO|2019-04-01|
|  4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+

2. Using the Spark DataFrame API, which is faster than the IN query, since IN performs a row-by-row operation:

scala> df1.join(df3, Seq("id"), "left_anti").union(df3).show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|     SFO|2019-04-01|
|  4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+
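
The left anti join assumes the incremental table always holds the newest row per id. If either table may contain several versions of the same id, ranking the unioned rows by _date keeps exactly one latest record per id; a minimal sketch using the DataFrames defined above:

scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{col, row_number}

scala> val latest = Window.partitionBy("id").orderBy(col("_date").desc)   // newest _date first within each id
scala> df1.union(df3).withColumn("rn", row_number().over(latest)).where(col("rn") === 1).drop("rn").show

Since the _date strings in the samples sort lexicographically like dates, ordering by the string column works here; with a real timestamp column you would order by that instead, which matches the max(modified_ts) requirement from the question.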

Hope this helps. Let me know if you have any questions about this.

