Hive join between two tables

Problem description

This is the problem: I have this staging table:

key0    key1    timestamp   partition_key
5   5   2020-03-03 14:42:21.548 1
5   4   2020-03-03 14:40:11.871 1
4   3   2020-03-03 14:43:47.602 2

And this target table:

key0    key1    timestamp   partition_key
5   4   2020-03-03 13:43:16.695 1
5   5   2020-03-03 13:45:24.793 1
5   2   2020-03-03 13:47:30.668 1
5   1   2020-03-03 13:48:30.669 1
4   3   2020-03-03 13:53:47.602 2
43  3   2020-03-03 14:00:14.016 2

I want to get this output:

key0    key1    timestamp   partition_key
5   5   2020-03-03 14:42:21.548 1
5   4   2020-03-03 14:40:11.871 1
5   2   2020-03-03 13:47:30.668 1
5   1   2020-03-03 13:48:30.669 1
4   3   2020-03-03 14:43:47.602 2
43  3   2020-03-03 14:00:14.016 2

For each combination of key0, key1, and partition_key, I want the record with the most recent timestamp. In addition, I want to keep records that already exist in the target table but do not exist in the staging table.

I tried first with this query:

select 
t1.key0,
t1.key1,
t1.timestamp,
t2.partition_key
from staging_table t2 
left outer join target_table t1 on 
t1.key0=t2.key0 AND
t1.key1=t2.key1 AND
t1.timestamp=t2.timestamp; 

Tags: sql, hadoop, join, hive

Solution

This looks like a prioritization query -- take everything from staging and then unmatched rows from the target. I'm going to recommend union all:

select s.*
from staging s
union all
select t.*
from target t left join
     staging s
     on t.key0 = s.key0 and t.key1 = s.key1
where s.key0 is null;

This does assume that staging has the most recent rows -- which is true in your sample data. If not, I would phrase this as:

-- Stack both tables, then keep only the newest row per (key0, key1).
select key0, key1, timestamp, partition_key
from (select st.*,
             row_number() over (partition by key0, key1 order by timestamp desc) as seqnum
      from (select s.* from staging s
            union all
            select t.* from target t
           ) st
     ) ranked
where seqnum = 1;
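
The question asks for the latest record per key0, key1, and partition_key, while the queries above match on key0 and key1 only. If the partition key is part of a row's identity, a minimal sketch of the same row_number() approach could look like the following. It assumes the table names staging_table and target_table from the question, and it backtick-quotes timestamp as a precaution, because TIMESTAMP is a reserved word in recent Hive versions.

```sql
-- Sketch only: assumes staging_table and target_table have the same four columns.
-- Rows are grouped by key0, key1 AND partition_key; the newest timestamp wins.
select key0, key1, `timestamp`, partition_key
from (
    select u.*,
           row_number() over (
               partition by key0, key1, partition_key
               order by `timestamp` desc
           ) as seqnum
    from (
        select key0, key1, `timestamp`, partition_key from staging_table
        union all
        select key0, key1, `timestamp`, partition_key from target_table
    ) u
) ranked
where seqnum = 1;
```

On the sample data in the question, this should return the same six rows as the desired output, although the row order is not guaranteed without an explicit order by.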
