首页 > 解决方案 > 按 JSON 数组中的元素选择前 3 个元素

问题描述

源数据是

user_id video_interest
1 [{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]
2 [{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]

输出是

user_id video_interest_top3
1 [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2 [{"category":"f","score":2},{"category":"e","score":1}]

我需要过滤score>0,然后按照score的降序选择每个user_id的top3 video_interest

标签: sqlarraysjsonhivehiveql

解决方案


分解 JSON 数组,提取分数,计算每个用户的最大分数(如果需要按分数 desc 排序最终输出)和 row_number 按分数过滤前 3 个,再次收集数组并在必要时将其连接到 STRING。请参阅代码中的注释。我添加了排序数组和整个输出,因为最初并不清楚应该对什么进行排序:数组或最终输出,如果不需要,请删除 max_score 排序。

演示:

with mytable as (
select stack(2,  
1,'[{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]',
2,'[{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]'
) as (user_id,video_interest)
)

select --collect array and convert to JSON string
      user_id, max_score, concat('[',concat_ws(',',collect_list(category_score)),']') as video_interest
from
(
select user_id, category_score, max_score, score
from
(  
select --extract score, filter and sort
      user_id, vi.category_score, get_json_object(vi.category_score, '$.score') as score,
      row_number() over(partition by user_id order by get_json_object(vi.category_score, '$.score') desc) rn, 
      max(get_json_object(vi.category_score, '$.score')) over (partition by user_id) max_score
from
(--prepare for exploding array
select user_id, regexp_replace(regexp_replace(video_interest,'\\[|\\]',''), --remove []
                          '\\},\\{', '},,,{') as video_interest --replace , between array elements with ,,, to split
  from mytable
)s 
--split and explode
lateral view outer explode(split(video_interest,',,,')) vi as category_score
where get_json_object(vi.category_score, '$.score')>0
)s
where rn<=3 --filter top 3
distribute by user_id sort by score desc --Sort collection, remove if not necessary
)s
group by user_id, max_score
order by max_score desc --Sorting users by max_score desc, remove if not necessary

结果:

user_id max_score   video_interest
1       4           [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2       2           [{"category":"f","score":2},{"category":"e","score":1}]

推荐阅读