首页 > 解决方案 > 通过配置单元中的 get_json_object 将一行分隔为具有唯一 ID 的几行

问题描述

我正在尝试通过 Hive 中的 get_json_object 将一行分成具有唯一问题引用的几行,我可以这样做吗?

我只是在 Hive 中触摸 get_json_object,尝试将一行分隔为具有唯一问题引用(来自一个 json 格式列)的几行,并包含其他列信息。

SELECT  reference AS item_reference, 
        get_json_object(questions, '$.reference') AS question_reference,
        get_json_object(questions, '$.type') AS question_type
from sandbox.exportitems limit 10;

例如,给定输入:

reference  | questions  
NP002_025  | [{"reference":"3dfc54c0","type":"clozeformula"}]
DP001_1_10 | [{"reference":"73879547","type":"imageclozeformula"},
              {"reference":"466a5b88","type":"clozedropdown"}]

预期的输出是:

reference  | questions_reference | questions_type
NP002_025  | 3dfc54c0            | clozeformula  
DP001_1_10 | 73879547            | imageclozeformula  
DP001_1_10 | 466a5b88            | clozedropdown

标签: jsonhive

解决方案


好的,下面是一个例子:

with core as (
SELECT 
    'DP001_1_10' as reference,
    explode(
        split(
            regexp_replace(
                regexp_replace(
                    regexp_replace('[{"reference":"73879547","type":"imageclozeformula"},{"reference":"466a5b88","type":"clozedropdown"}]', '\\]','')
                ,'\\}\\,\\{','\\}\\;\\{')
            ,'\\[','')
        ,'\\;')
    ) as json_str
)
select 
    reference,
    get_json_object(json_str,'$.reference') as questions_reference,
    get_json_object(json_str,'$.type') as questions_type
from 
    core;
+-------------+----------------------+--------------------+--+
|  reference  | questions_reference  |   questions_type   |
+-------------+----------------------+--------------------+--+
| DP001_1_10  | 73879547             | imageclozeformula  |
| DP001_1_10  | 466a5b88             | clozedropdown      |
+-------------+----------------------+--------------------+--+

您可以将示例字符串'DP001_1_10'和替换'[{"reference":"73879547","type":"imageclozeformula"},{"reference":"466a5b88","type":"clozedropdown"}]'为您的列名referenceand questions。所以你想要的最终 hql 可能如下所示:

with core as (
select 
    reference,
    explode(
        split(
            regexp_replace(
                regexp_replace(
                    regexp_replace(questions, '\\]','')
                ,'\\}\\,\\{','\\}\\;\\{')
            ,'\\[','')
        ,'\\;')
    ) as json_str
from
    sandbox.exportitems
)
select 
    reference,
    get_json_object(json_str,'$.reference') as questions_reference,
    get_json_object(json_str,'$.type') as questions_type
from 
    core;

推荐阅读