How to count the frequency of occurrence of some specific words in my data in Hive?

Problem description

I have a Hive table of Twitter objects (JSON formatted) with n rows and 1 column. The task is to count how often certain words, such as 'hon' and 'han', occur across the objects. Each object has an attribute called 'text' that holds a string, and a word counts only once per object even if it occurs there more than once. I wrote the query below.

select count(*) from table_name
where regexp(get_json_object(col_name, '$.text'), 'han')
limit 10

And I get an error message like

FAILED: ParseException line 2:6 cannot recognize input near 'regexp' '(' 'get_json_object' in expression specification

How can I write this query? I also don't know how to ignore case in the regular expression.

Tags: json, hive, mapreduce

Solution


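Use the (?i) modifier for a case-insensitive comparison. The parse error comes from calling regexp like a function: the parser expects REGEXP and RLIKE as infix operators (A RLIKE B). Summing a CASE expression over the rows counts each object at most once per word: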

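select
    -- each row contributes at most 1, even if the word repeats in the text
    sum(case when text rlike '(?i)han' then 1 else 0 end) as cnt_han,
    sum(case when text rlike '(?i)hon' then 1 else 0 end) as cnt_hon
from (
    -- extract the 'text' attribute from each JSON object
    select get_json_object(col_name, '$.text') as text
    from table_name
) s;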
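
Note that the pattern 'han' matches any text containing those three letters, so a tweet containing "thanks" would also be counted. If whole words are wanted, one possible variant adds word boundaries. This is only a sketch: it assumes RLIKE follows Java regular expression syntax and that the backslash has to be doubled inside a Hive string literal, which may depend on the Hive version and client. table_name and col_name are the placeholders from the question.

```sql
-- Sketch: count objects whose 'text' contains 'han' / 'hon' as whole words,
-- ignoring case. \\b is assumed to reach the regex engine as \b (word boundary).
select
    sum(case when text rlike '(?i)\\bhan\\b' then 1 else 0 end) as cnt_han,
    sum(case when text rlike '(?i)\\bhon\\b' then 1 else 0 end) as cnt_hon
from (
    -- rows without a 'text' attribute yield NULL and fall into the else branch
    select get_json_object(col_name, '$.text') as text
    from table_name
) s;
```

It is worth running this on a few rows with known content first to confirm the escaping behaves as expected in your environment.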
