json - How to count the frequency of occurrence of specific words in my data in Hive?
Problem description
I have a Hive table of Twitter objects (JSON formatted) with n rows and 1 column. The task is to count how many objects contain certain words such as 'hon' or 'han'. Each object has an attribute called 'text' that holds a string, and a word should be counted only once per object even if it occurs there more than once. I wrote the query below.
select count(*) from table_name
where regexp(get_json_object(col_name, '$.text'), 'han')
limit 10
And I get an error message like
FAILED: ParseException line 2:6 cannot recognize input near 'regexp' '(' 'get_json_object' in expression specification
How can I write this query? I also don't know how to ignore case in the regular expression.
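Solution
The parse error occurs because Hive treats regexp (and its synonym rlike) as an infix operator, written as A rlike B, not as a function called with parentheses. Use the (?i) modifier for a case-insensitive comparison: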
select
    sum(case when text rlike '(?i)han' then 1 else 0 end) cnt_han,
    sum(case when text rlike '(?i)hon' then 1 else 0 end) cnt_hon
from
(
    select get_json_object(col_name, '$.text') as text
    from table_name
) s;
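Two variations on the query above, offered as a sketch rather than a definitive answer. The first is the smallest change to the query from the question: it puts rlike between the string and the pattern and drops the limit, which has no effect on a single count(*) row. The second assumes that whole words are wanted, which the question does not state. A bare pattern like 'han' also matches inside longer words such as 'thanks', so it wraps each word in \b word boundaries. The doubled backslash assumes Hive's usual string-literal escaping; table_name and col_name are the placeholder names from the question.

```sql
-- Variation 1: single word, closest to the original query.
-- rlike is an infix operator: <string> rlike <pattern>.
select count(*) as cnt_han
from table_name
where get_json_object(col_name, '$.text') rlike '(?i)han';

-- Variation 2 (assumption: only whole words should match).
-- \b is a regex word boundary; it is written as \\b inside the Hive string literal.
-- Rows where text is NULL fall through to the else branch and add 0.
select
    sum(case when text rlike '(?i)\\bhan\\b' then 1 else 0 end) as cnt_han,
    sum(case when text rlike '(?i)\\bhon\\b' then 1 else 0 end) as cnt_hon
from
(
    select get_json_object(col_name, '$.text') as text
    from table_name
) s;
```

In both variations each object adds at most 1 to a count, because rlike only tests whether the pattern occurs somewhere in the text and does not count the number of matches.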