Need help figuring out how REGEX_EXTRACT works in Pig Latin

Problem description

I am trying to understand how the following code example works. It is meant to extract the first Twitter handle mentioned in a tweet:

a = load '/user/pig/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate id, ts, location, LOWER(tweet) as tweet;
c = foreach b generate id, ts, location, REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)',2) as tweet;
d = limit c 5;
dump d;

The data in the file full_text.txt has the following format:

USER_79321756   2010-03-03T04:15:26 ÜT: 47.528139,-122.197916   47.528139   -122.197916 RT @USER_2ff4faca: IF SHE DO IT 1 MORE TIME......IMA KNOCK HER DAMN KOOFIE OFF.....ON MY MOMMA>>haha. #cutthatout
USER_79321756   2010-03-03T04:55:32 ÜT: 47.528139,-122.197916   47.528139   -122.197916 @USER_77a4822d @USER_2ff4faca okay:) lol. Saying ok to both of yall about to different things!:*
USER_79321756   2010-03-03T05:13:34 ÜT: 47.528139,-122.197916   47.528139   -122.197916 RT @USER_5d4d777a: YOURE A FOR GETTING IN THE MIDDLE OF THIS @USER_ab059bdc WHO THE FUCK ARE YOU ? A FUCKING NOBODY !!!!>>Lol! Dayum! Aye!
USER_79321756   2010-03-03T05:28:02 ÜT: 47.528139,-122.197916   47.528139   -122.197916 @USER_77a4822d yea ok..well answer that cheap as Sweden phone you came up on when I call.
USER_79321756   2010-03-03T05:56:13 ÜT: 47.528139,-122.197916   47.528139   -122.197916 A sprite can disappear in her mouth - lil kim hmmmmm the can not the bottle right?

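However, I cannot understand how the call REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)',2) works. Could someone explain in simple terms what the regular expression is searching for in this case, and how the index selects the first Twitter handle?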

Tags: regex, apache-pig

Solution
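
REGEX_EXTRACT takes a string, a Java regular expression, and a group index, and returns the text captured by that parenthesised group (or null when the pattern does not match). The pattern here has four groups. The first (.*) is any text before the handle. @user_ is a literal; the tweet was lowercased in the previous step, so USER_ has become user_. (\S{8}) is exactly eight non-whitespace characters (written \\S in the Pig string because the backslash must be escaped). ([:| ]) is a single character that is a colon, a pipe, or a space; inside [] the | is literal, not "or". The last (.*) is the rest of the tweet. Index 2 therefore returns the eight-character id that follows @user_.

One caveat: the first (.*) is greedy, so for a tweet that mentions several handles the call should return the last handle that is followed by a colon, pipe, or space, not the first one. The Java sketch below imitates the call so this can be checked outside Pig. The class RegexExtractSketch, the helper regexExtract, and the lazy variant (.*?) are my own illustrations, and the whole-string matches() call is an assumption about how Pig applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {

    // The pattern from the question. The backslash is doubled in Java source,
    // just as it is inside the Pig string literal.
    static final String GREEDY = "(.*)@user_(\\S{8})([:| ])(.*)";

    // Hypothetical variant: a lazy first group, so matching stops at the first handle.
    static final String LAZY = "(.*?)@user_(\\S{8})([:| ])(.*)";

    // Rough stand-in for REGEX_EXTRACT(text, regex, index): returns the requested
    // capture group, or null when the pattern does not match.
    static String regexExtract(String text, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(text);
        // Assumption: whole-string match. Since the pattern starts and ends with
        // (.*), find() would capture the same groups for these single-line tweets.
        if (m.matches() && index <= m.groupCount()) {
            return m.group(index);
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened, lowercased versions of tweets from the sample data.
        String oneHandle = "rt @user_2ff4faca: if she do it 1 more time";
        String twoHandles = "@user_77a4822d @user_2ff4faca okay:) lol.";
        String noHandle = "a sprite can disappear in her mouth";

        System.out.println(regexExtract(oneHandle, GREEDY, 2));   // 2ff4faca
        System.out.println(regexExtract(twoHandles, GREEDY, 2));  // 2ff4faca (the last handle)
        System.out.println(regexExtract(twoHandles, LAZY, 2));    // 77a4822d (the first handle)
        System.out.println(regexExtract(noHandle, GREEDY, 2));    // null
    }
}
```

If the goal really is the first handle, changing the leading (.*) to (.*?) in the Pig script should be enough, since the rest of the pattern stays the same.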

