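apache-pig - Count how many words of each length appear in the data, e.g. (8,1) (words, length)
Problem description
The function should output pairs in a format such as <"Length 8", 1> or <"Length 7", 1>, or something similar such as <"8",1>.
To get the length of the string "theWord" in Pig, you apply the SIZE function to each word. To concatenate the word's size with the string "Length", you apply the CONCAT function to each size. Finally, I know that an integer has to be converted to a string before it can be concatenated with another string, which is done with a (CHARARRAY) cast. For example, I would use "(CHARARRAY)SIZE(word)".
I have written the code below, but when I try to dump the data it does not produce what I expect. I think I may need a count function, but I'm a bit confused about how to use it.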
p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_unique = group words_lower by word_lower;
words_with_size = foreach words_unique generate SIZE(words_lower) as size, group;
words_with_size_concat = CONCAT words_with_count BY (CHARARRAY)size(words_lower) DESC, group;
Solution
I figured it out! The code should look like this:
p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_length = foreach words generate CONCAT('Length ', (CHARARRAY)SIZE(word)) as word_length;
words_unique = group words_length by word_length;
words_with_count = foreach words_unique generate COUNT(words_length) as cnt, group;
words_with_count_sorted = ORDER words_with_count BY cnt DESC, group;
store words_with_count_sorted into 'poems/output/wordcount1';
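For comparison, here is a more compact sketch of the same idea. It is not the asker's exact script: it assumes the six poem files are the only ones matching the glob `Poem*.txt`, it drops the LOWER step (lowercasing does not change a word's length), and the relation names are made up for illustration. It also emits the label first and the count second, which is closer to the <"Length 8", 1> shape asked for in the question. The attempt in the question grouped by the word itself, so SIZE was applied to the grouped bag and returned the number of tuples in it rather than the length of a word.

```pig
-- Sketch only: assumes Poem*.txt matches exactly the six input files.
lines = LOAD 'poems/input/Poem*.txt' USING TextLoader() AS (line:chararray);

-- One tuple per word, using the same delimiter set as the original script.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ' ,;:!?\t\n\r\f\\.\\-')) AS word;

-- SIZE on a chararray gives its character count as a long; cast it before CONCAT.
labelled = FOREACH words GENERATE CONCAT('Length ', (chararray)SIZE(word)) AS word_length;

-- Group by the label (not by the word) and count the tuples in each group.
by_length = GROUP labelled BY word_length;
counts = FOREACH by_length GENERATE group AS word_length, COUNT(labelled) AS cnt;

-- Most frequent lengths first; ties broken by label.
sorted = ORDER counts BY cnt DESC, word_length;
DUMP sorted;
```

Replacing DUMP with a STORE statement, as in the script above, writes the result to HDFS instead of printing it.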