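python - Python mrjob - Finding the 10 longest words, but mrjob returns duplicate words
Question
I am using Python mrjob to find the 10 longest words in a text file. I get a result, but it contains duplicate words. How can I get only unique words (that is, remove the duplicates)?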
%%file most_chars.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# matches runs of word characters and apostrophes, used to pull words out of each line
WORD_RE = re.compile(r"[\w']+")


class MostChars(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_find_longest_words)
        ]

    def mapper_get_words(self, _, line):
        # emit every word under the same key (None) so that a single
        # reducer call sees all (length, word) pairs
        for word in WORD_RE.findall(line):
            yield None, (len(word), word.lower().strip())

    # discard the key; it is just None
    def reducer_find_longest_words(self, _, word_count_pairs):
        # each item of word_count_pairs is (length, word),
        # so yielding one results in key=length, value=word
        sorted_pair = sorted(word_count_pairs, reverse=True)
        for pair in sorted_pair[0:10]:
            yield pair


if __name__ == '__main__':
    MostChars.run()
Actual output:
18 "overcapitalization"
18 "overcapitalization"
18 "overcapitalization"
17 "uncomprehendingly"
17 "misunderstandings"
17 "disinterestedness"
17 "disinterestedness"
17 "disinterestedness"
17 "disinterestedness"
17 "conventionalities"
Expected output:
18 "overcapitalization"
17 "uncomprehendingly"
17 "misunderstandings"
17 "disinterestedness"
17 "conventionalities"
plus 5 more unique words
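Solution
Update reducer_find_longest_words so that it keeps only unique elements. Note the use of set(): each pair is converted to a tuple first, because with mrjob's default JSON protocol the pairs reach the reducer as lists, and lists are not hashable.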
    def reducer_find_longest_words(self, _, word_count_pairs):
        # each item of word_count_pairs is (length, word),
        # so yielding one results in key=length, value=word
        # convert each pair to a tuple so it can go into a set, which drops duplicates
        unique_pairs = [list(x) for x in set(tuple(x) for x in word_count_pairs)]
        sorted_pair = sorted(unique_pairs, reverse=True)
        for pair in sorted_pair[0:10]:
            yield pair
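The de-duplication step can also be tried outside mrjob. The following is a minimal sketch in plain Python, not part of the original answer: the helper name longest_unique_words and the sample data are made up for illustration, and heapq.nlargest is swapped in for the full sort plus slice, since only the top n pairs are needed.

```python
import heapq


def longest_unique_words(pairs, n=10):
    """Return the n longest distinct (length, word) pairs, longest first.

    Hypothetical helper for illustration; `pairs` is any iterable of
    [length, word] lists or (length, word) tuples.
    """
    # Lists are unhashable, so convert each pair to a tuple before
    # putting it in a set; the set drops repeated words.
    unique_pairs = {tuple(pair) for pair in pairs}
    # nlargest compares tuples element by element: length first, then word.
    return heapq.nlargest(n, unique_pairs)


if __name__ == "__main__":
    # Made-up sample that mimics what the reducer receives.
    sample = [
        [18, "overcapitalization"],
        [17, "disinterestedness"],
        [18, "overcapitalization"],
        [17, "uncomprehendingly"],
        [17, "disinterestedness"],
    ]
    for length, word in longest_unique_words(sample, n=3):
        print(length, word)
```

Inside the job, the reducer could presumably loop over the result of such a helper and yield each pair, which should give the same top 10 as the sorted() version above.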