How to extract the line index in MrJob using the MapReduce approach?

Problem description

How do you extract the line index of any given line in MrJob?

import re

from mrjob.job import MRJob

# assumed definition; WORD_RE is not shown in the original snippet
WORD_RE = re.compile(r"[\w']+")

index_words = ["before", "remove"]

class MRWordInvertedIndex(MRJob):

    # how to make the key(index) the line index of the corresponding value(line) in the input text file?
    def mapper(self, index, line):
        words = WORD_RE.findall(line.lower())

        for word in words:
            # obtain the line index where 'word' occurs
            if word in index_words:
                yield word, index    # where index is the line number

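Is it possible to make the mapper's key parameter (aka index) the actual line index of the line in the corresponding input text file, or to obtain the line index in some other way?

For example, suppose the input text file is: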

# copyright laws for your country before downloading or redistributing 
# this or any other Project Gutenberg eBook. BLANK LINE BELOW.

# This header should be the first thing seen when viewing this Project 
# Gutenberg file.  Please do not remove it.

Then the line indices of the index words before and remove would be:

"before": 1
"remove": 4

since before occurs on line 1 and remove occurs on line 4.

Tags: python, multithreading, parallel-processing, mapreduce, mrjob

Solution
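
One possible approach, offered as a sketch rather than a confirmed answer: with mrjob's default input protocol the mapper receives each line as the value and None as the key, and because the input may be split across several mappers, a global line number is not generally available inside the job. A common workaround is to number the lines in a preprocessing step and have the mapper parse that number back out. The helper names below (number_lines, map_numbered_line), the tab separator, and the WORD_RE pattern are assumptions made for illustration; they are not part of mrjob.

```python
import re

# Assumed word pattern; the original question does not show WORD_RE.
WORD_RE = re.compile(r"[\w']+")
INDEX_WORDS = {"before", "remove"}


def number_lines(lines, start=1):
    """Preprocessing step: prefix every line with its line number and a tab.

    Blank lines are numbered too; change this if they should be skipped.
    """
    for line_no, line in enumerate(lines, start=start):
        yield f"{line_no}\t{line.rstrip(chr(10))}"


def map_numbered_line(numbered_line):
    """Mapper logic: split off the line-number prefix, then emit (word, line_no).

    The body of this function could be used inside an MRJob mapper, with
    numbered_line taking the place of the mapper's line argument.
    """
    line_no, _, text = numbered_line.partition("\t")
    for word in WORD_RE.findall(text.lower()):
        if word in INDEX_WORDS:
            yield word, int(line_no)
```

For instance, list(map_numbered_line("4\tPlease do not remove it.")) evaluates to [("remove", 4)]. The numbered file would be written once before the job runs, so the line numbers survive however the input is later split.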

