Reducer for word-count spam/ham classification

Problem description

I am writing a reducer (Python 3) for Hadoop Streaming, and it does not work correctly on input such as the following:

data = 'dog\t1\t1\ndog\t1\t1\ndog\t0\t1\ndog\t0\t1\ncat\t0\t1\ncat\t0\t1\ncat\t1\t1\n'

import re
import sys

# initialize trackers
current_word = None

spam_count, ham_count = 0,0

# read from standard input
# Substitute read from a file


for line in data.splitlines():
#for line in sys.stdin:
# parse input
    word, is_spam, count = line.split('\t')
    count = int(count)

    if word == current_word:

        if is_spam == '1':
            spam_count += count
        else:
            ham_count += count
    else:
        if current_word:
        # word to emit...
            if spam_count:
               print("%s\t%s\t%s" % (current_word, '1', spam_count))
            print("%s\t%s\t%s" % (current_word, '0', ham_count))

        if is_spam == '1':
            current_word, spam_count = word, count
        else:
            current_word, ham_count = word, count



if current_word == word:
    if is_spam == '1':
        print(f'{current_word}\t{is_spam}\t{spam_count}')
    else:
        print(f'{current_word}\t{is_spam}\t{spam_count}')

I get:

#dog    1   2
#dog    0   2
#cat    1   3

The two "spam" dogs and the two "ham" dogs come out fine. The cats do not. It should be:

#dog    1   2
#dog    0   2
#cat    0   2
#cat    1   1

Tags: python, hadoop-streaming, word-count, reducers

Solution


The reason: you need to reset ham_count as well, not just update spam_count (and vice versa).

Rewrite

if is_spam == '1':
    current_word, spam_count = word, count
else:
    current_word, ham_count = word, count

as

if is_spam == '1':
    current_word, spam_count = word, count
    ham_count = 0
else:
    current_word, ham_count = word, count
    spam_count = 0

However, the output still will not match yours exactly, because:
1) you always print spam_count first (but in your sample output, "cat ham" is emitted earlier);
2) the final output block emits only spam or only ham, depending on the current value of the is_spam variable; I assume you actually intended to emit both, right?

The output: 
dog 1   2
dog 0   2
cat 1   1

- the count for "cat spam" is correct, but the "cat ham" line is missing; I think you should print at least something like this:

Rewrite this code

if current_word == word:
    if is_spam == '1':
        print(f'{current_word}\t{is_spam}\t{spam_count}')
    else:
        print(f'{current_word}\t{is_spam}\t{spam_count}')

as

print(f'{current_word}\t{1}\t{spam_count}')
print(f'{current_word}\t{0}\t{ham_count}')

- and the complete output will be:

dog 1   2
dog 0   2
cat 1   1
cat 0   2
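Putting both fixes together (resetting the opposite counter on every word change, and emitting both counts for the final word), here is a minimal self-contained sketch of the corrected reducer, run against the sample data; the `emit` helper is my own addition for clarity, a real reducer would print directly:

```python
data = 'dog\t1\t1\ndog\t1\t1\ndog\t0\t1\ndog\t0\t1\ncat\t0\t1\ncat\t0\t1\ncat\t1\t1\n'

current_word = None
spam_count, ham_count = 0, 0
emitted = []  # collected in a list here so the result is easy to inspect


def emit(word, spam, ham):
    # Emit both the spam and the ham count for a finished word.
    emitted.append(f'{word}\t1\t{spam}')
    emitted.append(f'{word}\t0\t{ham}')


for line in data.splitlines():  # substitute sys.stdin under Hadoop Streaming
    word, is_spam, count = line.split('\t')
    count = int(count)

    if word != current_word:
        if current_word is not None:
            emit(current_word, spam_count, ham_count)
        current_word = word
        spam_count, ham_count = 0, 0  # reset BOTH counters on a word change

    if is_spam == '1':
        spam_count += count
    else:
        ham_count += count

if current_word is not None:  # flush the final word
    emit(current_word, spam_count, ham_count)

print('\n'.join(emitted))
```

Note that this variant always prints spam before ham for each word, so the cat lines come out in the opposite order from your expected output; under Hadoop Streaming the counts themselves are what matter.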

Itertools
Besides, the itertools module is a great fit for tasks like this:

import itertools    

splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])

grouped is an itertools.groupby object, which is a generator; so be careful, it is lazy and yields its values only once (I show the output here just as an example, since it consumes the generator's values):

[(gr_name, list(gr)) for gr_name, gr in grouped] 
Out:
[('dog',
  [['dog', '1', '1'],
   ['dog', '1', '1'],
   ['dog', '0', '1'],
   ['dog', '0', '1']]),
 ('cat', [['cat', '0', '1'], ['cat', '0', '1'], ['cat', '1', '1']])]
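To make the laziness concrete, a quick sketch (with a shortened sample) of what happens if you iterate the groupby object a second time:

```python
import itertools

data = 'dog\t1\t1\ndog\t1\t1\ncat\t1\t1\n'
splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])

first_pass = [name for name, _ in grouped]   # consumes the generator
second_pass = [name for name, _ in grouped]  # already exhausted: empty

print(first_pass)   # ['dog', 'cat']
print(second_pass)  # []
```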

OK, and now each group can be grouped again, this time by its is_spam flag:

import itertools    

def sum_group(group):
    """
    Sum the counts (last column) over the rows of one sub-group.

    >>> sum_group([['dog', '1', '1'], ['dog', '1', '1']])
    2
    """
    return sum(int(i[-1]) for i in group)

splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])

[(name, [(tag_name, sum_group(sub_group))
         for tag_name, sub_group 
         in itertools.groupby(group, lambda x: x[1])])
 for name, group in grouped]
Out:
[('dog', [('1', 2), ('0', 2)]), ('cat', [('0', 2), ('1', 1)])]

The complete example with itertools:

import itertools 


def emit_group(name, tag_name, group):
    tag_sum = sum([int(i[-1]) for i in group])
    print(f"{name}\t{tag_name}\t{tag_sum}")  # emit here
    return (name, tag_name, tag_sum)  # return the same data


splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])


emitted = [[emit_group(name, tag_name, sub_group) 
            for tag_name, sub_group 
            in itertools.groupby(group, lambda x: x[1])]
            for name, group in  grouped]
Out:
dog 1   2
dog 0   2
cat 0   2
cat 1   1

- emitted contains the list of tuples with the same data. Since this is a lazy approach, it works perfectly with streams; if you are interested, here is a good itertools tutorial.
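For reference, the same approach wrapped as a generator that could consume sys.stdin directly in a Hadoop Streaming reducer (Hadoop's shuffle guarantees the input is already sorted by key, which groupby requires); `reduce_stream` is a name I am introducing here, not part of any API:

```python
import itertools


def reduce_stream(lines):
    """Yield (word, is_spam, total) triples from tab-separated, key-sorted lines."""
    rows = (line.rstrip('\n').split('\t') for line in lines)
    for word, group in itertools.groupby(rows, lambda r: r[0]):
        # Within one word, sub-group consecutive rows by the is_spam flag.
        for is_spam, sub in itertools.groupby(group, lambda r: r[1]):
            yield word, is_spam, sum(int(r[-1]) for r in sub)


# Under Hadoop Streaming you would pass sys.stdin; here, the sample data:
sample = ['dog\t1\t1', 'dog\t1\t1', 'dog\t0\t1', 'dog\t0\t1',
          'cat\t0\t1', 'cat\t0\t1', 'cat\t1\t1']
results = list(reduce_stream(sample))
for word, is_spam, total in results:
    print(f'{word}\t{is_spam}\t{total}')
```

Because everything is a generator until the final loop, memory use stays constant no matter how large the input stream is.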

