Map of maps in pyspark

Problem description

I have an RDD that looks like this:

print rdd.collect():
[
    ('id2', u'lion'),
    ('id5', u'dolphin'),
    ('id2', u'tiger'),
    ('id2', u'lion'),
    ('id3', u'dolphin'),
    ('id3', u'monkey'),
]

Is it possible to create a map in pyspark that counts the occurrences of each animal per id? For example:

id2: {lion: 2, tiger: 1}, id3: {dolphin:1, monkey: 1}, id5: {dolphin: 1}

Tags: python, apache-spark, pyspark

Solution


In Python, you can use collections.Counter to count the occurrences of each animal, but you need a separate counter for each item ID.

You can create a dictionary of counters like this:

import collections

items = [
    ('id2', u'lion'),
    ('id5', u'dolphin'),
    ('id2', u'tiger'),
    ('id2', u'lion'),
    ('id3', u'dolphin'),
    ('id3', u'monkey'),
]

counters = collections.defaultdict(collections.Counter)
for item_id, animal in items:
    counters[item_id][animal] += 1
print(counters)

Output:

defaultdict(<class 'collections.Counter'>,
            {'id2': Counter({'lion': 2, 'tiger': 1}),
             'id3': Counter({'dolphin': 1, 'monkey': 1}),
             'id5': Counter({'dolphin': 1})})

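The result is a dictionary that maps each item ID to a Counter of its animals. Note that items here is a plain Python list, such as the one returned by rdd.collect(), so the counting runs on the driver rather than on the cluster.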
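If the RDD is too large to collect, one possible approach is to do the same counting inside Spark with RDD.aggregateByKey, using a Counter as the per-key accumulator. The following is only a minimal sketch: the names seq_op, comb_op and aggregate_partition are chosen here for illustration and are not part of the original answer. The Spark call itself is shown as a comment, and the two partitions are simulated with plain lists so that the snippet runs without a Spark installation.

```python
import collections


def seq_op(counter, animal):
    # Fold one animal into the counter that belongs to a single key.
    counter[animal] += 1
    return counter


def comb_op(left, right):
    # Merge two counters for the same key (Counter.update adds the counts).
    left.update(right)
    return left


# With a live SparkContext, the RDD from the question could be reduced with:
#   result = rdd.aggregateByKey(collections.Counter(), seq_op, comb_op).collectAsMap()

# Assumption: a local stand-in for an RDD split into two partitions.
partitions = [
    [('id2', u'lion'), ('id5', u'dolphin'), ('id2', u'tiger')],
    [('id2', u'lion'), ('id3', u'dolphin'), ('id3', u'monkey')],
]


def aggregate_partition(pairs):
    # Apply seq_op within one partition, starting each key from an empty Counter.
    acc = {}
    for key, animal in pairs:
        acc[key] = seq_op(acc.get(key, collections.Counter()), animal)
    return acc


# Combine the per-partition results with comb_op.
result = {}
for partial in map(aggregate_partition, partitions):
    for key, counter in partial.items():
        result[key] = comb_op(result[key], counter) if key in result else counter

# Convert the counters to plain dicts, as in the format the question asks for.
plain = {key: dict(counter) for key, counter in result.items()}
print(plain)
```

Keeping seq_op and comb_op as separate functions mirrors how aggregateByKey works: the first adds a single value to an accumulator inside a partition, and the second merges accumulators built on different partitions.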

