python - Map of maps in PySpark
Problem description
I have an RDD that looks like this:
print rdd.collect():
[
('id2', u'lion'),
('id5', u'dolphin'),
('id2', u'tiger'),
('id2', u'lion'),
('id3', u'dolphin'),
('id3', u'monkey'),
]
Is it possible in PySpark to build a map that counts the occurrences of each animal per id? For example:
id2: {lion: 2, tiger: 1}, id3: {dolphin:1, monkey: 1}, id5: {dolphin: 1}
Solution
In plain Python, you can use collections.Counter to count the number of occurrences of each animal. Since the counts are needed per ID, you need one counter for each item ID.
You can create a dictionary of counters like this:
import collections
items = [
('id2', u'lion'),
('id5', u'dolphin'),
('id2', u'tiger'),
('id2', u'lion'),
('id3', u'dolphin'),
('id3', u'monkey'),
]
# Each missing key gets a fresh, empty Counter on first access.
counters = collections.defaultdict(collections.Counter)
for item_id, animal in items:
    counters[item_id][animal] += 1
print(counters)
Output:
defaultdict(<class 'collections.Counter'>,
{'id2': Counter({'lion': 2, 'tiger': 1}),
'id3': Counter({'dolphin': 1, 'monkey': 1}),
'id5': Counter({'dolphin': 1})})
The result is a dictionary of counters.
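The loop above runs on a local list. To do the same counting on the RDD itself, one option is `aggregateByKey`, which takes a zero value, a function that folds one value into a partition-local accumulator, and a function that merges two accumulators. The sketch below is a minimal illustration, not a tested Spark job: the helper names `add_animal`, `merge_counters`, `aggregate_partition` and `merge_partitions` are made up for this example, and the two hand-written "partitions" only simulate how Spark would split the data, so that the snippet runs without a cluster.

```python
import collections
from functools import reduce


def add_animal(counter, animal):
    """Fold one animal into a partition-local Counter (the seqFunc role)."""
    counter[animal] += 1
    return counter


def merge_counters(left, right):
    """Merge two Counters built on different partitions (the combFunc role)."""
    left.update(right)  # Counter.update adds counts instead of replacing them
    return left


# With a live SparkContext the helpers would be passed to the RDD like this
# (assumption: `rdd` is the pair RDD from the question):
#   counts = rdd.aggregateByKey(collections.Counter(), add_animal, merge_counters)
#   result = {item_id: dict(counter) for item_id, counter in counts.collect()}

# Local simulation of two partitions holding the question's data.
partitions = [
    [('id2', u'lion'), ('id5', u'dolphin'), ('id2', u'tiger')],
    [('id2', u'lion'), ('id3', u'dolphin'), ('id3', u'monkey')],
]


def aggregate_partition(pairs):
    """Apply add_animal per key within one partition."""
    local = {}
    for item_id, animal in pairs:
        local[item_id] = add_animal(local.get(item_id, collections.Counter()), animal)
    return local


def merge_partitions(left, right):
    """Apply merge_counters per key across two partition results."""
    for item_id, counter in right.items():
        left[item_id] = merge_counters(left.get(item_id, collections.Counter()), counter)
    return left


merged = reduce(merge_partitions, map(aggregate_partition, partitions), {})

# Convert the Counters to plain dicts to match the format asked for.
result = {item_id: dict(counter) for item_id, counter in merged.items()}
print(result)
```

Only the final `collect()` would bring data back to the driver in the Spark version, so this approach should suit cases where the full RDD is too large to collect but the per-ID counts are not.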