python-3.x - Most efficient way to join two dictionaries, group them by a different key, and sum the values
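Question
I am currently struggling with a case where I need to join dictionaries, then group them by a different key, and finally sum the values. Honestly, I cannot find an efficient way to do this without iterating over my dictionaries several times. I will quickly explain my steps.
- I query a given MongoDB collection and get back a list of dictionaries in this format: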
# Each document has unique _id
results = [{"_id": "111111111111111111111111", "countryCode": "SG"},
{"_id": "111111111111111111111112", "countryCode": "SG"},
{"_id": "111111111111111111111113", "countryCode": "CN"},
{"_id": "111111111111111111111114", "countryCode": "SG"},
{"_id": "111111111111111111111115", "countryCode": "SG"},
{"_id": "111111111111111111111116", "countryCode": "DE"}]
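I use pagination for this, so the query lazily returns batches of 1000 documents (lists of documents).
- For each _id in the list above, I query another store with an aggregation query, in which I basically group and compute some statistics for the given _id. Here is an example of my query: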
def query_events(collection_id: str, ids: list):
    q = [
        {"$match": {"recordId": {"$in": ids}}},
        {"$unwind": {...}},
        {"$project": {...}},
        {"$group": {...}},
        {"$group": {...}}
    ]
    coll = db.get_collection(collection_id)
    return list(coll.aggregate(q))
- I post-process the results of this query and turn them into one big dictionary. The format is as follows:
changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3},
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12},
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
"111111111111111111111114": {},
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43},
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11},
}
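The lazy batching mentioned in the first step can be sketched roughly as follows. This is only an illustration: iter_batches and fake_cursor are hypothetical names that do not appear in the question, and a generator stands in for the real database cursor.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(documents: Iterable[dict], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size documents."""
    iterator = iter(documents)
    while True:
        # islice pulls at most batch_size items without loading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a database cursor (assumption): any iterable of documents works.
fake_cursor = ({"_id": f"{i:024d}", "countryCode": "SG"} for i in range(2500))
batch_sizes = [len(batch) for batch in iter_batches(fake_cursor)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch can then be processed and discarded before the next one is read, so only one batch of documents is held in memory at a time.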
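Now my goal is to merge the dictionary from the third step with the list of dictionaries from the first step, so that each entry of the dictionary also carries its country. I tried something like this: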
for item in results:
    _id = item['_id']
    if _id in changes:
        del item['_id']
        changes[_id].update(item)
After that I get something like this:
changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3, "countryCode": "SG"},
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12, "countryCode": "SG"},
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55, "countryCode": "CN"},
"111111111111111111111114": {"countryCode": "SG"},
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43, "countryCode": "SG"},
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11, "countryCode": "DE"},
}
But the final result I need is a dictionary (or a list of dictionaries) whose keys are the countryCode values and whose values are dictionaries of the form {"ADDED": <sum>, "MODIFIED": <sum>, "REMOVED": <sum>}
Like this:
# as a dict
{
"DE" : {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11},
"SG" : {"ADDED" : 37, "MODIFIED" : 27 , "REMOVED" :58},
"CN" : {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
}
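Right now my only idea is to iterate over my dictionary and add each country to a new dictionary as a key (if it does not exist yet), with {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11} as the value for that key. If the country already exists in the dict, I iterate over the record's values and add each category (ADDED, REMOVED, MODIFIED) to the running sum. But I think this may not be efficient. I have to fetch data from a collection with more than 10 million records, so iterating over these dictionaries could be very slow.
My approach (probably not efficient):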
results = [{"_id": "111111111111111111111111", "countryCode": "SG"},
{"_id": "111111111111111111111112", "countryCode": "SG"},
{"_id": "111111111111111111111113", "countryCode": "CN"},
{"_id": "111111111111111111111114", "countryCode": "SG"},
{"_id": "111111111111111111111115", "countryCode": "SG"},
{"_id": "111111111111111111111116", "countryCode": "DE"}]
changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3},
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12},
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
"111111111111111111111114": {},
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43},
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11},
}
# Update dict with countryCode
for item in results:
    _id = item.pop('_id')
    if _id in changes:
        changes[_id].update(item)
# Now changes looks like:
changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3, "countryCode": "SG"},
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12, "countryCode": "SG"},
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55, "countryCode": "CN"},
"111111111111111111111114": {"countryCode": "SG"},
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43, "countryCode": "SG"},
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11, "countryCode": "DE"},
}
# Last step: group data by country and sum up statistics
final_results = {}
for k, v in changes.items():
    country = v.pop("countryCode", "UNKNOWN")
    if country not in final_results:
        final_results[country] = v
    else:
        for ck, cv in v.items():
            if ck not in final_results[country]:
                final_results[country][ck] = cv
            else:
                final_results[country][ck] += cv
# After this for loop I achieved my results:
final_results = {
"DE" : {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11},
"SG" : {"ADDED" : 37, "MODIFIED" : 27 , "REMOVED" :58},
"CN" : {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
}
Even though I got my result, maybe there is a better way to do this.
Do you know how I can achieve my goal, which is to sum the values of each category (ADDED, REMOVED, MODIFIED) at the country level?
Thanks!
Answer
You can use itertools.groupby + functools.reduce + collections.Counter + operator.add:
- Import the necessary libraries:
from functools import reduce
from collections import Counter
import operator as op
import itertools as it
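- We change the structure of the results list so that each entry is keyed by its _id:
results = [{r['_id']: {'countryCode': r['countryCode']}} for r in results]
- Then we update the changes dictionary using the results list.
# This relies on changes and results listing the same _ids in the same order
for index, key in enumerate(changes.keys()):
    changes[key].update(results[index][key])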
- Finally, we use itertools.groupby to group the data by the countryCode key. aggregations is a list of Counters, for example: [Counter({'ADDED': 3, 'MODIFIED': 3, 'REMOVED': 1, 'countryCode': 'DE'}), Counter(...)]. We use reduce to sum the Counter objects in that list.
output = dict()
# groupby only groups consecutive items, so a country can appear in several groups
for country, group in it.groupby(changes.values(), lambda d: d['countryCode']):
    aggregations = [Counter(item) for item in group]
    for agg in aggregations:
        del agg['countryCode']
    aggregations = reduce(op.add, aggregations)
    # Merge with any totals already stored for this country
    output[country] = aggregations if country not in output else output[country] + aggregations
# Without this conversion you would get {'CN': Counter({...}), ...}
output = {key: dict(value) for key, value in output.items()}
Output:
{'CN': {'ADDED': 3, 'MODIFIED': 3, 'REMOVED': 55},
'DE': {'ADDED': 44, 'MODIFIED': 52, 'REMOVED': 11},
'SG': {'ADDED': 37, 'MODIFIED': 27, 'REMOVED': 58}}
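As a possible alternative, the join and the grouping can be done in a single pass with collections.defaultdict and collections.Counter. The sketch below is not part of the answer above: sum_by_country, country_by_id and totals are hypothetical names, and the sample data is copied from the question. It looks up each country by _id rather than by position, and it does not modify results or changes.

```python
from collections import Counter, defaultdict

results = [{"_id": "111111111111111111111111", "countryCode": "SG"},
           {"_id": "111111111111111111111112", "countryCode": "SG"},
           {"_id": "111111111111111111111113", "countryCode": "CN"},
           {"_id": "111111111111111111111114", "countryCode": "SG"},
           {"_id": "111111111111111111111115", "countryCode": "SG"},
           {"_id": "111111111111111111111116", "countryCode": "DE"}]
changes = {
    "111111111111111111111111": {"ADDED": 15, "MODIFIED": 12, "REMOVED": 3},
    "111111111111111111111112": {"ADDED": 12, "REMOVED": 12},
    "111111111111111111111113": {"ADDED": 3, "MODIFIED": 3, "REMOVED": 55},
    "111111111111111111111114": {},
    "111111111111111111111115": {"ADDED": 10, "MODIFIED": 15, "REMOVED": 43},
    "111111111111111111111116": {"ADDED": 44, "MODIFIED": 52, "REMOVED": 11},
}


def sum_by_country(results, changes, totals=None):
    """Add the per-record counts in changes to per-country totals."""
    if totals is None:
        totals = defaultdict(Counter)
    # Look up the country by _id instead of relying on matching order.
    country_by_id = {doc["_id"]: doc["countryCode"] for doc in results}
    for record_id, counts in changes.items():
        country = country_by_id.get(record_id, "UNKNOWN")
        # Counter.update adds counts in place and keeps zero values,
        # whereas Counter + Counter drops counts that are not positive.
        totals[country].update(counts)
    return totals


totals = sum_by_country(results, changes)
final_results = {country: dict(counter) for country, counter in totals.items()}
print(final_results)
```

When the data arrives in batches, the same totals object can be passed back in for every batch (sum_by_country(batch_results, batch_changes, totals)), so only the per-country sums need to stay in memory.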