Most efficient way to join two dictionaries, group them by a different key, and sum the values

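Problem description

I am currently struggling with a case where I need to join dictionaries, then group them by a different key, and finally sum the values. Honestly, I cannot find an efficient way to do it without iterating over my dictionaries several times. I will quickly explain what my steps look like.

  1. I query a MongoDB collection and get back a list of dictionaries in this format: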
# Each document has unique _id

results = [{"_id": "111111111111111111111111", "countryCode": "SG"}, 
{"_id": "111111111111111111111112", "countryCode": "SG"}, 
{"_id": "111111111111111111111113", "countryCode": "CN"}, 
{"_id": "111111111111111111111114", "countryCode": "SG"}, 
{"_id": "111111111111111111111115", "countryCode": "SG"}, 
{"_id": "111111111111111111111116", "countryCode": "DE"}]

I use pagination for this, so the query results come back lazily in pages of 1000 documents (each page is a list of documents).
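
A minimal sketch of this kind of lazy batching, assuming the query result can be consumed as a plain Python iterable. The helper name in_batches and the fake_cursor generator are hypothetical stand-ins for illustration; they are not part of the original code or of any MongoDB driver.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def in_batches(documents: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        # islice pulls at most `size` items without materializing the rest.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for a database cursor: any iterable of documents works.
fake_cursor = ({"_id": str(i), "countryCode": "SG"} for i in range(2500))
batch_sizes = [len(batch) for batch in in_batches(fake_cursor)]
print(batch_sizes)  # [1000, 1000, 500]
```

Only one batch is held in memory at a time, which matters when the source collection has millions of records.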

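  2. For each _id in the list above, I query another store using an aggregation query, in which I basically group and compute some statistics for the given _id. Here is an example of my query: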

def query_events(collection_id: str, ids: list):
    q = [
        {"$match": {"recordId": {"$in": ids}}},
        {"$unwind": {...}},
        {"$project": {...}},
        {"$group": {...}},
        {"$group": {...}}
    ]
    coll = db.get_collection(collection_id)
    return list(coll.aggregate(q))
  3. I do some processing on the result of this query and convert it into one big dictionary. The format is as follows:

changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3}, 
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12}, 
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55}, 
"111111111111111111111114": {}, 
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43}, 
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11}, 
}

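Now my goal is to merge this dictionary (point 3) with the list of dictionaries from point 1, so that the dictionary from point 3 also carries the country information. I tried something like this: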

for item in results:
    _id = item['_id']
    if _id in changes:
        del item['_id']
        changes[_id].update(item)

After that I get something like this:


changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3, "countryCode": "SG"}, 
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12, "countryCode": "SG"}, 
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55, "countryCode": "CN"}, 
"111111111111111111111114": {"countryCode": "SG"}, 
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43, "countryCode": "SG"}, 
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11, "countryCode": "DE"}, 
}


But the final result I need is a dictionary (or a list of dictionaries) where the key is the countryCode and the value is a dictionary {"ADDED": <sum>, "MODIFIED": <sum>, "REMOVED": <sum>}

Like this:

# as a dict

{
"DE" : {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11},
"SG" : {"ADDED" : 37, "MODIFIED" : 27 , "REMOVED" :58},
"CN" : {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
}


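Right now my only idea is to iterate over my dictionary and add each country to a new dictionary as a key (if it does not exist yet), with {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11} as that key's value. If the country already exists in the dict, I iterate over the record's values and add them to the running sum for each category (ADDED, REMOVED, MODIFIED). But I think this may not be efficient. I have to fetch data from a collection with more than 10 million records, so iterating over these dictionaries could be very slow.

My approach (probably not efficient):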


results = [{"_id": "111111111111111111111111", "countryCode": "SG"}, 
{"_id": "111111111111111111111112", "countryCode": "SG"}, 
{"_id": "111111111111111111111113", "countryCode": "CN"}, 
{"_id": "111111111111111111111114", "countryCode": "SG"}, 
{"_id": "111111111111111111111115", "countryCode": "SG"}, 
{"_id": "111111111111111111111116", "countryCode": "DE"}]

changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3}, 
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12}, 
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55}, 
"111111111111111111111114": {}, 
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43}, 
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11}, 
}

# Update dict with countryCode

for item in results:
    _id = item.pop('_id')
    if _id in changes:
        changes[_id].update(item)


# Now changes looks like:

changes = {
"111111111111111111111111": {"ADDED": 15, "MODIFIED" : 12, "REMOVED" : 3, "countryCode": "SG"}, 
"111111111111111111111112": {"ADDED": 12, "REMOVED" : 12, "countryCode": "SG"}, 
"111111111111111111111113": {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55, "countryCode": "CN"}, 
"111111111111111111111114": {"countryCode": "SG"}, 
"111111111111111111111115": {"ADDED": 10, "MODIFIED" : 15, "REMOVED" : 43, "countryCode": "SG"}, 
"111111111111111111111116": {"ADDED": 44, "MODIFIED" : 52, "REMOVED" : 11, "countryCode": "DE"}, 
}


# Last step group data by country and sum-up statistics

final_results = {}

for k, v in changes.items():
    country = v.pop("countryCode", "UNKNOWN")
    if country not in final_results:
        final_results[country] = v
    else:
        for ck, cv in v.items():
            if ck not in final_results[country]:
                final_results[country][ck] = cv
            else:
                final_results[country][ck] += cv


# After this for loop I achieved my results:

final_results = {
"DE" : {"ADDED" : 44, "MODIFIED" : 52 , "REMOVED" :11},
"SG" : {"ADDED" : 37, "MODIFIED" : 27 , "REMOVED" :58},
"CN" : {"ADDED": 3, "MODIFIED" : 3, "REMOVED" : 55},
}


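Even though I got the result I wanted, maybe there is a better way to do this.

Do you know how I can achieve my goal: summing the values for each category (ADDED, REMOVED, MODIFIED) at the country level?

Thanks!

Tags: python-3.x, mongodb, list, dictionary, for-loop

Solution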


You can use itertools.groupby + functools.reduce + collections.Counter + operator.add
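
Before walking through the steps, here is a small sketch of two behaviours of these building blocks that are easy to trip over. The rows list is made-up sample data for illustration only. It shows that itertools.groupby only groups consecutive items (so the same country can come back more than once on unsorted input) and how Counter addition combines counts.

```python
from collections import Counter
from itertools import groupby

# Made-up sample rows, deliberately not sorted by country.
rows = [
    {"countryCode": "SG", "ADDED": 1},
    {"countryCode": "CN", "ADDED": 2},
    {"countryCode": "SG", "ADDED": 3},
]

# groupby starts a new group whenever the key changes, so "SG" appears twice.
runs = [key for key, _ in groupby(rows, key=lambda d: d["countryCode"])]
print(runs)  # ['SG', 'CN', 'SG']

# Counter addition sums matching keys and keeps keys present on one side only.
total = Counter({"ADDED": 15, "MODIFIED": 12}) + Counter({"ADDED": 12, "REMOVED": 12})
print(dict(total))  # {'ADDED': 27, 'MODIFIED': 12, 'REMOVED': 12}

# Entries whose summed count is zero or negative are dropped by "+".
no_zero = Counter({"ADDED": 0}) + Counter({"REMOVED": 5})
print(dict(no_zero))  # {'REMOVED': 5}
```

In practice this means the grouped totals either need sorted input or a merge step for countries that have already been seen, and categories that sum to zero will be missing from the result.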

  1. Import the necessary libraries:
from functools import reduce
from collections import Counter
import operator as op
import itertools as it
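  2. We restructure the results list so that each entry is keyed by _id:
results = [{r['_id']: {'countryCode': r['countryCode']}} for r in results]
  3. Then we update the changes dictionary using the results list. Looking entries up by _id avoids relying on both collections being in the same order:
for r in results:
    for key, value in r.items():
        if key in changes:
            changes[key].update(value)
  4. Finally, we use itertools.groupby to group the data by the countryCode key. aggregations is a list of Counter objects, for example: [Counter({'ADDED': 3, 'MODIFIED': 3, 'REMOVED': 1, 'countryCode': 'DE'}), Counter(...)]. We use reduce to sum the Counter objects in that list. Because groupby only groups consecutive items, a country can show up in more than one group, so the last line of the loop merges into any total already stored for that country.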
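output = dict()
for g, group in it.groupby(changes.values(), lambda d: d['countryCode']):
  # One Counter per record in this run of consecutive same-country records
  aggregations = [Counter(i) for i in group]
  for agg in aggregations:
    # countryCode is a string, so drop it before summing the counts
    del agg['countryCode']
  aggregations = reduce(op.add, aggregations)
  # Merge with the running total if this country was already seen
  output[g] = aggregations if g not in output else output[g] + aggregations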

# Without this conversion you would get {'CN': Counter({...}), ...}
output = {key: dict(value) for key, value in output.items()}

Output:

{'CN': {'ADDED': 3, 'MODIFIED': 3, 'REMOVED': 55},
 'DE': {'ADDED': 44, 'MODIFIED': 52, 'REMOVED': 11},
 'SG': {'ADDED': 37, 'MODIFIED': 27, 'REMOVED': 58}}
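
As an alternative to the groupby approach, here is a sketch of a single-pass variant that uses collections.defaultdict with Counter. It is not part of the original answer. The names country_by_id and totals are introduced here for illustration, and the "UNKNOWN" fallback mirrors the one used in the question. It makes one pass over results to build a lookup and one pass over changes to accumulate, without mutating either input.

```python
from collections import Counter, defaultdict

results = [
    {"_id": "111111111111111111111111", "countryCode": "SG"},
    {"_id": "111111111111111111111112", "countryCode": "SG"},
    {"_id": "111111111111111111111113", "countryCode": "CN"},
    {"_id": "111111111111111111111114", "countryCode": "SG"},
    {"_id": "111111111111111111111115", "countryCode": "SG"},
    {"_id": "111111111111111111111116", "countryCode": "DE"},
]

changes = {
    "111111111111111111111111": {"ADDED": 15, "MODIFIED": 12, "REMOVED": 3},
    "111111111111111111111112": {"ADDED": 12, "REMOVED": 12},
    "111111111111111111111113": {"ADDED": 3, "MODIFIED": 3, "REMOVED": 55},
    "111111111111111111111114": {},
    "111111111111111111111115": {"ADDED": 10, "MODIFIED": 15, "REMOVED": 43},
    "111111111111111111111116": {"ADDED": 44, "MODIFIED": 52, "REMOVED": 11},
}

# One pass over results: map each _id to its country.
country_by_id = {doc["_id"]: doc["countryCode"] for doc in results}

# One pass over changes: add each record's counts to its country's Counter.
totals = defaultdict(Counter)
for _id, stats in changes.items():
    country = country_by_id.get(_id, "UNKNOWN")
    # Counter.update adds to existing counts instead of replacing them.
    totals[country].update(stats)

# Convert the Counters back to plain dicts.
final_results = {country: dict(counts) for country, counts in totals.items()}
print(final_results)
```

The work is linear in the number of records and needs no sorting. When the data arrives in pages, the same totals object can keep accumulating across pages, so only the per-country sums have to stay in memory.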
