How to use a dictionary to speed up lookup and counting tasks?

Problem description

Consider the following code snippet:

data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
       "col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
                                                # 20,00,000 such rows

list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]

# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]

Given the huge dataframe and the 2 lists, I want to count how many elements of new_list also occur in the dataframe. In the pseudo-example above the result would be 3, because "aaa-fff", "ccc-ggg" and "ddd-ccc" each appear as col1-col2 of a single row of the dataframe.

Right now I am using a linear search algorithm, but it is very slow because I have to scan the entire dataframe:

df['col3'] = df['col1'] + "-" + df['col2']
c1 = 0
for a in list_a:
    for b in list_b:
        str1 = a + "-" + b
        str2 = b + "-" + a
        # each pair triggers a full scan of the column -- this is the slow part
        c2 = (df['col3'].str.contains(str1).sum()) + (df['col3'].str.contains(str2).sum())
        c1 += c2
return c1

Can someone help me implement a faster algorithm, preferably using a dictionary data structure?

Note: I have to iterate over the 7,000 rows of another dataframe, build the 2 lists dynamically for each of those rows, and get the aggregate count per row.

Tags: python, pandas, dataframe, dictionary

Solution


Here is another way. First, I used your definitions of df (with 2 columns), list_a and list_b.

# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']

# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} | 
     { f'{b}-{a}' for a, b in zip(list_a, list_b)})

# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)

3 
 {'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
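
Since the question asks for a dictionary specifically, the same lookup can also be done with collections.Counter (a dict subclass). Below is a minimal sketch that reuses the df, list_a and list_b defined above and, like the set solution, pairs the two lists element-wise; unlike the set version it also counts rows that occur more than once.

from collections import Counter

# count how often each "col1-col2" pair occurs in the dataframe;
# lookups in the resulting dict-like object are O(1)
pair_counts = Counter(df['col1'] + '-' + df['col2'])

# sum the counts of every a-b and b-a pair built from the two lists
total = sum(pair_counts[f'{a}-{b}'] + pair_counts[f'{b}-{a}']
            for a, b in zip(list_a, list_b))
print(total)   # 3 for the example data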

Update to handle duplicate values:

# build list (not set) from list_a and list_b
idx =  ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
        [ f'{b}-{a}' for a, b in zip(list_a, list_b) ])

# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()

# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]

# results:
ddd-ccc    1
aaa-fff    1
ccc-ggg    1
Name: col3, dtype: int64
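
If each of the 7,000 outer rows only needs one aggregate number rather than the per-pair breakdown, summing the filtered counts gives it (a small sketch, assuming the tmp and idx built above):

# total number of matching rows, duplicate rows in df included
total = tmp[tmp.index.isin(idx)].sum()
print(total)   # 3 for the example data

Since df itself does not change across those 7,000 iterations, tmp (or the Counter above) only needs to be built once; each outer row then just rebuilds idx and performs the cheap lookups.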
