python - How can I use a dictionary to speed up lookup and counting tasks?
Question
Consider the following code snippet:
import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])  # my actual dataframe has
                                                   # 20,00,000 such rows
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
# new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
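Given a huge dataframe and two lists, I want to count how many elements of new_list also appear in the dataframe. In the toy example above the result would be 3, because "aaa-fff", "ccc-ggg" and "ddd-ccc" each occur within a single row of the dataframe.
Right now I am using a linear search, but it is very slow because I have to scan the entire dataframe.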
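def linear_count(df, list_a, list_b):  # wrapper added so that `return` is valid
    df['col3'] = df['col1'] + "-" + df['col2']
    c1 = 0  # running total over all (a, b) combinations
    for a in list_a:
        for b in list_b:
            str1 = a + "-" + b
            str2 = b + "-" + a
            # each str.contains call scans the whole column, hence the slowness
            c2 = (df['col3'].str.contains(str1).sum()) + (df['col3'].str.contains(str2).sum())
            c1 += c2
    return c1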
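Can someone help me implement a faster algorithm, preferably using a dictionary data structure?
Note: I have to iterate over the 7,000 rows of another dataframe, build the two lists dynamically for each row, and get the aggregate count for each row.
Solution
Here is another way. First, I used your definitions of df (with 2 columns), list_a and list_b.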
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} |
{ f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
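The snippet above pairs list_a and list_b element by element with zip, which reproduces the expected count of 3. The loop in the question, however, tries every combination of the two lists. If that reading is the intended one, a minimal sketch with itertools.product might look like this (the names all_pairs and matches are illustrative only, and the sample data is copied from the question):

```python
from itertools import product

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# every (a, b) combination in both orders; the set drops repeated pairs
all_pairs = set()
for a, b in product(list_a, list_b):
    all_pairs.add(f"{a}-{b}")
    all_pairs.add(f"{b}-{a}")

# exact-match intersection with the combined column
col3 = df["col1"] + "-" + df["col2"]
matches = set(col3) & all_pairs
print(len(matches), matches)
```

With the sample data this finds four distinct matches rather than three, because "aaa-eee" also qualifies once all combinations are considered. Which pairing is correct depends on what new_list is meant to contain.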
Update: handling duplicate values.
# build list (not set) from list_a and list_b
idx = ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
[ f'{b}-{a}' for a, b in zip(list_a, list_b) ])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
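Since the question asks for a dictionary and mentions repeating the count for about 7,000 different list pairs, one option is to build the lookup table once and reuse it. The following is only a sketch under assumptions: count_matches and row_counts are hypothetical names, the lists are paired with zip as in the code above, and plain Python lists stand in for the dataframe columns.

```python
from collections import Counter

# Sample data copied from the question; with a DataFrame, pass
# df["col1"] and df["col2"] to zip() instead of these lists.
col1 = ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"]
col2 = ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]

# Build the lookup table once: (col1, col2) -> number of rows with that pair
row_counts = Counter(zip(col1, col2))


def count_matches(list_a, list_b, row_counts):
    """Total number of rows matching (a, b) or (b, a) for each zipped pair."""
    wanted = set()
    for a, b in zip(list_a, list_b):
        wanted.add((a, b))
        wanted.add((b, a))
    # Counter returns 0 for missing keys, so no membership test is needed
    return sum(row_counts[pair] for pair in wanted)


list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]
print(count_matches(list_a, list_b, row_counts))  # 3 for the sample data
```

Building row_counts is a single pass over the large dataframe. Each later call only costs time proportional to the length of the two lists, so the big table is never rescanned. Tuple keys also sidestep any ambiguity if a value itself contains "-", and the lookup is an exact match, unlike the substring match done by str.contains.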