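duplicates - Is there a way to reduce the runtime of this code that removes partial duplicates?
Problem description
This code removes partial duplicates within a single column of data. However, I suspect that because every row is matched against every other row, it takes a long time to run even on a dataset of only 2,000 rows. Is there any way to reduce the runtime?
Here is the code: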
from fuzzywuzzy import fuzz, process

rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

clean = []
threshold = 80  # this is arbitrary

for row in rows:
    # score each sentence against each other sentence
    # [('string', score),..]
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # basic idea is if there is a close second match we want to evaluate
    # and keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])

# remove dupes
clean = set(clean)
Solution
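No answer was recorded here, so the following is only a sketch of one possible direction, not a verified fix. Much of the cost in the loop above likely comes from re-processing every string on every `process.extract` call and from scoring each pair twice. The sketch below tokenizes each row once, compares each unordered pair once, and groups near-duplicates with a small union-find, keeping the longest row per group. Note the assumptions: it uses only the standard library, it swaps in Jaccard overlap of word sets as the similarity measure (this is not the same score as `fuzz.token_set_ratio`), and the names `tokens`, `dedupe_partial`, and the `0.6` threshold are hypothetical and would need tuning on real data.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def dedupe_partial(rows, threshold=0.6):
    """Keep the longest row from each group of near-duplicate rows.

    Similarity is Jaccard overlap of word sets, an assumption made for this
    sketch; it is NOT equivalent to fuzz.token_set_ratio.
    """
    token_sets = [tokens(r) for r in rows]  # tokenize each row only once
    parent = list(range(len(rows)))         # union-find forest over row indices

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit each unordered pair once instead of scoring row-vs-all per row.
    for i, j in combinations(range(len(rows)), 2):
        a, b = token_sets[i], token_sets[j]
        union = len(a | b)
        if union and len(a & b) / union >= threshold:
            parent[find(i)] = find(j)       # merge the two groups

    # Keep the longest string in each group.
    best = {}
    for i, row in enumerate(rows):
        root = find(i)
        if root not in best or len(row) > len(best[root]):
            best[root] = row
    return list(best.values())
```

Calling `dedupe_partial(rows)` on the list from the question would return one representative per group of similar sentences. The pairwise loop is still quadratic in the number of rows, but each comparison is a cheap set operation. If adding a dependency is acceptable, the RapidFuzz library offers similarly named `process.extract` and `fuzz.token_set_ratio` functions and may be worth trying as a replacement for fuzzywuzzy; whether it is fast enough for this dataset is untested here.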