Is there a way to reduce the runtime of this code that removes partial duplicates?

Problem description

This code removes partial duplicates within a single column of data. Because, I assume, of the process of matching every row against every other row, it takes a very long time to run even on a dataset of only 2,000 rows. Is there any way to reduce the runtime?

Here is the code:

from fuzzywuzzy import fuzz, process

rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

clean = []
threshold = 80  # this is arbitrary
for row in rows:
    # score this sentence against every sentence, including itself;
    # results come back sorted descending as [('string', score), ...],
    # so scores[0] is always the row itself with a score of 100
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # basic idea: if there is a close second match, treat it as a
    # partial duplicate and keep the longer of the two strings
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])
# remove the exact duplicates left over
clean = set(clean)

Tags: duplicates, runtime, data-cleaning, data-processing, fuzzywuzzy

Solution
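The quadratic structure is the bottleneck: `process.extract` rescores the query against all n rows, so the loop performs on the order of n² `token_set_ratio` calls (about 4,000,000 for n = 2,000), and each call does relatively expensive Levenshtein work. Two common remedies are switching from fuzzywuzzy to rapidfuzz (a much faster, largely API-compatible reimplementation) and replacing the per-pair scorer with something cheaper. The sketch below takes the second route: it tokenizes each row once up front and compares rows with a plain Jaccard similarity over token sets, implemented in pure Python. This is an approximation of `token_set_ratio`, not a drop-in replacement, so the threshold of 60 is an assumption that would need retuning on real data.

```python
rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

def jaccard(a, b):
    """Jaccard similarity of two token sets, scaled to 0-100."""
    if not a or not b:
        return 0
    return 100 * len(a & b) // len(a | b)

# tokenize each row exactly once instead of re-tokenizing in every comparison
token_sets = [set(r.lower().split()) for r in rows]

threshold = 60  # assumption: Jaccard scores run lower than token_set_ratio
clean = []
for i, row in enumerate(rows):
    # find the best-scoring row other than the row itself
    best_j = max((j for j in range(len(rows)) if j != i),
                 key=lambda j: jaccard(token_sets[i], token_sets[j]))
    if jaccard(token_sets[i], token_sets[best_j]) > threshold:
        # close match found: keep the longer of the two strings
        clean.append(max(row, rows[best_j], key=len))
    else:
        clean.append(row)
# remove the exact duplicates left over
clean = set(clean)
```

This still visits every pair, but each comparison is a cheap set intersection rather than a Levenshtein computation. For much larger datasets the pair count itself can be reduced with blocking (only compare rows that share at least one rare token), or the whole score matrix can be computed in native code with rapidfuzz's `process.cdist`.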

