List elements considering only each one's 3 nearest neighbors

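Problem description

I have a list of strings, some of which actually have the same content (for example, a synopsis) but with small differences.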

I want to find the similar strings. One possible approach is to use the similarity ratio from SequenceMatcher in difflib.

from difflib import SequenceMatcher
import itertools

mylist = [

"I say,",
"It's in the reach of my arms",
"The span of my hips,",
"The stride of my step,",
"The curl of my lips.",
"I'm a woman",
"Phenomenally.",
"Phenomenal woman,",
"That's me.",
"I say.",
"It's the fire in my eyes,",
"And the flash of my teeth,",
"The swing in my waist,",
"And the joy in my feet.",
"I'm a woman.",
"Phenomenally!",
"Phenomenal women,",
"That's us.",
]

for a, b in itertools.combinations(mylist, 2):
    score = SequenceMatcher(None, a, b).ratio()
    if score >= 0.90:
        print(a + " TO " + b + " : " + str(score))

Output:

I'm a woman TO I'm a woman. : 0.9565217391304348
Phenomenally. TO Phenomenally! : 0.9230769230769231
Phenomenal woman, TO Phenomenal women, : 0.9411764705882353

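When the list gets long, producing the output takes a long time, so I am thinking of sorting the list and measuring similarity only against the 3 nearest neighbors of each string/element.

For example, element #1 of the sorted list is compared only with #2, #3, and #4. Element #10 of the sorted list is compared only with [#7, #8, #9] and [#11, #12, #13].

So I tried: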

mylist.sort()

# Compare each element only with the next 3 elements in sorted order.
# Looking forward is enough: every pair within 3 positions is visited
# exactly once, so a second, backward-looking loop would only repeat pairs.
for num, content in enumerate(mylist):
    for a in mylist[num + 1:num + 4]:
        score = SequenceMatcher(None, a, content).ratio()
        if score >= 0.90:
            # score is a float, so convert it before concatenating
            print(a + " TO " + content + " : " + str(score))

It runs much faster on long lists. But I am wondering: is there a better way? Thank you.

Tags: python, list

Solution


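In my opinion, using the Levenshtein distance might work better. The standard dynamic-programming algorithm computes it in O(m·n) time for two strings of lengths m and n. According to Wikipedia: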

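The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.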

You can check this implementation for reference. An excerpt from the link:

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y]
    matrix = np.zeros((size_x, size_y), dtype=int)
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # substitution is free when the characters already match
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,         # deletion
                matrix[x-1, y-1] + cost,    # match or substitution
                matrix[x, y-1] + 1          # insertion
            )
    return int(matrix[size_x - 1, size_y - 1])
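
To connect this distance back to the question, here is a minimal sketch of how it could replace SequenceMatcher in the sorted-neighbor loop. It is only an illustration under stated assumptions: the helper names `similarity` and `similar_neighbors`, the `window` and `threshold` parameters, and the normalization `1 - distance / max(len)` are hypothetical choices, not part of the original answer. The distance itself is computed with two rolling rows in pure Python so that the block needs no NumPy.

```python
def levenshtein(a, b):
    """Edit distance using two rolling rows instead of a full matrix."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similar_neighbors(strings, window=3, threshold=0.90):
    """Compare each string only with the next `window` strings in sorted order."""
    ordered = sorted(strings)
    pairs = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:i + 1 + window]:
            score = similarity(current, other)
            if score >= threshold:
                pairs.append((current, other, score))
    return pairs
```

Note that this score is not on the same scale as `SequenceMatcher.ratio()`, so a threshold of 0.90 may need retuning. As in the question, the sorted window only catches pairs that differ near the end of the string; strings that differ in their first characters can sort far apart and be missed.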

Or, if you would rather use a library, you can use python-Levenshtein.
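
If the comparison stays with difflib as in the question, the standard library offers another way to cut the cost per pair. This is a different technique from the Levenshtein suggestion above: `SequenceMatcher.real_quick_ratio()` and `SequenceMatcher.quick_ratio()` are documented as upper bounds on `ratio()`, so they can reject most dissimilar pairs before the expensive call. The function name `is_similar` and its `threshold` parameter are hypothetical; this is a sketch, not a benchmarked solution.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.90):
    """Return True when SequenceMatcher.ratio() reaches the threshold.

    real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
    so a pair that fails either bound is rejected without the full comparison.
    """
    matcher = SequenceMatcher(None, a, b)
    return (matcher.real_quick_ratio() >= threshold
            and matcher.quick_ratio() >= threshold
            and matcher.ratio() >= threshold)
```

Because the bounds never fall below the true ratio, the result is the same as testing `ratio() >= threshold` directly; only the amount of work for clearly different strings changes.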

