Counting strings in a list in Python, then filtering and matching

Problem description

I have a list of words, and using python3 I count the number of differing letters between every pair of words (using a clever diff_summing algorithm from this site):

import itertools

def diff_letters(a,b):
    return sum ( a[i] != b[i] for i in range(len(a)) )

w = ['AAHS','AALS','DAHS','XYZA']

for x,y in itertools.combinations(w,2):
    if diff_letters(x,y) == 1:
        print(x,y)

This prints:

AAHS AALS
AAHS DAHS

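My question: how can I count and record that the strings "DAHS" and "AALS" have only one partner, while "AAHS" has two? I will be filtering the combinations so that each target_string has exactly one near_matching_word, so my final data (as JSON) would look like this: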

[
 {
   "target_word": "DAHS",
   "near_matching_word": "AAHS"
 },
 {
   "target_word": "AALS",
   "near_matching_word": "AAHS"
 }
]

(Note that AAHS does not appear as a target_word.)

I have a version that uses functools.reduce:

import itertools
import functools
import operator

def diff_letters(a,b):
    return sum ( a[i] != b[i] for i in range(len(a)) )

w = ['AAHS','AALS','DAHS','XYZA']

pairs = []
for x,y in itertools.combinations(w,2):
    if diff_letters(x,y) == 1:
        #print(x,y)
        pairs.append((x,y))

full_list = functools.reduce(operator.add, pairs)
for x in full_list:
    if full_list.count(x) == 1:
        print (x)

which prints:

AALS
DAHS

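But then I have to go back to my big list pairs to find the near_matching_word. Of course, in my final version the list pairs will be much larger, and the target_word can be either the first or the second item of the tuple (x,y).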

Tags: python, arrays, python-3.x, algorithm, string-matching

Solution


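The other answers keep every pair, even when more than one is found for a string. Since those extra pairs are not needed, that seems like a waste of memory. This answer keeps at most one pair per string.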

import collections
import itertools

def diff_letters(a,b):
    return sum ( a[i] != b[i] for i in range(len(a)) )

w = ['AAHS','AALS','DAHS','XYZA']

# Marker for pairs that have not been found yet.
NOT_FOUND = object()

# Collection of found pairs x => y. Each item is in one of three states:
# - y is NOT_FOUND if x has not been seen yet
# - y is a string if it is the only accepted pair for x
# - y is None if there is more than one accepted pair for x
pairs = collections.defaultdict(lambda: NOT_FOUND)

for x,y in itertools.combinations(w,2):
    if diff_letters(x,y) == 1:
        if pairs[x] is NOT_FOUND:
            pairs[x] = y
        else:
            pairs[x] = None
        if pairs[y] is NOT_FOUND:
            pairs[y] = x
        else:
            pairs[y] = None

# Drop the entries marked None (more than one partner) and change into a normal dict.
pairs = {x: y for x, y in pairs.items() if y is not None}

for x, y in pairs.items():
    print("Target = {}, Only near matching word = {}".format(x, y))

Output:

Target = AALS, Only near matching word = AAHS
Target = DAHS, Only near matching word = AAHS
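
To get the JSON shape asked for in the question, the filtered result can be turned into a list of records. The following is only a minimal sketch, assuming the same word list as above. The helper name unique_near_matches is hypothetical and not part of the original answer, and diff_letters is written with zip here, which assumes all words have the same length.

```python
import itertools
import json

def diff_letters(a, b):
    # Count the positions where the two equal-length words differ.
    return sum(ca != cb for ca, cb in zip(a, b))

def unique_near_matches(words):
    # Hypothetical helper: maps each word to its single near match,
    # or to None once a second near match has been seen.
    pairs = {}
    for x, y in itertools.combinations(words, 2):
        if diff_letters(x, y) == 1:
            # Record the match in both directions, since the target
            # can be either item of the tuple.
            for target, match in ((x, y), (y, x)):
                pairs[target] = None if target in pairs else match
    # Keep only the words that ended up with exactly one near match.
    return [
        {"target_word": target, "near_matching_word": match}
        for target, match in pairs.items()
        if match is not None
    ]

w = ['AAHS', 'AALS', 'DAHS', 'XYZA']
records = unique_near_matches(w)
print(json.dumps(records, indent=1))
```

Because both directions are recorded as each pair is found, there is no need to go back through a list of pairs afterwards to look up the near_matching_word.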
