Iterate through two lists one element at a time

Problem description

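I have two identical lists. I want to take the first element of list 1 and compare it against every element of list 2; once that is done, I want to take the second element of list 1 and repeat, until every element of one list has been compared with every element of the other.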

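I have built a Levenshtein distance model and can successfully loop a single string (which I hard-coded) through my second list. However, I need to make this more practical: the target strings should come from a list, and the code should switch to the next element once the previous one has been compared against the whole second list. I then want it to return only the values above a certain threshold, e.g. 80.00.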

my_list = address['Street'].tolist()
my_list

# Import numpy to perform the matrix algebra necessary to calculate the fuzzy match
import numpy as np
# Define a function that will become the fuzzy match
# I decided to use Levenshtein Distance due to the formula's ability to handle string comparisons of two different lengths
def string_match(seq1, seq2, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of seq1 and the
        first j characters of seq2
    """
    # Initialize matrix of zeros
    rows = len(seq1)+1
    cols = len(seq2)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate the first column and first row with the indices of each character.
    # Two separate loops, so the borders are also filled when one string is empty.
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # loop through the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if seq1[row-1] == seq2[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
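    # Read the result from the bottom-right cell; indexing with the loop variables
    # (row, col) raises a NameError when either string is empty.
    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio
        total_len = len(seq1) + len(seq2)
        if total_len == 0:
            return 100.0  # Assumption: two empty strings count as identical
        Ratio = round((total_len - distance[-1][-1]) / total_len * 100, 2)
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert seq1 to seq2
        return distance[-1][-1]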


Prev_addrs = my_list

target_addr = "830 Amsterdam ave"
for addr in Prev_addrs:
    distance = string_match(target_addr, addr, ratio_calc = True)
    print(distance)

Tags: python

Solution


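Ignoring all the code in your question that I consider irrelevant, here is how to do what I take to be the essence of your question, going by its title and first paragraph.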

import itertools

def compare(a, b):
    print('compare({}, {}) called'.format(a, b))

list1 = list('ABCD')
list2 = list('EFGH')

for a, b in itertools.product(list1, list2):
    compare(a, b)

Output:

compare(A, E) called
compare(A, F) called
compare(A, G) called
compare(A, H) called
compare(B, E) called
compare(B, F) called
compare(B, G) called
compare(B, H) called
compare(C, E) called
compare(C, F) called
compare(C, G) called
compare(C, H) called
compare(D, E) called
compare(D, F) called
compare(D, G) called
compare(D, H) called
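
To connect this back to the question, the same itertools.product loop can call a similarity function and keep only the pairs that score above a threshold. The following is a minimal sketch, not the asker's exact code: similarity and matches_above are hypothetical names, the sample addresses are made up, and similarity is a compact pure-Python stand-in for the question's string_match(..., ratio_calc=True), using the same substitution cost of 2.

```python
import itertools


def similarity(a, b):
    """Levenshtein-style similarity in percent (substitution cost 2).

    Stand-in for the question's string_match(a, b, ratio_calc=True);
    keeps only two rows of the distance matrix at a time.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return round((total - prev[-1]) / total * 100, 2)


def matches_above(list1, list2, threshold=80.0):
    """Compare every element of list1 with every element of list2 and
    return (a, b, score) for the pairs scoring above the threshold."""
    results = []
    for a, b in itertools.product(list1, list2):
        score = similarity(a, b)
        if score > threshold:
            results.append((a, b, score))
    return results


# Made-up sample data, for illustration only
addresses = ["830 Amsterdam ave", "830 Amsterdam avenue", "12 Main st"]

for a, b, score in matches_above(addresses, addresses):
    print(a, "|", b, "|", score)
```

Because the two lists here are identical, every string also matches itself at 100.0 and each pair shows up in both orders. If that is not wanted, itertools.combinations(addresses, 2) visits each unordered pair once and skips the self-comparisons.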
