Inconsistent results from fuzz.token_sort_ratio()?

Problem description

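I have a question about the fuzz.token_sort_ratio() function. Suppose I have a dataset of multiple entities that can be reconciled by name and address. I am using rapidfuzz to generate similarity scores, but the scores I get are not consistent.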

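Please see the code below. fuzz.token_sort_ratio() produces a similarity score of 59.30 in the "Address Similarity 2" column, but if I run the last line to score the same two addresses directly, the result is 28.09?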

Where is my mistake?

from rapidfuzz import process, utils, fuzz
import pandas as pd
import numpy as np

test_anui_data = {'Processed Client Name': ['anhui jinhan clothing co ltd'], 'Processed Aruvio Name': ['anhui jinhan clothing co ltd'], 'Processed Client Address': ['high new technology development zones huainan city anhui province china anhui anhui any city'] , 'Processed Aruvio Address': ['industrial park of funan city'],  'Name Similarity': [89.2857142857142],  'Address Similarity': [np.nan]}  
  
# Create DataFrame  
test_anui = pd.DataFrame(test_anui_data)  
test_anui


no_name_match = test_anui[(test_anui['Name Similarity'].isnull()) & (test_anui['Name Similarity']!='')]
no_name_match['Name Similarity 2'] = fuzz.token_sort_ratio(str(no_name_match['Processed Client Name']), str(no_name_match['Processed Aruvio Name']))


no_address_match = test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
no_address_match['Address Similarity 2'] = fuzz.token_sort_ratio(str(no_address_match['Processed Client Address']), str(no_address_match['Processed Aruvio Address']))

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
test_anui_append = pd.concat([no_name_match, no_address_match])
test_anui_append.to_clipboard()

test_anui_append

print('the address similarity is different? ', fuzz.token_sort_ratio('high new technology development zones huainan city anhui province china anhui anhui any city', 'industrial park of funan city'))

Tags: python, pandas, merge, fuzzy, rapidfuzz

Solution

