How to check for a match between a list of reference strings and a target string?

Problem description

I have a list of reference strings that was converted from a DataFrame.

Reference string list

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']

Sample input 1 for description_list

VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.

Sample input 2 for description_list

Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9

Expected output

SEIKO 44-9990 #extract together with model name
Seiko NE57 #extract together with model name

Here is my sample code, but the output is not what I want

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pandas

stop_words = set(stopwords.words('english'))

def clean(doc):
    # Lowercase and tokenize the text, then drop English stop words.
    word_tokens = word_tokenize(doc.lower())
    return [w for w in word_tokens if w not in stop_words]

# soup_content (the parsed page) and model_list are defined elsewhere.
description_list = clean(soup_content.find('blockquote', {"class": "postcontent restore"}).text)

if pandas.Series(np.array(description_list)).isin(np.array(brand_list)).any():
    # Print the first token that is a known brand.
    brand_result = [i for i in description_list if i in brand_list]
    print(brand_result[0])

    if pandas.Series(np.array(description_list)).isin(np.array(model_list)).any():
        # Print the first token that is a known model.
        model_result = [i for i in description_list if i in model_list]
        print(model_result[0])
    else:
        print('Unknown')
else:
    # Neither brand nor model could be determined.
    print('Unknown')
    print('Unknown')

Tags: python, pandas, numpy

Solution


I would go with a regular expression.

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
regular_expression = rf"({'|'.join(brand_list)}) ([^\s]+)"

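A few words about this regular expression:

  • We use the string prefix rf"", which makes the string both raw (as the re module needs, so backslashes are kept literally) and formattable (variables go inside braces {}).
  • '|'.join(brand_list) produces an alternation such as scurfa|seagull; wrapped in parentheses, it matches and captures any brand from brand_list.
  • Adding ([^\s]+) captures the word that follows the brand (assumed to be the model name).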
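To make the construction above concrete, here is a minimal sketch that prints the pattern it builds. It also shows an optional, slightly stricter variant that is not part of the original answer: the names unique_brands, alternation, and pattern are illustrative, and the de-duplication, re.escape, and \b additions are assumptions about what may be useful when the brand list comes from real data.

```python
import re

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']

# Assumption: drop duplicate brands (keeping their order) and escape each one,
# in case a brand name ever contains regex metacharacters such as '.' or '+'.
unique_brands = list(dict.fromkeys(brand_list))
alternation = '|'.join(re.escape(brand) for brand in unique_brands)

# \b stops the brand from matching at the end of a longer word;
# \S+ is equivalent to [^\s]+ and captures the word after the brand.
pattern = re.compile(rf"\b({alternation})\s+(\S+)", re.IGNORECASE)

print(pattern.pattern)  # \b(scurfa|seagull|seiko)\s+(\S+)
```

Compiling once with re.IGNORECASE means the flag does not have to be repeated on every call.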

Finally:

import re

description = """
VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.
Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9
Testing for a ScURFA 42342
"""

print([" ".join(t) for t in re.findall(regular_expression, description, re.IGNORECASE)])

This gives:

['SEIKO 44-9990', 'Seiko NE57', 'ScURFA 42342']
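The question's code prints 'Unknown' when nothing matches. A small wrapper around the same regular expression could reproduce that behaviour for a single description; this is only a sketch, and the function name extract_brand_model and its default parameter are hypothetical, not part of the original answer.

```python
import re

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
regular_expression = rf"({'|'.join(brand_list)}) ([^\s]+)"

def extract_brand_model(text, default="Unknown"):
    """Return 'brand model' for the first brand found in text, else default."""
    match = re.search(regular_expression, text, re.IGNORECASE)
    if match is None:
        return default
    # Group 1 is the brand, group 2 is the word that follows it.
    return " ".join(match.groups())

print(extract_brand_model("VINTAGE KING SEIKO 44-9990 Gold Medallion"))   # SEIKO 44-9990
print(extract_brand_model("Multilayered dial with SuperLuminova BG-W9"))  # Unknown
```

re.search returns only the first match, which suits one listing per description; re.findall, as used above, is the better fit when a text may mention several watches.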
