Remove items from a list if they do not match a substring, regardless of formatting

Problem description

I have the following dataframe:

import pandas as pd

df = pd.DataFrame()
df['full_string'] = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
                     ['ApplesandBananas', 'apples and Bananas', 'bananas']]
df['substring'] = ['apples and bananas', 'apples and bananas']

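The goal is to keep only the items in df['full_string'] that contain the text found in df['substring'], regardless of formatting such as letter case, spacing, or surrounding text.

Desired outcome: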

df['outcome'] = [['apples and bananas', 'applesandbananasamongstothers'], 
      ['ApplesandBananas', 'apples and Bananas', 'bananas']]

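What I tried was to take the first keyword of df['substring'] and use it as a matcher against df['full_string']. However, this does not let me keep the 'bananas' element in the second row of the dataframe.

My attempt, which does not work well on the dummy data: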

first_keyword = []
for i in df['substring']:
    first_keyword.append(i.split(' ', 1)[0])

df['first_keyword'] = first_keyword

df['C'] = [x[0].lower() in (x[1].lower()) for x in zip(df['first_keyword'], df['full_string'])]

Tags: python, string, substring, matching

Solution


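To keep the example simple, I chose to use plain lists containing your dummy data. You will need to adapt it to your problem. Also, I interpreted your sentence about keeping the items "that contain the text found in df['substring']" as text = word, i.e. an item is kept as soon as it contains any single word of a substring.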

full_str = ['apples and bananas', 'applesandbananasamongstothers', 'something else', 
           'ApplesandBananas', 'apples and Bananas', 'bananas']
sub_str = ['apples and bananas', 'red and blue']

# Extract words from sub strings
words_in_sub = [elt.split() for elt in sub_str]
# Flatten and remove duplicates
words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))

# Init output
output = list()
# Loop on the strings in full string
for full_s in full_str:
    # Loop on the words to look for
    for word in words_in_sub:
        if word.lower() in full_s.lower():
            output.append(full_s)
            break

Output:

In: output
Out: 
['apples and bananas',
 'applesandbananasamongstothers',
 'ApplesandBananas',
 'apples and Bananas',
 'bananas']

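Lower/upper case is handled in the if condition, by lowercasing both sides. Spacing is handled by the in statement, because each word is looked up on its own. The presence of other text in full_s is also handled by the in statement: it returns True if the word appears anywhere in the string. The only case where it returns False although the word could be considered present is when the word itself is split in two by a space, e.g. 'bana naan dapp les'. That example would not be kept in the output list.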
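The behaviour described above can be checked with a small sketch. The helper name `contains_any_word` is hypothetical and introduced only for illustration; it wraps the same case-insensitive `in` test used in the loop.

```python
def contains_any_word(full_s, words):
    """Return True if any word occurs in full_s, ignoring letter case."""
    haystack = full_s.lower()
    return any(word.lower() in haystack for word in words)


words = ['apples', 'and', 'bananas']

# Case differences and missing spaces do not matter
print(contains_any_word('ApplesandBananas', words))
# Extra surrounding text does not matter either
print(contains_any_word('applesandbananasamongstothers', words))
# No word occurs as a contiguous run of characters, so this is rejected
print(contains_any_word('bana naan dapp les', words))
# None of the words occurs at all
print(contains_any_word('something else', words))
```

Because `in` tests for any contiguous occurrence, short words such as 'and' will also match inside unrelated strings like 'sandwich'. If that is a concern, very short words could be dropped from the word list before matching.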

Edit: multiple rows. You could also flatten the lists and use the first snippet.

full_str = [['apples and bananas', 'applesandbananasamongstothers', 'something else'], 
            ['ApplesandBananas', 'apples and Bananas', 'bananas']]
sub_str = [['apples and bananas'], ['apples and bananas']]

# Assuming full_str and sub_str have the same number of rows,
# and that the elements of full_str[k] are kept according to the sub strings in sub_str[k]
number_of_rows = len(full_str)
# Init outputs: one filtered list per row
outputs = list()
for k in range(number_of_rows):
    # Extract words from sub strings
    words_in_sub = [elt.split() for elt in sub_str[k]]
    # Flatten and remove duplicates
    words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))

    # Init output for this row
    output = list()
    # Loop on the strings in full string
    for full_s in full_str[k]:
        # Loop on the words to look for
        for word in words_in_sub:
            if word.lower() in full_s.lower():
                output.append(full_s)
                break

    # Store the result for this row so it is not overwritten by the next one
    outputs.append(output)
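
To adapt this back to the dataframe from the question, one option is to wrap the per-row logic in a function. The sketch below is an assumption about how that could look, not part of the original answer: `filter_row` is a hypothetical name, and it takes the substring as a single string, as stored in df['substring'].

```python
def filter_row(items, substring):
    """Keep the items that contain at least one word of substring (case-insensitive)."""
    # Lowercased, de-duplicated words of the substring
    words = {word.lower() for word in substring.split()}
    return [item for item in items
            if any(word in item.lower() for word in words)]


full_str = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
            ['ApplesandBananas', 'apples and Bananas', 'bananas']]
sub_str = ['apples and bananas', 'apples and bananas']

# One filtered list per row
outcome = [filter_row(items, sub) for items, sub in zip(full_str, sub_str)]
print(outcome)
```

With pandas, the same function could be used to build the new column, for example `df['outcome'] = [filter_row(items, sub) for items, sub in zip(df['full_string'], df['substring'])]`.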
