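Finding all words in a column using a custom function

Problem description

Background

The following question is a variation of "Unnest grab keywords/nextwords/beforewords function".

1) I have the following word_list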

word_list = ['crayons', 'cars', 'camels']

2) and df1

import pandas as pd

l = ['there are many crayons, in the blue box crayons that are',
     'cars! i like a lot of sports cars because they go fast',
     'the camels, in the middle east have many camels to ride ']
df1 = pd.DataFrame(l, columns=['Text'])

df1
         Text
0   there are many crayons, in the blue box crayons that are
1   cars! i like a lot of sports cars because they go fast
2   the camels, in the middle east have many camels to ride

3) I also have a function, find_next_words, which uses word_list to grab words from the Text column of df1

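def find_next_words(row, word_list):
    # Take the sentence from the Text column of the current row
    sentence = row['Text']
    # Splitting on whitespace leaves punctuation attached to the tokens
    words = sentence.split()

    trigger_words = []
    next_words = []

    for keyword in word_list:
        # The last token is skipped because no word follows it
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                # Collect up to two words after the keyword
                next_words.append(words[index + 1:index + 3])

    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])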

4) It is pieced together with the following

df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))

Output

    Text           TriggerWords        NextWords
0                   [crayons]        [[that, are]]
1                   [cars]           [[because, they]]
2                   [camels]         [[to, ride]]

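Problem

5) The output misses the following

crayons, from row 0 of the Text column of df1

cars! from row 1 of the Text column of df1

camels, from row 2 of the Text column of df1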
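
The misses come from str.split(), which splits on whitespace only and leaves punctuation attached to the token, so the exact comparison against the keyword fails. A minimal illustration using the first sentence (stripping with string.punctuation is shown only to make the cause visible, not as the fix chosen in the answer):

```python
import string

sentence = 'there are many crayons, in the blue box crayons that are'
words = sentence.split()

# The comma is still attached, so an equality test against the keyword fails
print(words[3])                # crayons,
print(words[3] == 'crayons')   # False

# Stripping surrounding punctuation makes the token comparable again
print(words[3].strip(string.punctuation) == 'crayons')  # True
```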

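Goal

6) Grab all the corresponding words from df1, even when the words in df1 vary slightly from the words in word_list, e.g. crayons, and cars!

(For this toy example, I know I could easily fix this by adding these word variants to the list: word_list = ['crayons,', 'crayons', 'cars!', 'cars', 'camels,', 'camels']. But that is impractical for my real word_list, which contains about 20K words.)

Desired output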

Text           TriggerWords              NextWords
0               [crayons, crayons]  [[in, the], [that, are]]
1               [cars, cars]        [[i, like], [because, they]]
2               [camels, camels]    [[in, the], [to, ride]]

Question

How can I adjust 1) my word_list (e.g. with a regex?) or 2) the find_next_words function to get my desired output?

Tags: regex, python-3.x, string, pandas, function

Solution

You can adjust your regex like this

\b(crayons|cars|camels)\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))
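
One way this pattern might be wired into pandas is sketched below. The helper name find_next_words_regex and the idea of building the alternation from word_list with re.escape are assumptions added for illustration; they are not part of the original answer.

```python
import re
import pandas as pd

word_list = ['crayons', 'cars', 'camels']
df1 = pd.DataFrame(
    ['there are many crayons, in the blue box crayons that are',
     'cars! i like a lot of sports cars because they go fast',
     'the camels, in the middle east have many camels to ride '],
    columns=['Text'])

# Build the keyword alternation from word_list (assumption: generated rather
# than typed by hand); re.escape guards against regex metacharacters.
keywords = '|'.join(re.escape(w) for w in word_list)
pattern = re.compile(
    r'\b(' + keywords + r')\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))')

def find_next_words_regex(text):
    # Hypothetical helper: each match is (keyword, next word, word after that)
    matches = pattern.findall(text)
    trigger_words = [m[0] for m in matches]
    # Drop empty captures, e.g. when the keyword ends the sentence
    next_words = [[w for w in m[1:] if w] for m in matches]
    return pd.Series([trigger_words, next_words],
                     index=['TriggerWords', 'NextWords'])

df2 = df1.join(df1['Text'].apply(find_next_words_regex))
print(df2[['TriggerWords', 'NextWords']])
```

Two caveats of the pattern as written: it only treats lowercase a-z as letters, so mixed-case text would need lowercasing first or the re.IGNORECASE flag, and a keyword that appears among the two captured next words is consumed by that match and will not be reported as a trigger of its own.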


