python - Using regex to search for patterns in a list inside a DataFrame and put the matches into new pandas columns

Question

I have a CSV file with a text column. Sample data is shown below:

text
['Hello world', 'Welcome to the universe.']
['Hey Hello world', 'I am learning Pandas Welcome to the universe.']
['Hello world how are you', 'Good Morning', 'I am learning Pandas.']
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']

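I want to iterate over each row and check for patterns:

If a sentence contains the pattern Hello or world, I want to put that sentence in a new column text1.
If a sentence contains the pattern Welcome or Universe, I want to put that sentence in a new column text2.

After searching for the patterns and filling the new columns, my output should look like this: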

text,text1,text2
['Hello world', 'Welcome to the universe.'],Hello world,Welcome to the universe.
['Hey Hello world', 'I am learning Pandas Welcome to the universe.'],Hey Hello world,I am learning Pandas Welcome to the universe.
['Hello world how are you', 'Good Morning', 'I am learning Pandas.'],Hello world how are you,None
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.'],None,Iam version 3.6 Welcome

Can anyone guide me on how to do this?

Tags: python, regex, pandas, string, dataframe

Answer

Starting from your DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'text': ["['Hello world', 'Welcome to the universe.']",
...                             "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
...                             "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
...                             "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']"]}, 
...                   index = [0, 1, 2, 3])
>>> df
    text
0   ['Hello world', 'Welcome to the universe.']
1   ['Hey Hello world', 'I am learning Pandas Welc...
2   ['Hello world how are you', 'Good Morning', 'I...
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...

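Each cell of the text column is a string that looks like a list, so we first convert it to a real list with eval and then apply two functions, find_substring_text1 and find_substring_text2. Note that eval executes arbitrary code, so for files you do not trust ast.literal_eval is the safer choice.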

>>> def find_substring_text1(row):
...     # Keep the sentences that contain 'Hello' or 'world'
...     return [s for s in row if any(k in s for k in ['Hello', 'world'])]
...
>>> def find_substring_text2(row):
...     # Keep the sentences that contain 'Welcome' or 'universe'
...     return [s for s in row if any(k in s for k in ['Welcome', 'universe'])]
...
>>> df['text1'] = df['text'].apply(eval).apply(find_substring_text1)
>>> df['text2'] = df['text'].apply(eval).apply(find_substring_text2)

This gives the expected result:

>>> df
    text                                                text1                       text2
0   ['Hello world', 'Welcome to the universe.']         [Hello world]               [Welcome to the universe.]
1   ['Hey Hello world', 'I am learning Pandas Welc...   [Hey Hello world]           [I am learning Pandas Welcome to the universe.]
2   ['Hello world how are you', 'Good Morning', 'I...   [Hello world how are you]   []
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...   []                          [Iam version 3.6 Welcome]
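
Since the question asks for regular expressions, the same filter can also be written with the re module. The following is a minimal sketch, not the method used above: the names PATTERN_TEXT1, PATTERN_TEXT2 and find_matches are hypothetical, ast.literal_eval is swapped in for eval so that only Python literals are parsed, and re.IGNORECASE on the second pattern is an assumption made because the question writes Universe while the data contains universe.

```python
import ast
import re

import pandas as pd

df = pd.DataFrame({'text': [
    "['Hello world', 'Welcome to the universe.']",
    "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
    "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
    "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']",
]})

# Hypothetical pattern names. \b restricts matches to whole words.
PATTERN_TEXT1 = re.compile(r'\b(?:Hello|world)\b')
# Assumption: case-insensitive, so both 'Universe' and 'universe' match.
PATTERN_TEXT2 = re.compile(r'\b(?:Welcome|universe)\b', flags=re.IGNORECASE)


def find_matches(sentences, pattern):
    """Return the sentences in which the compiled pattern occurs."""
    return [s for s in sentences if pattern.search(s)]


# Parse each string into a list once, then filter it with each pattern.
parsed = df['text'].apply(ast.literal_eval)
df['text1'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT1))
df['text2'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT2))
```

On the sample data this produces the same text1 and text2 lists as the substring version.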

If needed, we can also convert the lists to strings:

>>> df['text1'] = [','.join(map(str, l)) for l in df['text1']]
>>> df['text2'] = [','.join(map(str, l)) for l in df['text2']]
>>> df
    text                                                text1                    text2
0   ['Hello world', 'Welcome to the universe.']         Hello world              Welcome to the universe.
1   ['Hey Hello world', 'I am learning Pandas Welc...   Hey Hello world          I am learning Pandas Welcome to the universe.
2   ['Hello world how are you', 'Good Morning', 'I...   Hello world how are you 
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...                            Iam version 3.6 Welcome
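
The join above leaves an empty string where nothing matched, whereas the expected output in the question shows None. Here is a minimal sketch of one way to get that, assuming it is applied to the list-valued columns in place of the join. join_or_none is a hypothetical helper, and the small frame below only reproduces the lists from the previous step so that the example runs on its own.

```python
import pandas as pd

# Assumed starting point: the list-valued columns produced by the apply step.
df = pd.DataFrame({
    'text1': [['Hello world'], ['Hey Hello world'],
              ['Hello world how are you'], []],
    'text2': [['Welcome to the universe.'],
              ['I am learning Pandas Welcome to the universe.'],
              [], ['Iam version 3.6 Welcome']],
})


def join_or_none(matches, sep=','):
    """Join the matches into one string, or return None for an empty list."""
    return sep.join(matches) if matches else None


for col in ['text1', 'text2']:
    df[col] = df[col].apply(join_or_none)
```

Rows without a match then hold a missing value instead of an empty string, which can be tested with pd.isna.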
