Using a regular expression to search for patterns in a list inside a DataFrame and put the matches into new pandas columns

Problem description

I have a CSV file with a text column. Sample data is shown below:

text
['Hello world', 'Welcome to the universe.']
['Hey Hello world', 'I am learning Pandas Welcome to the universe.']
['Hello world how are you', 'Good Morning', 'I am learning Pandas.']
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']

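I want to iterate over each row and check for patterns:

If a sentence contains the pattern Hello or world, I want to put that sentence in a new column, text1.
If a sentence contains the pattern Welcome or Universe, I want to put that sentence in a new column, text2.

After searching for the patterns and filling the new columns, my output should look like this: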

text,text1,text2
['Hello world', 'Welcome to the universe.'],Hello world,Welcome to the universe.
['Hey Hello world', 'I am learning Pandas Welcome to the universe.'],Hey Hello world,I am learning Pandas Welcome to the universe.
['Hello world how are you', 'Good Morning', 'I am learning Pandas.'],Hello world how are you,None
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.'],None,Iam version 3.6 Welcome

Can anyone guide me on how to do this?

Tags: python, regex, pandas, string, dataframe

Solution

Starting from your DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'text': ["['Hello world', 'Welcome to the universe.']",
...                             "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
...                             "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
...                             "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']"]},
...                   index = [0, 1, 2, 3])
>>> df
    text
0   ['Hello world', 'Welcome to the universe.']
1   ['Hey Hello world', 'I am learning Pandas Welc...
2   ['Hello world how are you', 'Good Morning', 'I...
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...

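We can use apply with two functions, find_substring_text1 and find_substring_text2, on the text column, after first passing each value through eval so that the string becomes a list: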

>>> def find_substring_text1(row):
...     return [s for s in row if any(k in s for k in ['Hello', 'world'])]
    
>>> def find_substring_text2(row):
...     return [s for s in row if any(k in s for k in ['Welcome', 'universe'])]

>>> df['text1'] = df['text'].apply(eval).apply(find_substring_text1)
>>> df['text2'] = df['text'].apply(eval).apply(find_substring_text2)

We then get the expected result, with the matches held as lists:

>>> df
    text                                                text1                       text2
0   ['Hello world', 'Welcome to the universe.']         [Hello world]               [Welcome to the universe.]
1   ['Hey Hello world', 'I am learning Pandas Welc...   [Hey Hello world]           [I am learning Pandas Welcome to the universe.]
2   ['Hello world how are you', 'Good Morning', 'I...   [Hello world how are you]   []
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...   []                          [Iam version 3.6 Welcome]
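
The answer above relies on plain substring checks and on eval. Since the question asks about regular expressions, and writes one keyword as Universe while the data contains universe, here is a minimal sketch of a regex variant. It is not part of the original answer: the names PATTERN_TEXT1, PATTERN_TEXT2 and find_matches are hypothetical, the whole-word and case-insensitive matching are assumptions about what the asker wants, and ast.literal_eval is swapped in for eval because it only parses Python literals instead of executing arbitrary code.

```python
import ast
import re

import pandas as pd

df = pd.DataFrame({'text': [
    "['Hello world', 'Welcome to the universe.']",
    "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
    "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
    "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']",
]})

# Hypothetical patterns (assumption): \b limits matches to whole words,
# and re.IGNORECASE lets "Universe" match "universe".
PATTERN_TEXT1 = re.compile(r'\b(?:hello|world)\b', re.IGNORECASE)
PATTERN_TEXT2 = re.compile(r'\b(?:welcome|universe)\b', re.IGNORECASE)


def find_matches(sentences, pattern):
    """Return the sentences that contain at least one match for pattern."""
    return [s for s in sentences if pattern.search(s)]


# Parse each string into a real list once, then reuse it for both columns.
parsed = df['text'].apply(ast.literal_eval)
df['text1'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT1))
df['text2'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT2))
```

On the sample data this should select the same sentences as the substring version. The two can differ on other inputs: the substring check would accept a word such as worldwide, while the \b anchors in this sketch would not.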

If needed, we can also convert the lists to strings:

>>> df['text1'] = [','.join(map(str, l)) for l in df['text1']]
>>> df['text2'] = [','.join(map(str, l)) for l in df['text2']]
>>> df
    text                                                text1                    text2
0   ['Hello world', 'Welcome to the universe.']         Hello world              Welcome to the universe.
1   ['Hey Hello world', 'I am learning Pandas Welc...   Hey Hello world          I am learning Pandas Welcome to the universe.
2   ['Hello world how are you', 'Good Morning', 'I...   Hello world how are you 
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...                            Iam version 3.6 Welcome
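
Note that joining an empty list gives an empty string, whereas the desired output in the question shows None for rows without a match. A minimal sketch of one way to close that gap follows. The helper join_or_none and the small matched frame are hypothetical illustrations, not part of the original answer.

```python
import pandas as pd


def join_or_none(matches, sep=','):
    """Join matched sentences with sep, or return None when nothing matched."""
    return sep.join(matches) if matches else None


# Small stand-in (assumption) for the list-valued text1/text2 columns.
matched = pd.DataFrame({
    'text1': [['Hello world'], []],
    'text2': [[], ['Iam version 3.6 Welcome']],
})

for col in ['text1', 'text2']:
    matched[col] = matched[col].apply(join_or_none)
```

Rows without a match then hold a missing value rather than an empty string, so they can be found with isna() or filled later with fillna().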
