Extract all patterns from a pandas DataFrame column (Python 3)

Question

I am working in a Jupyter notebook (Python 3). I am trying to extract keywords from a list out of a pandas DataFrame column. The list will contain about 50 keywords.

Example:

import pandas as pd
import re

rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']

pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)

pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)

data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]

# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])

mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)

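With re_patt I can extract exact words, and the result is correct: for id 1 the output is algaecide, algaecid, algaecides. With re_patt2 I want to match the keywords anywhere, including inside strings such as 'ssssalgaecidllll', where the desired output is 'algaecid'. However, the re_patt2 output for id 1 is algaecid, algaecid, algaecid, whereas I want algaecide, algaecid, algaecides. I would appreciate any suggestions. Thank you in advance.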

Tags: python, regex

Answer


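You can change pattern2 so that it optionally matches any characters other than whitespace and commas, [^\s,]*, to the left and right of the keywords. Each match then covers the whole comma-separated token that contains a keyword.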

pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
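As a quick check of how [^\s,]* widens each match, here is a minimal sketch that applies the pattern to the text of row 3 using only the re module. The names token_patt, sample and tokens are illustrative and do not appear in the original answer.

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# [^\s,]* consumes any run of characters that are neither whitespace nor a comma,
# so each match grows to cover the whole comma-separated token around a keyword.
token_patt = re.compile(r"[^\s,]*(?:" + "|".join(rgx_words1) + r")[^\s,]*")

sample = 'ssssalgaecidllll, algaecides'
tokens = token_patt.findall(sample)
print(tokens)  # ['ssssalgaecidllll', 'algaecides']
```

Because the group is non-capturing, findall returns the full matched text rather than only the keyword.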

The code could look like this:

import pandas as pd
import re

rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']

# Whole-word matches only
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)

# Whole tokens that contain a keyword; raw strings keep \s from being
# treated as an invalid string escape
pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
re_patt2 = re.compile(pattern2)

data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
mydf = pd.DataFrame(data, columns = ['id', 'text'])

mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt, x['text']), axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2, x['text']), axis=1)

print(mydf)

Output

   id                                               text                            matches                           matches2
0   1  I, will, find, algaecide, dd, algaecid, algaec...  [algaecide, algaecid, algaecides]  [algaecide, algaecid, algaecides]
1   2                       fff, algaecid, dd, algaecide              [algaecid, algaecide]              [algaecid, algaecide]
2   3                       ssssalgaecidllll, algaecides                       [algaecides]     [ssssalgaecidllll, algaecides]
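The token pattern returns 'ssssalgaecidllll' as a whole, while the question asks for just 'algaecid' in that row. One possible variant, which is not part of the original answer, is to keep the plain alternation but try the longest keywords first, so that 'algaecides' and 'algaecide' are not cut short by 'algaecid'. The sketch below assumes that behavior is what is wanted; longest_first, keyword_patt and the keywords column are illustrative names.

```python
import re
import pandas as pd

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Try longer keywords first: regex alternation takes the first alternative
# that matches, so 'algaecid' would otherwise win inside 'algaecides'.
# re.escape guards against keywords that contain regex metacharacters.
longest_first = sorted(rgx_words1, key=len, reverse=True)
keyword_patt = re.compile("|".join(re.escape(w) for w in longest_first))

mydf = pd.DataFrame(
    [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'],
     [2, 'fff, algaecid, dd, algaecide'],
     [3, 'ssssalgaecidllll, algaecides']],
    columns=['id', 'text'],
)

# Each cell becomes the list of keywords found anywhere in the text
mydf['keywords'] = mydf['text'].apply(keyword_patt.findall)
print(mydf['keywords'].tolist())
```

With the sample data this yields algaecide, algaecid, algaecides for id 1 and algaecid, algaecides for id 3.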
