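python - Extract all patterns from a pandas DataFrame column (Python 3)
Problem description
I am using a Jupyter notebook (Python 3). I am trying to extract the keywords in my list from a pandas DataFrame column. The list will contain about 50 keywords.
Example: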
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)
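With re_patt I can extract the exact words and get the correct result: for id 1 the output is algaecide, algaecid, algaecides. With re_patt2 I want to match every occurrence of a keyword, including inside longer strings such as 'ssssalgaecidllll', where the desired output is 'algaecid'. However, the re_patt2 output for id 1 is algaecid, algaecid, algaecid, whereas I want algaecide, algaecid, algaecides. I would appreciate any suggestions. Thank you in advance.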
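Solution
You can change pattern2 so that it optionally matches any characters other than whitespace and commas, [^\s,]*, to the left and to the right of the keyword.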
pattern2 = r"[^\s,]*(?:"+'|'.join(rgx_words1)+r")[^\s,]*"
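To see what the widened pattern changes, here is a minimal sketch that runs both versions on one sample string using only the re module. The names alternation, plain and widened are illustrative, and the re.escape call is an addition not present in the original answer (it only matters if a keyword contains regex metacharacters).

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Assumption: escape each keyword so it is always matched literally.
alternation = '|'.join(re.escape(w) for w in rgx_words1)

# Keyword only, anywhere in the text (the question's pattern2).
plain = re.compile("(" + alternation + ")")
# Keyword plus any surrounding characters that are not whitespace or commas.
widened = re.compile(r"[^\s,]*(?:" + alternation + r")[^\s,]*")

text = 'ssssalgaecidllll, algaecides'
plain_matches = plain.findall(text)
widened_matches = widened.findall(text)

print(plain_matches)    # ['algaecid', 'algaecid']
print(widened_matches)  # ['ssssalgaecidllll', 'algaecides']
```

The plain pattern stops at the first alternative that matches, which is why 'algaecides' is reported as 'algaecid'. The widened pattern returns the whole comma-separated token that contains a keyword.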
The code might look like
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = r"[^\s,]*(?:"+'|'.join(rgx_words1)+r")[^\s,]*"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt, x['text']), axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2, x['text']), axis=1)
print(mydf)
Output
   id                                               text                            matches                           matches2
0   1  I, will, find, algaecide, dd, algaecid, algaec...  [algaecide, algaecid, algaecides]  [algaecide, algaecid, algaecides]
1   2                       fff, algaecid, dd, algaecide              [algaecid, algaecide]              [algaecid, algaecide]
2   3                       ssssalgaecidllll, algaecides                       [algaecides]     [ssssalgaecidllll, algaecides]
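Note that for id 3 this returns the whole token 'ssssalgaecidllll'. If the goal is instead to get the keyword itself out of a longer token, one possible alternative (a different technique from the answer above, sketched here as an assumption about the intent) is to keep the plain alternation but sort the keywords longest first, so that 'algaecides' is tried before 'algaecide' and 'algaecid'. The names by_length, longest_first, texts and results are illustrative.

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Regex alternation takes the first alternative that matches,
# so put the longest keywords first.
by_length = sorted(rgx_words1, key=len, reverse=True)
longest_first = re.compile("(" + '|'.join(re.escape(w) for w in by_length) + ")")

texts = [
    'I, will, find, algaecide, dd, algaecid, algaecides',
    'fff, algaecid, dd, algaecide',
    'ssssalgaecidllll, algaecides',
]
results = [longest_first.findall(t) for t in texts]

for row in results:
    print(row)
# ['algaecide', 'algaecid', 'algaecides']
# ['algaecid', 'algaecide']
# ['algaecid', 'algaecides']
```

The compiled pattern can be used in the DataFrame the same way as re_patt2, by passing it to re.findall inside the apply call.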