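regex - How can I extract the author name and publication date with a regular expression?
Question
I am trying to extract the author's name and the publication date from this HTML text.
This is what I have so far: (authorName) = (".......")
But this only works for this specific case, and I am looking for a general approach. Can I get any hints on how to solve this?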
teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";
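Answer
You can use this regex to capture the author name in group 1:
authorName\s+=\s+"([^"]*)"
This regex matches the literal text authorName, then one or more whitespace characters, then =, then one or more whitespace characters, then a double quote ". It then captures everything up to the next double quote and stores it in group 1, which you can read in Python with m.group(1).
The following Python code shows how to read the captured data from group 1: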
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
# Search for the authorName assignment; group 1 holds the quoted value.
m = re.search(r'authorName\s+=\s+"([^"]*)"', s)
if m:
    print(m.group(1))
This prints only the author name:
Heather Knight
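One caveat about the pattern above: `\s+` requires at least one whitespace character on each side of `=`. In the sample text, `omni_sourceSite ="sfgate"` has no space after `=`, so the same pattern applied to that variable would not match. A minimal sketch of a more tolerant variant, assuming optional whitespace is acceptable for your input (the shortened string `s` here is only for illustration):

```python
import re

# Shortened sample; note there is no space after "=" for sourceSite.
s = 'var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";'

# Original style: at least one whitespace character on both sides of "=".
strict = re.search(r'sourceSite\s+=\s+"([^"]*)"', s)

# Relaxed variant: whitespace around "=" is optional.
relaxed = re.search(r'sourceSite\s*=\s*"([^"]*)"', s)

print(strict)            # None, because nothing separates "=" and the quote
print(relaxed.group(1))  # sfgate
```

Using `\s*` costs nothing for the authorName case and avoids silent misses when the spacing in the page varies.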
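Edit: Thanks to Onyambu for pointing out the publication date.
As with authorName, you can take the regex above, replace authorName with publicationDate, and use the resulting regex to capture the publication date: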
publicationDate\s+=\s+"([^"]*)"
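Since the two patterns differ only in the variable name, one possible generalization is a small helper that builds the pattern from the name. This is a sketch, not part of the original answer: the function name `extract_var` is hypothetical, and it uses `\s*` rather than `\s+` so that spacing around `=` is optional.

```python
import re

def extract_var(text, name):
    """Return the double-quoted value assigned right after `name`, or None if absent."""
    # re.escape guards against regex metacharacters in the variable name.
    pattern = re.escape(name) + r'\s*=\s*"([^"]*)"'
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Shortened sample for illustration.
s = 'var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_authorName = "Heather Knight";'

print(extract_var(s, 'publicationDate'))  # 2019-01-25T12:00:00+00:00
print(extract_var(s, 'authorName'))       # Heather Knight
print(extract_var(s, 'editorName'))       # None
```

Returning None for a missing variable keeps the caller in charge of deciding whether an absent field is an error.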
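If you want to extract both with a single regex, you can use the one below. The (?i) flag makes the match case-insensitive, and the pattern assumes publicationDate appears before authorName in the text, as it does in the sample: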
(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"
Python code:
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
# Group 1 captures the publication date, group 2 the author name.
m = re.search(r'(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"', s)
if m:
    print('Publication Date:', m.group(1))
    print('Author Name:', m.group(2))
This prints:
Publication Date: 2019-01-25T12:00:00+00:00
Author Name: Heather Knight
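If the order of the variables is not guaranteed, an alternative is to collect every assignment into a dictionary and look fields up by name. The sketch below assumes each variable follows the `var omni_<name> = "<value>"` shape seen in the sample; the name `fields` is illustrative, and parsing the date with `datetime.fromisoformat` assumes Python 3.7 or later.

```python
import re
from datetime import datetime

# Shortened sample for illustration.
s = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
     'var omni_sourceSite ="sfgate";'
     'var omni_authorName = "Heather Knight";'
     'var omni_authorTitle = "";')

# With two groups, findall returns (name, value) tuples, which dict() accepts directly.
fields = dict(re.findall(r'var\s+omni_(\w+)\s*=\s*"([^"]*)"', s))

print(fields['authorName'])       # Heather Knight
print(fields['publicationDate'])  # 2019-01-25T12:00:00+00:00

# Optionally turn the ISO 8601 timestamp into a timezone-aware datetime.
published = datetime.fromisoformat(fields['publicationDate'])
print(published.year, published.month, published.day)  # 2019 1 25
```

This avoids the leading `.*` of the combined pattern and does not depend on which variable comes first.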