How to extract the author name and publication date with a regular expression?

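Question

I am trying to extract the author's name and the publication date from this HTML text.

This is what I have so far: (authorName) = (".......")

But this only works for this specific case, and I am looking for a general approach. Can I get any hints on how to solve this?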

teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";

Tags: regex, python-3.x, beautifulsoup

Answer

You can use this regex to capture the author name in group 1:

authorName\s+=\s+"([^"]*)"

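This regex matches authorName literally, then one or more whitespace characters, then =, then one or more whitespace characters, then a double quote ("). It then captures everything up to the next double quote and stores it in group 1, which you can read in Python with m.group(1).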
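
To make the pieces of the pattern easier to see, here is a minimal sketch that writes the same regex in verbose mode with one comment per part. The name AUTHOR_RE and the shortened input string are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE so that
# each part can carry its own comment. AUTHOR_RE is a hypothetical name.
AUTHOR_RE = re.compile(r'''
    authorName   # the literal text authorName
    \s+=\s+      # an equals sign with whitespace on both sides
    "([^"]*)"    # group 1: everything between the two double quotes
''', re.VERBOSE)

# Shortened excerpt of the input, for illustration only.
m = AUTHOR_RE.search('var omni_authorName = "Heather Knight";')
if m:
    print(m.group(1))  # Heather Knight
```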

The following Python code shows how to read the captured data from group 1:

import re

s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'

# Search for the authorName assignment; group 1 holds the quoted value
m = re.search(r'authorName\s+=\s+"([^"]*)"', s)
if m:
    print(m.group(1))

This prints only the author name:

Heather Knight
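
Since the question asks for a general approach, note that the pattern requires at least one whitespace character on each side of the equals sign. In the sample text, omni_sourceSite ="sfgate" has none after the equals sign, so a more tolerant variant could use \s* instead of \s+. A minimal sketch of the difference, using a shortened, illustrative excerpt of the input:

```python
import re

# Shortened excerpt of the input text, for illustration only.
sample = 'var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";'

# \s+ demands whitespace on both sides of "=", \s* makes it optional.
strict = re.search(r'sourceSite\s+=\s+"([^"]*)"', sample)
tolerant = re.search(r'sourceSite\s*=\s*"([^"]*)"', sample)

print(strict)             # None, because there is no space after "="
print(tolerant.group(1))  # sfgate
```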

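Edit: Thanks to Onyambu for pointing out the publication date.

As with authorName, you can take the regex above, replace authorName with publicationDate, and use the resulting regex to capture the publication date: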

publicationDate\s+=\s+"([^"]*)"
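
Because the two patterns differ only in the variable name, the substitution can be wrapped in a small helper. This is a sketch under assumptions: the function name extract_var and the shortened sample string are hypothetical and do not appear in the original answer.

```python
import re

def extract_var(name, text):
    """Return the quoted value assigned to a variable whose name ends with `name`, or None."""
    # re.escape guards against regex metacharacters in the variable name.
    m = re.search(re.escape(name) + r'\s+=\s+"([^"]*)"', text)
    return m.group(1) if m else None

# Shortened excerpt of the input text, for illustration only.
sample = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
          'var omni_authorName = "Heather Knight";')

print(extract_var('publicationDate', sample))  # 2019-01-25T12:00:00+00:00
print(extract_var('authorName', sample))       # Heather Knight
```

The helper returns None when the variable is missing, so the caller can test the result before using it.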

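If you want to extract both with a single regex, you can use the following one. Note that it expects publicationDate to appear before authorName in the text: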

(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"

Python code:

import re

s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'

# Group 1 captures the publication date, group 2 the author name
m = re.search(r'(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"', s)
if m:
    print('Publication Date:', m.group(1))
    print('Author Name:', m.group(2))

This prints:

Publication Date: 2019-01-25T12:00:00+00:00
Author Name: Heather Knight
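
The single regex above depends on the order of the two variables. As an alternative that does not, one could collect every var omni_... assignment into a dictionary with re.findall and look up the fields by name. This is a different technique from the one in the answer, shown only as a sketch. The sample string is a shortened excerpt, and datetime.fromisoformat is assumed to be available (Python 3.7 or later).

```python
import re
from datetime import datetime

# Shortened excerpt of the input text, for illustration only.
sample = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
          'var omni_sourceSite ="sfgate";'
          'var omni_authorName = "Heather Knight";')

# findall returns (name, value) tuples, which dict() turns into a mapping.
fields = dict(re.findall(r'var\s+omni_(\w+)\s*=\s*"([^"]*)"', sample))

print(fields['authorName'])  # Heather Knight

# Parse the ISO 8601 timestamp into a timezone-aware datetime.
published = datetime.fromisoformat(fields['publicationDate'])
print(published.year)        # 2019
```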
