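python - How do I get timestamps and user IDs from a list of strings in Python 3?
Question
I am trying to extract certain parts of the text from a list of strings. This is what the list looks like: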
'<rev revid="78273004" parentid="78127030" minor="" user="BF" timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]] usando [[Wikipedia:Monobook.js/Hot Cat|HotCat]]" />', '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" comment="template citazione; rinomina/fix nomi parametri; converto template cite xxx -> cita xxx; elimino parametri vuoti; fix formato data" />', '<rev revid="78054777" parentid="78054533" user="yk" timestamp="2016-01-11T20:50:39Z" comment="/* Voci correlate */ coll. esterni" />', ...
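I would like to extract the users and the timestamps into two separate arrays so that I can plot them separately.
What I have tried so far is to create two separate arrays and fill them with the users and the timestamps.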
import re
import requests

def GetRevisions(pageTitle):
    url = "https://it.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle
    revisions = []   # list of all accumulated revisions
    timestamps = []  # list of all accumulated timestamps
    users = []       # list of all accumulated users
    next = ''        # information for the next request
    while True:
        response = requests.get(url + next).text  # web request
        revisions += re.findall(r'<rev [^>]*>', response)  # adds all revisions from the current request to the list
        timestamps += re.findall(r'timestamp="\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}\w"', response)
        users += re.findall(r'user="\w"', response)
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:  # break the loop if 'continue' element missing
            break
        next = "&rvcontinue=" + cont.group(1)  # gets the revision Id from which to start the next request
    return timestamps, users

GetRevisions("Italia")
What I want to get is two arrays, one with the timestamps and the other with the users.
timestamps= [2016-01-19T17:33:57Z, 2016-01-15T05:33:33Z, ...]
users= [BF, Atar, ...]
(I want to establish a correlation between users and timestamps.)
However, I only get empty lists:
[], []
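I hope you can help me.
Answer
Have you tried parsing the text with BeautifulSoup?
You can parse your text as HTML tags and extract the attributes that matter to you in a simple loop: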
from bs4 import BeautifulSoup
## The text you refer to as list:
yourText = '''<rev revid="78273004" parentid="78127030" minor="" user="BF" timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]] usando [[Wikipedia:Monobook.js/Hot Cat|HotCat]]" />', '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" comment="template citazione; rinomina/fix nomi parametri; converto template cite xxx -> cita xxx; elimino parametri vuoti; fix formato data" />', '<rev revid="78054777" parentid="78054533" user="yk" timestamp="2016-01-11T20:50:39Z" comment="/* Voci correlate */ coll. esterni" />'''
### parse it with BeautifulSoup
soup = BeautifulSoup(yourText, 'html.parser')
users = []
timestamps = []
for rev in soup.find_all('rev'):
    users.append(rev.get('user'))
    timestamps.append(rev.get('timestamp'))
print(users)
print(timestamps)
['BF', 'Atar', 'yk']
['2016-01-19T17:33:57Z', '2016-01-15T05:33:33Z', '2016-01-11T20:50:39Z']
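Using your original code
With your original code, we only need to change the way the regular expressions capture the text. The logic I applied is:
- starts with timestamp= or user=;
- followed by a ";
- followed by any number of characters that are not ";
- ends with a " character.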
timestamps += re.findall('(?:timestamp=)"([^"]*)"', response)
users += re.findall('(?:user=)"([^"]*)"', response)
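As a quick check, the two patterns can be tried on sample strings without making any web request. This is only a sketch: the `sample` string is a shortened version of two of the `<rev>` elements from the question, and the variable names `sample` and `original_users` are introduced here for illustration.

```python
import re

# Shortened versions of two <rev> elements from the question,
# joined into one string to stand in for an API response.
sample = (
    '<rev revid="78273004" parentid="78127030" minor="" user="BF" '
    'timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]]" />'
    '<rev revid="78127030" parentid="78054777" user="Atar" '
    'timestamp="2016-01-15T05:33:33Z" comment="template citazione" />'
)

# [^"]* matches everything up to the closing quote, so values of any length
# are captured; the parentheses make findall return only the value.
timestamps = re.findall(r'(?:timestamp=)"([^"]*)"', sample)
users = re.findall(r'(?:user=)"([^"]*)"', sample)

# For comparison: the user pattern from the question allows exactly one
# word character between the quotes, so longer names never match.
original_users = re.findall(r'user="\w"', sample)

print(timestamps)      # ['2016-01-19T17:33:57Z', '2016-01-15T05:33:33Z']
print(users)           # ['BF', 'Atar']
print(original_users)  # []
```

The single `\w` in the question's user pattern is one reason the user list comes back empty for names longer than one character.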
import re
import requests

url = "https://it.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=Italia"
revisions = []   # list of all accumulated revisions
timestamps = []  # list of all accumulated timestamps
users = []       # list of all accumulated users
next = ''        # information for the next request
while True:
    response = requests.get(url + next).text  # web request
    revisions += re.findall(r'<rev [^>]*>', response)  # adds all revisions from the current request to the list
    timestamps += re.findall(r'(?:timestamp=)"([^"]*)"', response)
    users += re.findall(r'(?:user=)"([^"]*)"', response)
    cont = re.search('<continue rvcontinue="([^"]+)"', response)
    if not cont:  # break the loop if 'continue' element missing
        break
    next = "&rvcontinue=" + cont.group(1)  # gets the revision Id from which to start the next request
This produces two lists of 9968 elements each:
users[0:3]
Out[1]:
['U9POI57', 'SuperPierlu', 'Superchilum']
timestamps[0:3]
Out[2]:
['2019-07-24T22:15:23Z', '2019-07-24T16:09:59Z', '2019-07-24T12:40:24Z']
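Since the question asks for an association between users and timestamps, note that two independent findall calls only stay aligned if every `<rev>` element carries both attributes. A minimal sketch of one way to keep them paired is to search inside each `<rev>` tag separately. The function name `extract_pairs`, the `sample` string, and its revision IDs are hypothetical and made up for illustration; skipping incomplete revisions is a design assumption, not something the original answer does.

```python
import re

def extract_pairs(response):
    """Return (user, timestamp) tuples, one per <rev> tag that has both attributes."""
    pairs = []
    for tag in re.findall(r'<rev [^>]*>', response):
        # The leading space keeps the pattern from matching inside longer attribute names.
        user = re.search(r' user="([^"]*)"', tag)
        timestamp = re.search(r' timestamp="([^"]*)"', tag)
        if user and timestamp:  # skip revisions that lack either attribute
            pairs.append((user.group(1), timestamp.group(1)))
    return pairs

# Made-up response fragment; the second revision has no user attribute.
sample = (
    '<rev revid="1" parentid="0" user="BF" timestamp="2016-01-19T17:33:57Z" comment="a" />'
    '<rev revid="2" parentid="1" timestamp="2016-01-15T05:33:33Z" comment="b" />'
    '<rev revid="3" parentid="2" user="yk" timestamp="2016-01-11T20:50:39Z" comment="c" />'
)

pairs = extract_pairs(sample)
users = [user for user, _ in pairs]
timestamps = [timestamp for _, timestamp in pairs]

print(pairs)  # [('BF', '2016-01-19T17:33:57Z'), ('yk', '2016-01-11T20:50:39Z')]
```

Because both lists are built from the same tuples, `users[i]` and `timestamps[i]` always belong to the same revision.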
Edit
To keep only the date, without the time, you just need to replace the closing " of the match pattern with T:
timestamps += re.findall('(?:timestamp=)"([^"]*)T', response)
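If the goal is to plot activity over time, an alternative to trimming the string in the regex is to parse the full timestamps with the standard library and count revisions per day. This is only a sketch under the assumption that every timestamp has the `YYYY-MM-DDTHH:MM:SSZ` shape shown above; the `timestamps` list here is a small hand-picked sample, and the names `dates` and `per_day` are introduced for illustration.

```python
from collections import Counter
from datetime import datetime

# Small sample of timestamps in the format shown in the output above.
timestamps = [
    '2019-07-24T22:15:23Z',
    '2019-07-24T16:09:59Z',
    '2019-07-24T12:40:24Z',
    '2016-01-19T17:33:57Z',
]

# Parse each string and keep only the calendar date.
dates = [datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ').date() for ts in timestamps]

# Count how many revisions fall on each day.
per_day = Counter(dates)

for day, count in sorted(per_day.items()):
    print(day, count)
# 2016-01-19 1
# 2019-07-24 3
```

The sorted days and their counts can then be passed to whatever plotting library is in use as the x and y values.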