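How do I get timestamps and user IDs from a list of strings in Python 3?

Problem description

I am trying to extract certain parts of the text from a list of strings. This is what the list looks like: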

'<rev revid="78273004" parentid="78127030" minor="" user="BF" timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]] usando [[Wikipedia:Monobook.js/Hot Cat|HotCat]]" />', '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" comment="template citazione; rinomina/fix nomi parametri; converto template cite xxx -&gt; cita xxx; elimino parametri vuoti; fix formato data" />', '<rev revid="78054777" parentid="78054533" user="yk" timestamp="2016-01-11T20:50:39Z" comment="/* Voci correlate */  coll. esterni" />', ...

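I want to extract the users and the timestamps into two separate arrays so that I can plot them separately.

What I have tried so far is to create two arrays and fill them with the users and the timestamps.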

import re
import requests

def GetRevisions(pageTitle):
    url = "https://it.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle
    revisions = []                                        #list of all accumulated revisions
    timestamps = []                                       #list of all accumulated timestamps
    users = []                                            #list of all accumulated users
    next = ''                                             #information for the next request
    while True:
        response = requests.get(url + next).text     #web request
        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list
        timestamps += re.findall('timestamp="\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}\w"', response)
        users += re.findall('user="\w"', response)
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        next = "&rvcontinue=" + cont.group(1)             #gets the revision Id from which to start the next request

    return timestamps, users

GetRevisions("Italia")

What I would like to get is two arrays, one with the timestamps and the other with the users.

timestamps= [2016-01-19T17:33:57Z, 2016-01-15T05:33:33Z, ...]
users= [BF, Atar, ...]

(I want to establish a correlation between users and timestamps.)

However, I only get empty lists:

[], []

I hope you can help me.

Tags: python, python-3.x

Solution

Have you tried parsing the text with BeautifulSoup?

You can parse your text as HTML tags and extract the attributes that matter to you in a simple loop:

from bs4 import BeautifulSoup

## The text you refer to as list:
yourText = '''<rev revid="78273004" parentid="78127030" minor="" user="BF" timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]] usando [[Wikipedia:Monobook.js/Hot Cat|HotCat]]" />', '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" comment="template citazione; rinomina/fix nomi parametri; converto template cite xxx -&gt; cita xxx; elimino parametri vuoti; fix formato data" />', '<rev revid="78054777" parentid="78054533" user="yk" timestamp="2016-01-11T20:50:39Z" comment="/* Voci correlate */  coll. esterni" />'''


### parse it with BeautifulSoup
soup = BeautifulSoup(yourText, 'html.parser')
users = []
timestamps  = []
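for rev in soup.find_all('rev'):             # find_all is the current name for findAll
    users.append(rev.get('user'))            # value of the user attribute
    timestamps.append(rev.get('timestamp'))  # value of the timestamp attribute

print(users)
print(timestamps)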

['BF', 'Atar', 'yk']

['2016-01-19T17:33:57Z', '2016-01-15T05:33:33Z', '2016-01-11T20:50:39Z']
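
If the data really is a Python list with one `<rev ... />` string per item, as the question describes, the same loop can be sketched with only the standard library. This is an alternative to BeautifulSoup, not what the answer above uses: it assumes every item is a single well-formed XML element, and the name `rev_strings` is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question: one self-closing <rev /> element per item.
rev_strings = [
    '<rev revid="78273004" parentid="78127030" minor="" user="BF" timestamp="2016-01-19T17:33:57Z" comment="added [[Category:Politics]] usando [[Wikipedia:Monobook.js/Hot Cat|HotCat]]" />',
    '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" comment="template citazione; rinomina/fix nomi parametri; converto template cite xxx -&gt; cita xxx; elimino parametri vuoti; fix formato data" />',
    '<rev revid="78054777" parentid="78054533" user="yk" timestamp="2016-01-11T20:50:39Z" comment="/* Voci correlate */  coll. esterni" />',
]

users = []
timestamps = []
for rev_string in rev_strings:
    rev = ET.fromstring(rev_string)           # parse one element
    users.append(rev.get("user"))             # None if the attribute is absent
    timestamps.append(rev.get("timestamp"))

print(users)
print(timestamps)
```

Both lists are filled in the same pass, so the item at a given index in `users` and in `timestamps` comes from the same revision.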

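Using your original code

Using your original code, we only need to change the way the regular expressions capture the text. The logic I applied is:

  1. Start with timestamp= or user=
  2. followed by a "
  3. followed by any characters that are not "
  4. and end with a " character.

timestamps += re.findall('(?:timestamp=)"([^"]*)"', response)
users += re.findall('(?:user=)"([^"]*)"', response)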
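
To see what the change does, here is a small sketch that runs the question's patterns and the patterns above on the same text. The `response` string is a shortened sample built from two revisions in the question (the comment attributes are left out), not a real API response.

```python
import re

# Shortened sample based on two revisions from the question (comments omitted).
response = (
    '<rev revid="78273004" parentid="78127030" user="BF" timestamp="2016-01-19T17:33:57Z" />'
    '<rev revid="78127030" parentid="78054777" user="Atar" timestamp="2016-01-15T05:33:33Z" />'
)

# Patterns from the question. \w matches exactly one character, so user="\w"
# can only match one-letter user names. The timestamp pattern has no capture
# group, so findall returns the whole attribute, name and quotes included.
question_users = re.findall(r'user="\w"', response)
question_timestamps = re.findall(r'timestamp="\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}\w"', response)

# Patterns from the answer. The parentheses capture only what sits between the quotes.
answer_users = re.findall(r'(?:user=)"([^"]*)"', response)
answer_timestamps = re.findall(r'(?:timestamp=)"([^"]*)"', response)

print(question_users)
print(question_timestamps)
print(answer_users)
print(answer_timestamps)
```

On this sample the question's user pattern finds nothing, and its timestamp pattern returns strings such as `timestamp="2016-01-19T17:33:57Z"` rather than the bare value. The non-capturing group `(?:...)` in the answer's patterns is optional; `user="([^"]*)"` behaves the same way.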
The full loop with those two lines changed:

import re
import requests

url = "https://it.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=Italia"
revisions = []                                        #list of all accumulated revisions
timestamps = []                                       #list of all accumulated timestamps
users = []                                            #list of all accumulated users
next = ''                                             #information for the next request
while True:
    response = requests.get(url + next).text          #web request
    revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list
    timestamps += re.findall('(?:timestamp=)"([^"]*)"', response)
    users += re.findall('(?:user=)"([^"]*)"', response)
    cont = re.search('<continue rvcontinue="([^"]+)"', response)
    if not cont:                                      #break the loop if 'continue' element missing
        break

    next = "&rvcontinue=" + cont.group(1)             #gets the revision Id from which to start the next request

This produces two lists of 9968 elements each:

users[0:3]

Out[1]:
['U9POI57', 'SuperPierlu', 'Superchilum']

timestamps[0:3]

Out[2]:
['2019-07-24T22:15:23Z', '2019-07-24T16:09:59Z', '2019-07-24T12:40:24Z']
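
Since the goal is to relate users to timestamps, the two lists have to stay aligned index by index. Two independent `findall` calls only guarantee that when every `<rev>` element carries both attributes. A more defensive sketch, which is not part of the original answer, extracts both values from each `<rev ...>` tag in turn. The `response` fragment below is hypothetical, and its second revision deliberately has no `user` attribute.

```python
import re

# Hypothetical response fragment; the second revision has no user attribute.
response = (
    '<rev revid="3" user="BF" timestamp="2016-01-19T17:33:57Z" />'
    '<rev revid="2" timestamp="2016-01-15T05:33:33Z" />'
    '<rev revid="1" user="yk" timestamp="2016-01-11T20:50:39Z" />'
)

pairs = []
for rev in re.findall(r'<rev [^>]*>', response):        # one tag at a time
    user = re.search(r'\buser="([^"]*)"', rev)
    timestamp = re.search(r'\btimestamp="([^"]*)"', rev)
    pairs.append((
        user.group(1) if user else None,                # None marks a missing attribute
        timestamp.group(1) if timestamp else None,
    ))

users = [user for user, _ in pairs]
timestamps = [timestamp for _, timestamp in pairs]

print(pairs)
```

Here `users` and `timestamps` always have the same length, and a revision without a user shows up as `None` instead of shifting every later entry by one position.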

Edit

To keep only the date, without the time, you just need to change the end of the match pattern from " to T:

timestamps += re.findall('(?:timestamp=)"([^"]*)T', response)
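
If the full timestamps have already been collected, another option is to parse them with the standard `datetime` module and take the date afterwards, which also gives values that plotting libraries can usually handle directly. This is a sketch of that alternative, not part of the original answer; the sample values are the three timestamps shown in the output above.

```python
from datetime import datetime

# The three timestamps shown in the output above.
timestamps = ['2019-07-24T22:15:23Z', '2019-07-24T16:09:59Z', '2019-07-24T12:40:24Z']

# The literal T and Z in the format string match the same characters in the input.
parsed = [datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") for ts in timestamps]
dates = [dt.date() for dt in parsed]                  # drop the time part

print(dates[0].isoformat())
```

Note that `strptime` with a literal `Z` returns naive `datetime` objects, meaning no time zone is attached to them.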
