Python - Requesting a GZ file and parsing XML

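Problem description

A few days ago I started learning Python in order to build a basic site that compiles some statistics from BOINC projects (for example SETI@home).

Basically, the site will:

In total there are 34 .gz files, from 34 different BOINC projects.

All the code is now finished and working, but the .gz file from one project refuses to parse, while the other 34 work fine.

The file is: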

user.gz

http://www.rnaworld.de/rnaworld/stats/

These are the errors I get:

Traceback (most recent call last):
  File "C:/Users/chris/PycharmProjects/testproject1/rnaw100.py", line 77, in <module>
    for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
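
A ParseError at line 1, column 0 means the parser rejected the very first byte of the file, so it can help to look at what the file passed to ET.iterparse actually starts with. The following is only a minimal diagnostic sketch: the helper name `sniff_file` is hypothetical, and it classifies a file by its first bytes (gzip streams always begin with the bytes 0x1f 0x8b).

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream begins with these two bytes


def sniff_file(path):
    """Return 'gzip', 'xml' or 'unknown' based on the first bytes of the file.

    Hypothetical helper for diagnosis only; it does not validate the content.
    """
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # Skip a UTF-8 byte order mark, then any leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"
```

If this reports "gzip" for the file that was supposed to be XML, the data was probably compressed more than once (for instance once as a file and once more in transit). That is one possible explanation for the error, not a confirmed one.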

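As a newbie, I find it hard to understand what is going wrong, because (a) the errors point to Python core files such as ElementTree.py, (b) I don't understand why a .gz file that many other BOINC stats sites use would not work here, and (c) I don't see why my code works for 34 files but not for this one.

This is the code that downloads the .gz file and parses the XML (I have omitted the variable declarations etc.):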

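import gzip
import shutil
import xml.etree.ElementTree as ET

import requests

# url2, target_path2, x_file_name2, TEAMID, cnt and dic are declared elsewhere (omitted).

# Download the .gz file and save it to disk.
response = requests.get(url2, stream=True)

if response.status_code == 200:
    with open(target_path2, 'wb') as f:
        f.write(response.raw.read())

# Decompress the .gz file into an XML file.
with gzip.open(target_path2, 'rb') as f_in:
    with open(x_file_name2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Stream through the XML; remember each field until the user's teamid is seen.
for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):

    if elem.tag == "total_credit" and event == "end":
        tc = float(elem.text)
        elem.clear()

    if elem.tag == "expavg_credit" and event == "end":
        ac = float(elem.text)
        elem.clear()

    if elem.tag == "id" and event == "end":
        user_id = elem.text  # renamed from id to avoid shadowing the built-in
        elem.clear()

    if elem.tag == "cpid" and event == "end":
        cpid = elem.text
        elem.clear()

    if elem.tag == "name" and event == "end":
        name = elem.text
        elem.clear()
    teamid = TEAMID

    if elem.tag == "teamid" and event == "end":
        # Only users that belong to the team of interest are stored.
        if elem.text == TEAMID:
            cnt = cnt + 1
            dic[user_id] = {"Name": name, "CPID": cpid, "TC": tc, "AC": ac}
        elem.clear()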
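
The loop above only produces correct records if teamid arrives after the other fields of the same user. As a hedged alternative, the sketch below handles each record once its enclosing element has been fully parsed. It assumes every record is wrapped in a `<user>` element with the fields as direct children; this is inferred from the tag names used on this page, not verified against the actual file. The function name `collect_team_members` is hypothetical. It does not address the ParseError itself.

```python
import xml.etree.ElementTree as ET


def collect_team_members(source, team_id):
    """Return {user id: stats} for every <user> whose <teamid> equals team_id.

    source is a file name or a binary file object; team_id is a string.
    Assumes the fields are direct children of <user> (not verified).
    """
    members = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "user":
            continue
        if elem.findtext("teamid") == team_id:
            members[elem.findtext("id")] = {
                "Name": elem.findtext("name"),
                "CPID": elem.findtext("cpid"),
                # Missing or empty credit fields are treated as 0.
                "TC": float(elem.findtext("total_credit") or 0),
                "AC": float(elem.findtext("expavg_credit") or 0),
            }
        elem.clear()  # drop the finished user's children to limit memory use
    return members
```

Because a user is only examined at its own "end" event, all child fields are available at once and their order inside the record no longer matters.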

Tags: python, request, xml-parsing, celementtree, boinc

Solution


Another solution.

from simplified_scrapy import SimplifiedDoc, utils
import gzip

# Decompress user.gz and save the result as user.xml.
with gzip.open('user.gz', 'rb') as f_in:
    with open('user.xml', 'wb') as f_out:
        f_out.write(f_in.read())

# Load the XML text and select every user element.
html = utils.getFileContent('user.xml')
doc = SimplifiedDoc(html)
users = doc.selects('user')
for user in users:
    tags = user.children

@Chris I decompressed the file and saved it, and the data is correct. Try replacing your shutil code with this:

import gzip
with gzip.open('user.gz', 'rb') as f_in:
    with open('user.xml', 'wb') as f_out:
        f_out.write(f_in.read())
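
If the saved file still fails to parse after one round of decompression, one possibility (not confirmed anywhere on this page) is that the downloaded bytes were gzip-compressed more than once. A minimal sketch under that assumption: the hypothetical helper `gunzip_fully` keeps decompressing while the data still starts with the gzip magic bytes, and `count_users` shows the result being fed to ET.iterparse from memory.

```python
import gzip
import io
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_fully(data, max_layers=5):
    """Decompress bytes until they no longer start with the gzip magic bytes.

    max_layers is an arbitrary safety limit chosen for this sketch.
    """
    layers = 0
    while data.startswith(GZIP_MAGIC):
        if layers == max_layers:
            raise ValueError("too many nested gzip layers")
        data = gzip.decompress(data)
        layers += 1
    return data


def count_users(xml_bytes):
    """Count <user> elements by streaming over XML held in memory."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "user":
            count += 1
            elem.clear()
    return count
```

For example, reading user.gz in binary mode and passing its contents to gunzip_fully returns the XML bytes whether the file was compressed once or twice. Like the snippet above, this loads the whole file into memory, which is fine for moderately sized files.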
