xml.etree.ElementTree.ParseError: not well-formed (invalid token)

Question

Using Python 3.

The error we get:

File "C:/scratch.py", line 27, in run
    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
  File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106

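Our code:

import xml.etree.ElementTree as ET

# responses[0] holds the raw bytes returned for the feed URL
tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
for i in tree.iter('item'):
    try:
        title = i.find('title').text
    except Exception:
        # e.g. an item with no <title> element
        pass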

responses[0] comes from a list of responses to URL GET requests; in this case index 0 is a test against one specific URL: http://feeds.feedburner.com/marginalrevolution/feed

We pasted the XML into the W3 Schools validator and got:

This page contains the following errors:
error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67

But with ET.XMLParser(encoding='utf-8') specified, shouldn't that resolve the error at parse time?

Tags: python, python-3.x, xml-parsing

Answer


The W3 Schools validator's error message is misleading. The problem with 0x0c is not that it is invalid UTF-8, but that it is not a legal character in XML.

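0x0c is the form feed control character, so its presence in the document serves no purpose. XML 1.0 permits only tab, line feed, and carriage return among the control characters below 0x20, so a form feed makes the document not well-formed. A conforming XML parser is obliged to reject a document that is not well-formed, and you cannot change the RSS feed, so the simplest solution is to remove the character from the document before processing it.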

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
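
If the feed might contain other stray control characters besides the form feed, the same idea can be generalized. The following is only a sketch, not part of the original answer: the helper name parse_xml_lenient, the regular expression, and the sample document are all hypothetical, and it assumes the response is UTF-8 (or another ASCII-compatible encoding), so that byte values below 0x20 can only be control characters.

```python
import re
import xml.etree.ElementTree as ET

# C0 control bytes that XML 1.0 never allows: everything below 0x20
# except tab (0x09), line feed (0x0a) and carriage return (0x0d).
# Assumption: the input is UTF-8, where these byte values never occur
# inside a multi-byte sequence.
_ILLEGAL_XML_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def parse_xml_lenient(raw: bytes) -> ET.Element:
    """Hypothetical helper: drop illegal control bytes, then parse."""
    cleaned = _ILLEGAL_XML_BYTES.sub(b'', raw)
    return ET.fromstring(cleaned)


# Made-up sample standing in for the real feed: a form feed inside a title.
sample = b'<rss><channel><item><title>A\x0cB</title></item></channel></rss>'

root = parse_xml_lenient(sample)
titles = [item.findtext('title') for item in root.iter('item')]
print(titles)
```

Silently deleting bytes changes the document, so this is probably only appropriate for feeds you cannot fix at the source; logging how many bytes were removed may be worth adding.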
