Removing the first part of an XML file: cannot serialize

Problem description

I have an XML file that begins like this:

'''some non ascii character'''
<b:FatturaElettronica xmlns:b="#">
  <FatturaElettronicaHeader>
    <DatiTrasmissione>
      <IdTrasmittente>
        <IdPaese>IT</IdPaese>

I need to remove everything up to

<FatturaElettronicaHeader>

My current code is:

import xml.etree.ElementTree as ET
import xml.etree.ElementTree as ETree
from lxml import etree

parser = etree.XMLParser(encoding='utf-8', recover=True, remove_comments=True, resolve_entities=False)
tree = ETree.parse('test.xml', parser)

root = tree.getroot()

print etree.tostring(root)

which gives me:

Traceback (most recent call last):
  File "xml2.py", line 14, in <module>
    print etree.tostring(root)
  File "src/lxml/etree.pyx", line 3350, in lxml.etree.tostring
TypeError: Type 'NoneType' cannot be serialized.

If I remove the first part of the XML file by hand, the code works.

Tags: python, django, xml, parsing, pkcs#7

Solution


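You can use the string find() method to locate the first angle bracket (<), then parse only the part of the file from that position onward. This assumes the leading junk does not itself contain a < character.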

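import xml.etree.ElementTree as ET

# Read the whole file into a string
with open('...XMLFILE.xml', 'r') as file:
    filestring = file.read()

# Index of the first '<', i.e. where the XML document starts
XML_start = filestring.find('<')
print(XML_start)  # gives 31

# Parse only the text from the first '<' onward; fromstring() returns the root element
tree = ET.fromstring(filestring[XML_start:])

# Walk the root and all of its descendants
for i in tree.iter():
    print(i.tag)  # gives {#}FatturaElettronica, FatturaElettronicaHeader, ...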

Your XML file must also be well-formed, with every opened tag closed:

'''some non ascii character'''
<b:FatturaElettronica xmlns:b="#">
  <FatturaElettronicaHeader>
    <DatiTrasmissione>
      <IdTrasmittente>
        <IdPaese>IT</IdPaese>
      </IdTrasmittente>
    </DatiTrasmissione>
  </FatturaElettronicaHeader>
</b:FatturaElettronica>
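
Since the question asks for everything before <FatturaElettronicaHeader> to be dropped, the same find() idea can be taken one step further. The following is only a minimal sketch under stated assumptions: the sample bytes in raw and the helper name extract_header are hypothetical, the junk prefix is assumed to contain no < character, and nothing is assumed to follow the closing root tag. It reads the data as bytes, so non-ASCII junk at the start cannot cause a decoding error.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: a few junk bytes followed by a well-formed document.
raw = (
    b"\xff\xfesome junk\n"
    b'<b:FatturaElettronica xmlns:b="#">'
    b"<FatturaElettronicaHeader>"
    b"<DatiTrasmissione><IdTrasmittente><IdPaese>IT</IdPaese>"
    b"</IdTrasmittente></DatiTrasmissione>"
    b"</FatturaElettronicaHeader>"
    b"</b:FatturaElettronica>"
)


def extract_header(data: bytes) -> bytes:
    """Hypothetical helper: return the serialized FatturaElettronicaHeader element."""
    # Skip everything before the first '<' (assumes the junk contains no '<').
    start = data.find(b"<")
    if start == -1:
        raise ValueError("no '<' found in input")

    # Parse the remainder; this raises ParseError if the XML is not well-formed.
    root = ET.fromstring(data[start:])

    # The header element carries no namespace prefix, so a plain tag name matches.
    header = root.find("FatturaElettronicaHeader")
    if header is None:
        raise ValueError("FatturaElettronicaHeader not found")

    # Serialize only the header subtree.
    return ET.tostring(header)


header_xml = extract_header(raw)
print(header_xml.decode("ascii"))
```

Checking the result of find() for None before serializing avoids passing a missing element on to tostring().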
