Memory error when fixing malformed XML

Problem description

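I have some very large malformed XML: it is missing a top-level tag and has duplicate attributes. To fix this, I tested the following solution on a subset of my malformed XML, and it works perfectly, adding the tag and removing the duplicates with BeautifulSoup.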

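import sys
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

# Read the whole input into memory (this is what fails on the real file)
flow_file = sys.stdin.read()

try:
    # If the input already parses, pass it through unchanged
    tree = ET.fromstring(flow_file)
    sys.stdout.write(flow_file)
except ET.ParseError:
    # Otherwise wrap it in a top-level tag and let BeautifulSoup repair it
    flow_file = f"<dispatch>{flow_file}</dispatch>"
    soup = BeautifulSoup(flow_file, 'xml')
    sys.stdout.write(str(soup))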

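However, because my real file is so large, this raises a memory error. Since I need (AFAICT) the complete XML in order to add the top-level tag and remove the duplicates, I'm not sure how to modify my code to handle XML this large. I've seen some suggestions to use lxml and iteration, but it's not clear to me how that fits my needs/flow.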

ETA: Not sure if it helps, but the point of this is to clean up the file so that it can be run through NiFi's SplitXML processor.

Tags: python, beautifulsoup, lxml, elementtree, large-files

Solution


Since I don't really know what you are doing with the data, here is the approach I take with XML files that are several GB in size:

import xml.etree.ElementTree as etree

root = None
# iterparse the file, asking for both "start" and "end" events
for event, elem in etree.iterparse("my_big.xml", events=("start", "end")):
    # Remember the first element as the root, so we can clear it later
    if event == "start" and root is None:
        root = elem
    # Here we look for a certain end tag
    if event == "end" and elem.tag == "TAGOFINTEREST":
        # found starts as False so we can break as soon as we have our data of interest
        found = False
        # Iterate over the child nodes
        # (HERE I guess you would work with pandas)
        for stuff in elem:
            # If found is True, stop iterating (or use whatever condition you have)
            if found:
                break
            # Look for what you need, then set found to True
            found = True
        # Clear elem in order to save RAM
        elem.clear()

    # Might require revision: clears RAM after every "end" event
    if event == "end":
        root.clear()
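
For the missing top-level tag specifically, the same incremental idea can be applied without reading the whole file first. What follows is only a sketch under assumptions of mine: the function name iter_wrapped, its parameters, and the chunk size are hypothetical, the input is a text-mode stream such as sys.stdin, and the records of interest are direct children of the synthetic <dispatch> wrapper. It uses the standard library's XMLPullParser, which (like iterparse) raises a ParseError on duplicate attributes, so it addresses only the missing root.

```python
import xml.etree.ElementTree as ET


def iter_wrapped(stream, tag, wrapper="dispatch", chunk_size=64 * 1024):
    """Yield each <tag> element from a rootless XML text stream, one at a time."""
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None

    def drain():
        # Hand out every finished element the parser has produced so far
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the synthetic wrapper element
            elif elem.tag == tag:
                yield elem
                # Drop finished children of the wrapper to free memory
                root.clear()

    # Feed the opening wrapper tag, then the file in chunks, then the closing tag
    parser.feed(f"<{wrapper}>")
    yield from drain()
    while chunk := stream.read(chunk_size):
        parser.feed(chunk)
        yield from drain()
    parser.feed(f"</{wrapper}>")
    parser.close()
    # Some parser versions only release events at close, so drain once more
    yield from drain()
```

Used as `for rec in iter_wrapped(sys.stdin, "TAGOFINTEREST"): ...`, memory should stay roughly proportional to one record plus one chunk. If the file starts with an XML declaration, it would have to be skipped before wrapping. lxml's parsers have a recover option that may tolerate the duplicate attributes; that is worth checking against its documentation.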

I hope this helps.
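
The duplicate attributes from the question are a separate problem, because a strict parser rejects them before any element is handed back. One possible pre-pass, sketched here with hypothetical names (dedupe_attributes, dedupe_tag), rewrites each start tag with a regular expression and keeps the first occurrence of every attribute name. The assumptions are mine: keeping the first value is the desired behavior, no comment or CDATA section contains text that looks like a start tag, and each tag lies entirely within the string passed in.

```python
import re

# One name="value" or name='value' pair inside a start tag
ATTR = re.compile(r"""\s+([^\s=<>/"']+)\s*=\s*("[^"]*"|'[^']*')""")
# A start tag or empty-element tag: name, attribute run, optional trailing slash
START_TAG = re.compile(
    r"""<([A-Za-z_][\w.:-]*)((?:\s+[^\s=<>/"']+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(/?)>"""
)


def dedupe_tag(match):
    """Rebuild one start tag, keeping only the first value of each attribute."""
    name, attrs, slash = match.group(1), match.group(2), match.group(3)
    seen = set()
    kept = []
    for attr in ATTR.finditer(attrs):
        if attr.group(1) not in seen:
            seen.add(attr.group(1))
            kept.append(f" {attr.group(1)}={attr.group(2)}")
    return f"<{name}{''.join(kept)}{slash}>"


def dedupe_attributes(text):
    """Remove repeated attributes from every start tag found in text."""
    return START_TAG.sub(dedupe_tag, text)
```

If no tag spans a line break in the real file, this could be applied line by line while copying stdin to stdout, which keeps memory bounded by the longest line. Note that whitespace inside the rewritten tags is normalized to single spaces.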

