Python XML ElementTree: parsing a large document returns only a subset

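Problem description

I have a large XML document containing German text, and iterating from the root returns only a subset of its elements.

root.iter('tu') finds only 82 elements.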

import logging
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

class Extractor(object):
    def _get_iter(self, filename: str):
        with open(filename) as objects:
            context = ET.iterparse(objects, events=("start", "end"))

            index, (event, root) = next(enumerate(context))

            return root.iter('tu')

    def get_objects(self, filename: str, limit=-1):
        found = sum(1 for _ in self._get_iter(filename))
        logging.getLogger(__name__).info('found: {}'.format(found))

# found is 82; the actual number is in the millions

alignments = extractor.get_alignments('data/file.tmx', 100000)

Update: sample TMX file: https://pastebin.com/kUFMMjck

Update: solved it by checking the event and the tag name (tu) while iterating. I suppose root.iter() was behaving incorrectly.

Tags: python, xml

Solution


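root.iter('tagname') behaves unexpectedly here: it does not work like a streaming iterator. It only walks the part of the document that has already been parsed, and after the first iterparse event that is apparently just the first chunk of the file, which is why only a subset of the tu elements is found.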
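A small experiment makes this visible. The sketch below is only an illustration: the document is synthetic, and the helper names build_sample and count_partial_and_total as well as the size of 100,000 elements are assumptions, not part of the original question. It counts the tu elements that root.iter('tu') sees right after the first event, then again once all events have been consumed.

```python
import io
import xml.etree.ElementTree as ET


def build_sample(n: int) -> bytes:
    """Build a TMX-like document with n <tu> elements (synthetic data)."""
    units = "".join(f"<tu><seg>text {i}</seg></tu>" for i in range(n))
    return f"<tmx><body>{units}</body></tmx>".encode("utf-8")


def count_partial_and_total(data: bytes):
    context = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    # The first event is the "start" event of the root element.
    _, root = next(context)
    # Only the elements parsed so far are attached to the tree.
    partial = sum(1 for _ in root.iter("tu"))
    # Drain the remaining events so the whole document gets parsed.
    for _ in context:
        pass
    total = sum(1 for _ in root.iter("tu"))
    return partial, total


partial, total = count_partial_and_total(build_sample(100_000))
print(partial, total)
```

The first number should be much smaller than the second, because iterparse reads the source in chunks and the tree grows only as events are consumed.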

The solution is:

import xml.etree.ElementTree as ET

class Extractor(object):
    def get_objects(self, filename: str):

        # iterparse returns an iterator that yields (event, elem) pairs
        # while the file is being read
        context = ET.iterparse(filename, events=("start", "end"))

        for event, elem in context:
            # an element is complete only once its "end" event arrives
            if event == "end" and elem.tag == "tu":
                # do something with elem
                elem.clear()  # free the element's contents after processing
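The same idea can be wrapped in a generator so that callers process one tu element at a time. This is a minimal sketch, not the original author's code: the function name iter_units, the seg child elements, and the inline sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny TMX-like sample (hypothetical data, for illustration only).
SAMPLE = b"""<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Guten Tag</seg></tuv>
      <tuv xml:lang="en"><seg>Good day</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="de"><seg>Danke</seg></tuv>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
    </tu>
  </body>
</tmx>"""


def iter_units(source, tag="tu"):
    """Yield the <seg> texts of every <tag> element as it is completed."""
    # Only "end" events are needed: an element is complete when it closes.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield [seg.text for seg in elem.iter("seg")]
            # Drop the element's children and text once the caller is done.
            elem.clear()


units = list(iter_units(io.BytesIO(SAMPLE)))
print(units)
```

Because iter_units is lazy, a limit like the one in the question can be applied with itertools.islice(iter_units(path), limit) without reading the rest of the file.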

