How to efficiently iterate over an XML file in Python with the spaCy phrase matcher

Question

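I am trying to iterate over a Japanese-to-English dictionary that is stored as an XML file. I don't need all of it; all I need is to be able to pick a part of speech (POS) and go through every entry that carries that POS tag: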

<pos>&n;</pos>
<pos>&vs;</pos>

More details are available in the XML document type declaration (DTD).

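Now I would like to know the best way to go through all entries with a given POS. This may change, but I am only interested in extracting certain parts, probably these: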

<k_ele>
<keb>収集</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf05</ke_pri>
</k_ele>
<k_ele>
<keb>蒐集</keb>
</k_ele>
<k_ele>
<keb>拾集</keb>
</k_ele>
<k_ele>
<keb>収輯</keb>
</k_ele>

Some pseudocode:

For all Ichidan verbs in the XML file:
ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kanji_forms])
ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kana_forms])

Possibly with an option to ignore okurigana.

What is the most efficient way to do this? There are tens of thousands of entries. Thank you very much.

Edit: my attempt based on the suggested solution:

import xml.etree.ElementTree as ET

import spacy

# Assumption: a blank Japanese pipeline with an entity ruler (spaCy v3 API);
# spaCy's Japanese support must be installed for spacy.blank("ja") to work
nlp = spacy.blank("ja")
ruler = nlp.add_pipe("entity_ruler")

path = r"C:\Users\NameRedacted\Desktop\JMdict"
tree = ET.parse(path)

print("Search the entire tree for entries with '&n;' pos")

# "noun (common) (futsuumeishi)" must be used instead of the entity version "&n;" as defined in the DTD
patterns = []
for entry in tree.findall("./entry/sense[pos='noun (common) (futsuumeishi)']/.."):
    for keb in entry.findall("./k_ele/keb"):
        # The pattern must be the element's text, not the Element object itself
        patterns.append({"label": "NOUNS", "pattern": keb.text})
    for reb in entry.findall("./r_ele/reb"):
        # reb elements live under r_ele, so they are looked up from the entry, not from k_ele
        patterns.append({"label": "NOUNS", "pattern": reb.text})

# Add every pattern in a single call instead of one call per word
ruler.add_patterns(patterns)

Tags: python, xml, spacy

Answer


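The simplest approach is to parse the XML file into an in-memory tree and use XPath to find the elements you need. This requires enough memory to hold the whole tree, but it lets you query the tree as many times as you like.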

Example:

import xml.etree.ElementTree as ET

tree = ET.parse('JMdict_e')

print("Search the entire tree for entries with '&n;' pos")

# "noun (common) (futsuumeishi)" must be used instead of the entity version "&n;" as defined in the DTD
for entry in tree.findall("./entry/sense[pos='noun (common) (futsuumeishi)']/.."):
  # Do something with every entry
  for k_ele in entry.findall("./k_ele"):
    # Do something with every k_ele of the entry
    for keb in k_ele.findall("./keb"):
      # Do something with every keb of the k_ele
      pass
    for ke_pri in k_ele.findall("./ke_pri"):
      # Do something with every ke_pri of the k_ele
      pass

# Delete the tree when no longer needed to release the memory
del tree
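
To tie this back to the question, the "do something" steps could collect the text of every keb and reb into a list of pattern dictionaries. The sketch below is only an illustration under stated assumptions: it runs on a tiny inline sample instead of the real JMdict file, the helper name collect_patterns is hypothetical, and the two sample entries and their entity texts are made up for the demo.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for JMdict (assumption: the real file follows the same layout)
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY v1 "Ichidan verb">
]>
<JMdict>
<entry>
<k_ele><keb>収集</keb></k_ele>
<k_ele><keb>蒐集</keb></k_ele>
<r_ele><reb>しゅうしゅう</reb></r_ele>
<sense><pos>&n;</pos><gloss>collection</gloss></sense>
</entry>
<entry>
<k_ele><keb>食べる</keb></k_ele>
<r_ele><reb>たべる</reb></r_ele>
<sense><pos>&v1;</pos><gloss>to eat</gloss></sense>
</entry>
</JMdict>
"""


def collect_patterns(root, pos_text, label):
    """Build one pattern dict per kanji and kana form of entries with pos_text.

    Hypothetical helper; pos_text must not contain a single quote because it
    is pasted into the XPath predicate as-is.
    """
    patterns = []
    for entry in root.findall(f"./entry/sense[pos='{pos_text}']/.."):
        forms = entry.findall("./k_ele/keb") + entry.findall("./r_ele/reb")
        for form in forms:
            patterns.append({"label": label, "pattern": form.text})
    return patterns


root = ET.fromstring(SAMPLE)
noun_patterns = collect_patterns(root, "noun (common) (futsuumeishi)", "NOUNS")
verb_patterns = collect_patterns(root, "Ichidan verb", "ICHIDANVERB")
print(noun_patterns)
print(verb_patterns)
```

If a spaCy entity ruler is available as ruler, the resulting list can be passed to ruler.add_patterns in a single call, which avoids calling it once per word.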

The documentation for xml.etree.ElementTree lists the XPath syntax it supports.
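
As a small illustration of why the query compares against the expanded text rather than "&n;", the following sketch parses an inline sample whose DTD declares two entities. The sample document and the text given for the vs entity are assumptions for the demo; only the behaviour of the standard-library parser is being shown.

```python
import xml.etree.ElementTree as ET

# Inline sample with an internal DTD subset (assumption: JMdict declares its
# POS entities in the same way)
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY vs "noun or participle which takes the aux. verb suru">
]>
<JMdict>
<entry>
<k_ele><keb>収集</keb></k_ele>
<sense><pos>&n;</pos><pos>&vs;</pos></sense>
</entry>
</JMdict>
"""

root = ET.fromstring(SAMPLE)

# The parser replaces each entity reference with its declared text
pos_texts = [pos.text for pos in root.iter("pos")]
print(pos_texts)

# Querying for the raw entity name finds nothing; the expanded text matches
by_entity = root.findall("./entry/sense[pos='&n;']")
by_text = root.findall("./entry/sense[pos='noun (common) (futsuumeishi)']")
print(len(by_entity), len(by_text))
```

The [pos='text'] predicate used here is one of the forms covered in that documentation: it selects elements that have a pos child with exactly that text.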

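A demo is available in a Colab notebook. In that test, which used a 51 MB XML file (English translations only), memory use grew by about 500 MB once the file had been parsed into an in-memory tree. Parsing the file took about 4 seconds and querying the tree about 3 seconds.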
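
If that much extra memory is a problem, a different technique from the one above is to stream the file with xml.etree.ElementTree.iterparse and empty each entry once it has been handled. This is a sketch of that alternative, not what the demo measured: the function name iter_forms_for_pos is hypothetical, and the example reads a tiny inline sample through io.BytesIO where a real run would pass the path of the JMdict file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for JMdict (assumption: same layout as the real file)
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY v1 "Ichidan verb">
]>
<JMdict>
<entry>
<k_ele><keb>収集</keb></k_ele>
<r_ele><reb>しゅうしゅう</reb></r_ele>
<sense><pos>&n;</pos></sense>
</entry>
<entry>
<k_ele><keb>食べる</keb></k_ele>
<r_ele><reb>たべる</reb></r_ele>
<sense><pos>&v1;</pos></sense>
</entry>
</JMdict>
"""


def iter_forms_for_pos(source, pos_text):
    """Yield (keb_texts, reb_texts) for each entry that has a sense with pos_text."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        if any(pos.text == pos_text for pos in elem.iterfind("./sense/pos")):
            kebs = [keb.text for keb in elem.iterfind("./k_ele/keb")]
            rebs = [reb.text for reb in elem.iterfind("./r_ele/reb")]
            yield kebs, rebs
        # Drop the entry's children so they can be garbage-collected; the
        # emptied entry element itself stays attached to the root
        elem.clear()


results = list(iter_forms_for_pos(io.BytesIO(SAMPLE.encode("utf-8")), "Ichidan verb"))
print(results)
```

The trade-off is that every new query means reading the file again, so the in-memory tree remains the more convenient choice when the file is queried many times.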

