Extract information from my XML files and assign a vector to it

Problem description

I want to parse some XML files on my computer with Python and extract some information from each file.

This is one of my XML files:

(If you want it as text, it is here: https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b002.xml )

As a first step, I have already done this:

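import os
import xml.etree.ElementTree as ET

# `path` is the directory that holds the XML files (defined elsewhere)
myList = []  # one parsed ElementTree per XML file
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)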

myList now holds several parsed XML trees, each shown as something like <xml.etree.ElementTree.ElementTree at 0x1f0fb1f8430>.

Now, for the "edge" elements whose type is not "seg":

  <edge id="c1" src="a1" trg="a3" type="sup"/>
  <edge id="c2" src="a2" trg="a3" type="sup"/>
  <edge id="c4" src="a4" trg="a3" type="reb"/>
  <edge id="c5" src="a5" trg="c4" type="und"/>

I want to extract the value of the "src" attribute from each of them:

  src="a1"  
  src="a2"  
  src="a4" 
  src="a5" 

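Then I want to flag the ID that does not appear in any src, because that sentence is called the premise. Here, for example, I want to say that "a3" is the so-called "premise" (because it never appears as a src).

(0,0,1,0,0) should be the result of my process: since a3 is not used as a src, I set the third element to 1 and the rest to zero.

In general, I want to extract information to annotate my texts, which have already been partly annotated in XML.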

Tags: python, xml, nlp

Solution


Not everything in your question is clear...
Below is the data extraction part.

import xml.etree.ElementTree as ET

xml = '''<?xml version='1.0' encoding='UTF-8'?>
<arggraph id="micro_b002" topic_id="higher_dog_poo_fines" stance="pro">
  <edu id="e1"><![CDATA[One can hardly move in Friedrichshain or Neukölln these days without permanently scanning the ground for dog dirt.]]></edu>
  <edu id="e2"><![CDATA[And when bad luck does strike and you step into one of the many 'land mines' you have to painstakingly scrape the remains off your soles.]]></edu>
  <edu id="e3"><![CDATA[Higher fines are therefore the right measure against negligent, lazy or simply thoughtless dog owners.]]></edu>
  <edu id="e4"><![CDATA[Of course, first they'd actually need to be caught in the act by public order officers,]]></edu>
  <edu id="e5"><![CDATA[but once they have to dig into their pockets, their laziness will sure vanish!]]></edu>
  <adu id="a1" type="pro"/>
  <adu id="a2" type="pro"/>
  <adu id="a3" type="pro"/>
  <adu id="a4" type="opp"/>
  <adu id="a5" type="pro"/>
  <edge id="c6" src="e1" trg="a1" type="seg"/>
  <edge id="c7" src="e2" trg="a2" type="seg"/>
  <edge id="c8" src="e3" trg="a3" type="seg"/>
  <edge id="c9" src="e4" trg="a4" type="seg"/>
  <edge id="c10" src="e5" trg="a5" type="seg"/>
  <edge id="c1" src="a1" trg="a3" type="sup"/>
  <edge id="c2" src="a2" trg="a3" type="sup"/>
  <edge id="c4" src="a4" trg="a3" type="reb"/>
  <edge id="c5" src="a5" trg="c4" type="und"/>
</arggraph>'''
root = ET.fromstring(xml)
interesting_edges_src = [e.attrib['src'] for e in root.findall('.//edge') if e.attrib['type'] != 'seg']
print(interesting_edges_src)

Output

['a1', 'a2', 'a4', 'a5']
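
The question also asks for a vector such as (0,0,1,0,0). The following is only a minimal sketch of one way to build it, under two assumptions that the question does not confirm: the vector positions follow the "adu" elements in document order, and a position is set to 1 when that adu's id never appears as the src of a non-"seg" edge. The helper name premise_vector and the shortened sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def premise_vector(root):
    """Return (adu_ids, vector) for one <arggraph> root element.

    Assumption: vector[i] is 1 when adu_ids[i] is never the src of an
    edge whose type is not "seg", and 0 otherwise.
    """
    # ids of all adu elements, in document order
    adu_ids = [adu.attrib["id"] for adu in root.findall(".//adu")]
    # every src value used by a non-"seg" edge
    used_as_src = {
        edge.attrib["src"]
        for edge in root.findall(".//edge")
        if edge.attrib.get("type") != "seg"
    }
    vector = [0 if adu_id in used_as_src else 1 for adu_id in adu_ids]
    return adu_ids, vector


# Shortened sample: only the adu and non-"seg" edge elements from micro_b002
sample = """<arggraph id="micro_b002">
  <adu id="a1" type="pro"/>
  <adu id="a2" type="pro"/>
  <adu id="a3" type="pro"/>
  <adu id="a4" type="opp"/>
  <adu id="a5" type="pro"/>
  <edge id="c1" src="a1" trg="a3" type="sup"/>
  <edge id="c2" src="a2" trg="a3" type="sup"/>
  <edge id="c4" src="a4" trg="a3" type="reb"/>
  <edge id="c5" src="a5" trg="c4" type="und"/>
</arggraph>"""

adu_ids, vector = premise_vector(ET.fromstring(sample))
print(adu_ids)  # ['a1', 'a2', 'a3', 'a4', 'a5']
print(vector)   # [0, 0, 1, 0, 0]
```

For trees that were loaded with ET.parse, as in the question, the same helper could be called as premise_vector(tree.getroot()) for each tree in myList.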
