Extracting information from XML files in Python

Problem description

I want to extract information from several XML files like this one:

https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b001.xml

I only want to extract the information in this tag:

<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">

That is, "micro_b001" and "waste_separation".

I want to save them in a list.

I tried this:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []
# read and parse every .xml file found under path
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

The code above works; it gives me one parsed tree object per file:

<xml.etree.ElementTree.ElementTree at 0x21c893e34c0>,

But the following does not look right:

for k in myList:
    arg= [e.attrib['stance'] for e in k.findall('.//arggraph')  ]
    print(arg)

The second snippet does not give me the values I need.

Tags: python, xml, nlp

Solution


One way to handle this:

from lxml import etree
tree = etree.parse("myfile.xml")  # the file name must be passed as a string
for graph in tree.xpath('//arggraph'):
    print(graph.xpath('@id')[0])
    print(graph.xpath('@topic_id')[0])

Output:

micro_b001
waste_separation
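
The attempt in the question most likely returns empty lists because, in the standard library's ElementTree, a path such as './/arggraph' only searches below the root element, and in the linked file <arggraph> appears to be the root element itself. A minimal sketch using only the standard library, under that assumption, reads the attributes directly from the root and collects them in a list. The function name collect_graph_attributes is hypothetical and chosen only for illustration.

```python
import os
import xml.etree.ElementTree as ET


def collect_graph_attributes(path):
    """Return a list of (id, topic_id) tuples, one per .xml file under path."""
    results = []
    for root_dir, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.endswith(".xml"):
                continue
            tree = ET.parse(os.path.join(root_dir, name))
            # Assumption: <arggraph> is the root element, as in the linked file
            graph = tree.getroot()
            if graph.tag == "arggraph":
                # .get() returns None instead of raising if an attribute is missing
                results.append((graph.get("id"), graph.get("topic_id")))
    return results
```

Calling collect_graph_attributes(path) with the same path as in the question should return one tuple per file. If a flat list is preferred, the two values can be added with results.extend(...) instead.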
