How to extract the context of an XML element using lxml

Problem description

Given the following data structure (Journal Article Tag Suite, JATS), I want to extract the citation context from PubMed Central articles:

<p>
This is a sentence. 
This is a citing sentence [<xref ref-type="bibr" rid="CR1">1</xref>]. 
This is another sentence
</p>

A real example file: https://www.dropbox.com/s/u4g1sisil33wnhu/PMC1914234.xml?dl=0

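For the citation with rid CR1, I want to extract "This is a citing sentence" as the citation context, without the preceding and following sentences.

I can find the paragraph that contains the xref tag, but I don't know how to extract only the right sentence. Applying the XPath functions string() or text() to the paragraph returns only text, without structural information such as tags, so it is hard to locate the exact sentence.

Is there a way to solve this?

Edit: I think I need an XPath expression that extracts the text together with its structural information, rather than using string() and text() on the parent element.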
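To illustrate the difficulty described above, here is a minimal sketch on the sample paragraph. It assumes the paragraph is parsed on its own, and it falls back to the standard library's ElementTree when lxml is not installed, since the calls used here exist in both. The flattened text loses the position of the citation, while the element's text and tail attributes keep it.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

p = etree.fromstring(
    '<p>This is a sentence. '
    'This is a citing sentence [<xref ref-type="bibr" rid="CR1">1</xref>]. '
    'This is another sentence</p>'
)

# Joining itertext() gives the same flattened text as XPath string():
# the boundaries of the xref element are no longer visible.
flat = ''.join(p.itertext())

# The structure survives on the elements themselves.
xref = p.find(".//xref[@rid='CR1']")
before = p.text     # text between <p> and its first child element
after = xref.tail   # text between </xref> and the next tag

print(repr(flat))
print(repr(before))
print(repr(after))
```

In this sample the xref is the first child of the paragraph, so p.text is exactly the text that precedes the citation. In a real article other inline elements may come first, and their text and tail would have to be collected as well.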

Tags: python, xml, lxml

Solution


If you have the ref id, e.g. <ref id="some ID">:

You can use Beautiful Soup to look up the data by that ref id.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup
import re


fName = 'PMC1914234.xml'
rid = 'CR1'

# The pattern, piece by piece:
#   .*?\]        lazily skip everything up to and including the first ']'
#                ('.' matches any character except a newline; '?' makes '*' non-greedy)
#   (.*)         group 1: greedily capture what follows on the same line
#   [\,\.\]\ ]   one closing delimiter: ',', '.', ']' or a space
#                (change this class if your text ends with another character)
#   .*           the rest of the line
pattern = r'.*?\](.*)[\,\.\]\ ].*'

with open(fName) as f:
    soup = BeautifulSoup(f, "xml")

# find all paragraphs (<p>)
paragraphs = soup.find_all('p')

# loop through each one
for text in paragraphs:
    # look for a descendant tag whose rid attribute equals the one specified
    found = text.find(rid=rid)
    if found:
        print(text.text)
        # Capture the text after the first ']' on a line. If no closing delimiter follows it, extend the character class in the pattern.
        match = re.search(pattern, text.text)
        if match:  # re.search returns None when nothing matches
            print(match.group(1))  # group 1 is the only capture group in this pattern
# >>> However, when these pathways are deranged because of genetic mutations of enzymes involved or relative deficiencies of folate, vitamin B6 or vitamin B12, the serum concentration of Hcy increases.
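
Since the question asks about lxml specifically, the same idea can also be sketched without Beautiful Soup. This is not part of the answer above: citing_sentences is a hypothetical helper, and the sentence splitter is a deliberately naive assumption ('.', '!' or '?' followed by whitespace) that will break on abbreviations such as "et al.". The sketch walks the paragraph in document order, records the character offset at which each matching xref starts, and returns the sentences whose span contains one of those offsets.

```python
import re

try:
    from lxml import etree  # the library used in the question
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree


def citing_sentences(paragraph, rid):
    """Return the sentences of `paragraph` containing an xref with this rid.

    Hypothetical helper; sentence boundaries are guessed naively.
    """
    parts = []    # text pieces in document order
    offsets = []  # offsets in the flattened text where a matching xref starts
    length = 0

    def add(text):
        nonlocal length
        if text:
            parts.append(text)
            length += len(text)

    def walk(elem):
        if not isinstance(elem.tag, str):
            return  # lxml comment or processing instruction: skip its content
        if elem.tag == 'xref' and elem.get('rid') == rid:
            offsets.append(length)
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text that follows the child's closing tag

    walk(paragraph)
    flat = ''.join(parts)

    # Naive sentence spans: split on whitespace that follows '.', '!' or '?'.
    spans, start = [], 0
    for m in re.finditer(r'(?<=[.!?])\s+', flat):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(flat)))

    return [flat[s:e].strip() for s, e in spans
            if any(s <= o < e for o in offsets)]


sample = (
    '<p>\n'
    'This is a sentence. \n'
    'This is a citing sentence [<xref ref-type="bibr" rid="CR1">1</xref>]. \n'
    'This is another sentence\n'
    '</p>'
)
p = etree.fromstring(sample)
result = citing_sentences(p, 'CR1')
print(result)  # ['This is a citing sentence [1].']
```

For a whole article, the helper could be applied to every p element of the parsed file, for example by iterating over etree.parse(fName).iter('p'). The returned sentence still contains the citation marker "[1]", which can be stripped afterwards if only the surrounding words are wanted.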
