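python - How to extract the context of an XML element with lxml
Question
Given the following data structure (Journal Article Tag Suite, JATS), I want to extract the citation contexts from PubMed Central articles: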
<p>
This is a sentence.
This is a citing sentence [<xref ref-type="bibr" rid="CR1">1</xref>].
This is another sentence
</p>
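A real example file: https://www.dropbox.com/s/u4g1sisil33wnhu/PMC1914234.xml?dl=0
For the reference with rid CR1, I want to extract only "This is a citing sentence" as its citation context, without the preceding and following sentences.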
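I can find the paragraph that contains the xref tag, but I don't know how to extract just the right sentence. Applying the XPath functions string() or text() to that paragraph returns only the text, with no structural information (such as tags), so it is hard to pin down the exact sentence.
Is there a way to solve this?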
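Edit: I think I need an XPath expression that extracts the text together with its structural information, instead of using string() and text() on the parent element.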
Solution
If you have the ref id (<ref id="some ID">), you can use Beautiful Soup to easily look up data by that id.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
import re

fName = 'PMC1914234.xml'
rid = 'CR1'

# How the pattern reads:
#   .*?\]       lazily skip everything up to and including the first ']'
#   (.*)        group 1: greedily capture the text that follows
#   [\,\.\]\ ]  one delimiter: comma, period, ']' or space
#               (add other characters to this set if required)
#   .*          the rest of the line ('.' does not match a newline)
pattern = r'.*?\](.*)[\,\.\]\ ].*'

# the "xml" parser of Beautiful Soup requires lxml to be installed
with open(fName) as f:
    soup = BeautifulSoup(f, "xml")

# find all paragraphs (<p>) and loop through each one
for paragraph in soup.find_all('p'):
    # look for a descendant whose rid attribute matches the one specified
    found = paragraph.find(rid=rid)
    if found:
        print(paragraph.text)
        # re.search returns None when nothing matches, so check before using the groups
        match = re.search(pattern, paragraph.text)
        if match:
            print(match.group(1))  # group 1 is the text captured after the first ']'

# Output reported for the example file:
# >>> However, when these pathways are deranged because of genetic mutations of enzymes involved or relative deficiencies of folate, vitamin B6 or vitamin B12, the serum concentration of Hcy increases.
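Since the question asks about lxml specifically, here is a minimal sketch of another way to keep the structural information: walk the paragraph through the .text and .tail attributes, record the character offset at which the matching xref starts, and then pick the sentence that spans that offset. The function names flatten and citing_sentences are hypothetical, and the sentence splitter (a period, exclamation mark or question mark followed by whitespace) is a simplifying assumption, not something taken from the question or the answer above. The sketch only uses the ElementTree-compatible part of the lxml API, so it falls back to the standard library when lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:
    # The stdlib ElementTree shares the .text/.tail model used below.
    import xml.etree.ElementTree as etree


def flatten(p, rid):
    """Return the plain text of p and the offsets where <xref rid=...> starts."""
    parts, offsets, length = [], [], 0

    def add(s):
        nonlocal length
        if s:
            parts.append(s)
            length += len(s)

    def walk(el):
        if not isinstance(el.tag, str):
            return  # skip comments and processing instructions
        if el.tag == 'xref' and el.get('rid') == rid:
            offsets.append(length)  # position of the citation in the flat text
        add(el.text)
        for child in el:
            walk(child)
            add(child.tail)  # text that follows the child inside el

    walk(p)
    return ''.join(parts), offsets


def citing_sentences(p, rid):
    """Return the sentences of p that contain an <xref> with the given rid."""
    text, offsets = flatten(p, rid)
    # Naive splitter (assumption): a sentence ends at . ! or ? plus whitespace.
    cuts = [m.end() for m in re.finditer(r'(?<=[.!?])\s+', text)]
    bounds = [0] + cuts + [len(text)]
    result = []
    for off in offsets:
        for start, end in zip(bounds, bounds[1:]):
            if start <= off < end:
                sentence = ' '.join(text[start:end].split())
                if sentence not in result:
                    result.append(sentence)
                break
    return result


xml = """<p>
This is a sentence.
This is a citing sentence [<xref ref-type="bibr" rid="CR1">1</xref>].
This is another sentence
</p>"""
p = etree.fromstring(xml)
print(citing_sentences(p, 'CR1'))  # ['This is a citing sentence [1].']
```

For a whole article, the same function could be called on every paragraph, for example by iterating over etree.parse(fName).getroot().iter('p'). The naive splitter will cut at abbreviations such as "et al.", so a proper sentence tokenizer may be needed for real JATS text.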