python - Get text from an XML document
Problem description
I want to get each PMID and, for each PMID, the list of authors from its author list, so that every PMID ends up paired with its own authors.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet SYSTEM "http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">2844048</PMID>
<DateCompleted>
<Year>1988</Year>
<Month>10</Month>
<Day>26</Day>
</DateCompleted>
<DateRevised>
<Year>2010</Year>
<Month>11</Month>
<Day>18</Day>
</DateRevised>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Guarner</LastName>
<ForeName>J</ForeName>
<Initials>J</Initials>
<AffiliationInfo>
<Affiliation>Department of Pathology and Laboratory Medicine, Emory University Hospital, Atlanta, Georgia.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Cohen</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
</AuthorList>
</MedlineCitation>
Because of the tag structure I can fetch each field on its own, but I don't know how to group them.
import xml.etree.ElementTree as ET

tree = ET.parse('x.xml')
root = tree.getroot()
# every PMID in the file, collected into one flat list
pid = []
for pmid in root.iter('PMID'):
    pid.append(pmid.text)
# last names of all authors, not grouped by article
lastname = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for ln in id.findall("./Author/LastName"):
        lastname.append(ln.text)
# forenames of all authors, not grouped by article
forename = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for fn in id.findall("./Author/ForeName"):
        forename.append(fn.text)
# initials of all authors, not grouped by article
initialname = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for i in id.findall("./Author/Initials"):
        initialname.append(i.text)
Expected output
PMID AUTHORS
2844048 'Guarner J J', 'Cohen C C'
Please suggest a possible way to solve this; the real expected output will have many more rows. Thanks in advance.
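Solution
I think I've got it, though it took a while. To make this an interesting exercise, I made a few changes.
First, the XML in your question is invalid: for example, the PubmedArticle and PubmedArticleSet elements are never closed, which any XML validator will confirm.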
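One quick way to confirm this locally is to try parsing the text and catch the parser's error. The following is only a minimal sketch using the standard library; the helper name `is_well_formed` and the two short sample strings are made up for illustration and are not part of the question.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, otherwise (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None

# Hypothetical samples: in `broken`, PubmedArticle is never closed.
broken = "<PubmedArticleSet><PubmedArticle><PMID>2844048</PMID></PubmedArticleSet>"
fixed = ("<PubmedArticleSet><PubmedArticle><PMID>2844048</PMID>"
         "</PubmedArticle></PubmedArticleSet>")

print(is_well_formed(broken))  # first item is False, second is the parser's message
print(is_well_formed(fixed))   # (True, None)
```

This only checks well-formedness (matching tags and so on); it does not validate the document against the PubMed DTD.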
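So first I fixed the XML. I also turned it into a PubmedArticleSet containing 2 articles, the first with 3 authors and the second with 2 (dummy information, obviously), just to make sure the code picks up all of the authors. To keep things simpler, I removed some information that is irrelevant to this exercise, such as the affiliations.
So this is where that leaves us. First, the modified XML: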
source = """
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">2844048</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Guarner</LastName>
<ForeName>J</ForeName>
<Initials>J</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Cohen</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Mushi</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">123456</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Jones</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
"""
Next, import what needs to be imported:
from lxml import etree
import pandas as pd
Next, the code:
doc = etree.fromstring(source)
art_loc = '..//*/PubmedArticle'  # this is the path to all the articles
# count the number of articles in the article set; count() returns a float, so convert it to an integer before use:
num_arts = int(doc.xpath(f'count({art_loc})'))  # or could use len(doc.xpath(f'({art_loc})'))
grand_inf = []  # this list will hold the accumulated information at the end
for art in range(1, num_arts+1):  # can't do range(num_arts) because XPath positions are 1-based while Python counts from 0
    loc_path = f'{art_loc}[{art}]/*/'  # locate the path to each article
    # grab the article id:
    id_path = loc_path+'PMID'
    pmid = doc.xpath(id_path)[0].text
    art_inf = []  # this list holds the information for each article
    art_inf.append(pmid)
    art_path = loc_path+'/Author'  # locate the path to the author group
    # determine the number of authors for this article; again, it's a float which needs to be converted to an integer
    num_auths = int(doc.xpath(f'count({art_path})'))  # again: could use len(doc.xpath(f'({art_path})'))
    auth_inf = []  # this will hold the full name of each of the authors
    for auth in range(1, num_auths+1):
        auth_path = f'{art_path}[{auth}]'  # locate the path to each author
        LastName = doc.xpath(f'{auth_path}/LastName')[0].text
        FirstName = doc.xpath(f'{auth_path}/ForeName')[0].text
        Middle = doc.xpath(f'{auth_path}/Initials')[0].text
        full_name = LastName+' '+FirstName+' '+Middle
        auth_inf.append(full_name)
    art_inf.append(auth_inf)
    grand_inf.append(art_inf)
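As an aside, the same grouping can probably be done without counting nodes or building positional XPath strings, by looping over each PubmedArticle element and searching relative to it. The sketch below is an alternative, not the code above: it uses the standard library's `xml.etree.ElementTree` instead of lxml, and the names `sample`, `authors_by_article` and `rows` are hypothetical. It assumes the layout shown in `source` and treats any missing name part as empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample with the same layout as `source`.
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2844048</PMID>
      <AuthorList>
        <Author><LastName>Guarner</LastName><ForeName>J</ForeName><Initials>J</Initials></Author>
        <Author><LastName>Cohen</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">123456</PMID>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def authors_by_article(xml_text):
    """Return [[pmid, [full names]], ...], one entry per PubmedArticle."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        # first PMID found below this article
        pmid = article.findtext(".//PMID")
        names = []
        for author in article.iter("Author"):
            parts = [author.findtext(tag, default="")
                     for tag in ("LastName", "ForeName", "Initials")]
            # join the parts that are present, skipping empty ones
            names.append(" ".join(p for p in parts if p))
        rows.append([pmid, names])
    return rows

rows = authors_by_article(sample)
print(rows)
# [['2844048', ['Guarner J J', 'Cohen C C']], ['123456', ['Smith C C']]]
```

The result has the same shape as `grand_inf`, so it could be passed to a dataframe in the same way.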
Finally, load this information into a dataframe:
df=pd.DataFrame(grand_inf,columns=['PMID','Author(s)'])
df
Output:
PMID Author(s)
0 2844048 [Guarner J J, Cohen C C, Mushi E F]
1 123456 [Smith C C, Jones E F]
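If the quoted, comma-separated layout from the question is wanted instead of Python lists, the author names can be joined into one string per PMID. This is a small sketch in plain Python; `format_rows` is a hypothetical helper, and `grand_inf` is written out by hand here with the values shown in the output above.

```python
# Same shape as grand_inf in the answer: [PMID, [author names]]
grand_inf = [
    ["2844048", ["Guarner J J", "Cohen C C", "Mushi E F"]],
    ["123456", ["Smith C C", "Jones E F"]],
]

def format_rows(records):
    """Build text lines in the 'PMID AUTHORS' layout from the question."""
    lines = ["PMID AUTHORS"]
    for pmid, authors in records:
        # wrap each name in single quotes and separate names with commas
        quoted = ", ".join(f"'{name}'" for name in authors)
        lines.append(f"{pmid} {quoted}")
    return lines

for line in format_rows(grand_inf):
    print(line)
# PMID AUTHORS
# 2844048 'Guarner J J', 'Cohen C C', 'Mushi E F'
# 123456 'Smith C C', 'Jones E F'
```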
And now we can take a break...