Getting the content of a specific tag in XML using ElementTree

Problem description

Here is my XML data:

<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
  <PMID Version="1">1883738</PMID>
  <DateCompleted>
    <Year>1991</Year>
    <Month>10</Month>
    <Day>07</Day>
  </DateCompleted>
  <DateRevised>
    <Year>2013</Year>
    <Month>11</Month>
    <Day>21</Day>
  </DateRevised>
  <Article PubModel="Print">
    <Journal>
      <ISSN IssnType="Print">0959-9673</ISSN>
      <JournalIssue CitedMedium="Print">
        <Volume>72</Volume>
        <Issue>4</Issue>
        <PubDate>
          <Year>1991</Year>
          <Month>Aug</Month>
        </PubDate>
      </JournalIssue>
      <Title>International journal of experimental pathology</Title>
      <ISOAbbreviation>Int J Exp Pathol</ISOAbbreviation>
    </Journal>
    <ArticleTitle>The effect of HeNe laser radiation on the thyroid gland of the rat.</ArticleTitle>
    <Pagination>
      <MedlinePgn>379-85</MedlinePgn>
    </Pagination>
    <Abstract>
      <AbstractText>Although laser irradiation is becoming common practice in medicine, there is not always a clear understanding of the possible side-effects. The present report is a light and electron microscopic study of the effects of fixed low intensity doses of soft HeNe laser on the thyroid of Wistar rats. The immediate effects are mild multifocal degenerative changes; these lesions recover in less than 3 months. Long-term lesions are identified only by electron microscopy; they consist of an increased number of peroxisomes and free or intramitochondrial crystalline structures. We discuss the laser's hypothetical functions.</AbstractText>
    </Abstract>
    <AuthorList CompleteYN="Y">
      <Author ValidYN="Y">
        <LastName>Lerma</LastName>
        <ForeName>E</ForeName>
        <Initials>E</Initials>
        <AffiliationInfo>
          <Affiliation>Department of Pathology and Radiology, Hospital Universitario Virgen Macarena, University of Seville, Spain.</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Hevia</LastName>
        <ForeName>A</ForeName>
        <Initials>A</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Rodrigo</LastName>
        <ForeName>P</ForeName>
        <Initials>P</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Gonzalez-Campora</LastName>
        <ForeName>R</ForeName>
        <Initials>R</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Armas</LastName>
        <ForeName>J R</ForeName>
        <Initials>JR</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Galera</LastName>
        <ForeName>H</ForeName>
        <Initials>H</Initials>
      </Author>
    </AuthorList>
    <Language>eng</Language>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList>
  </Article>
  <MedlineJournalInfo>
    <Country>England</Country>
    <MedlineTA>Int J Exp Pathol</MedlineTA>
    <NlmUniqueID>9014042</NlmUniqueID>
    <ISSNLinking>0959-9673</ISSNLinking>
  </MedlineJournalInfo>
  <ChemicalList>
    <Chemical>
      <RegistryNumber>06LU7C9H1V</RegistryNumber>
      <NameOfSubstance UI="D014284">Triiodothyronine</NameOfSubstance>
    </Chemical>
    <Chemical>
      <RegistryNumber>Q51BO43MG4</RegistryNumber>
      <NameOfSubstance UI="D013974">Thyroxine</NameOfSubstance>
    </Chemical>
  </ChemicalList>
  <CitationSubset>IM</CitationSubset>
  <CommentsCorrectionsList>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Histochem Cytochem. 1969 Oct;17(10):675-80</RefSource>
      <PMID Version="1">4194356</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Acta Anat (Basel). 1986;125(1):10-3</RefSource>
      <PMID Version="1">3953239</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Anat Anz. 1977;142(3):209-12</RefSource>
      <PMID Version="1">603070</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Cell Biol. 1964 Nov;23:383-5</RefSource>
      <PMID Version="1">14222822</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Cell Biol. 1967 Jun;33(3):605-23</RefSource>
      <PMID Version="1">6036524</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Am J Med. 1983 May;74(5):852-62</RefSource>
      <PMID Version="1">6837608</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Exp Eye Res. 1977 Jan;24(1):45-56</RefSource>
      <PMID Version="1">402283</PMID>
    </CommentsCorrections>
  </CommentsCorrectionsList>
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D007834" MajorTopicYN="N">Lasers</DescriptorName>
      <QualifierName UI="Q000009" MajorTopicYN="Y">adverse effects</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008297" MajorTopicYN="N">Male</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008830" MajorTopicYN="N">Microbodies</DescriptorName>
      <QualifierName UI="Q000528" MajorTopicYN="N">radiation effects</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008854" MajorTopicYN="N">Microscopy, Electron</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D051381" MajorTopicYN="N">Rats</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D011919" MajorTopicYN="N">Rats, Inbred Strains</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D013961" MajorTopicYN="N">Thyroid Gland</DescriptorName>
      <QualifierName UI="Q000528" MajorTopicYN="Y">radiation effects</QualifierName>
      <QualifierName UI="Q000648" MajorTopicYN="N">ultrastructure</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D013974" MajorTopicYN="N">Thyroxine</DescriptorName>
      <QualifierName UI="Q000097" MajorTopicYN="N">blood</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D014284" MajorTopicYN="N">Triiodothyronine</DescriptorName>
      <QualifierName UI="Q000097" MajorTopicYN="N">blood</QualifierName>
    </MeshHeading>
  </MeshHeadingList>
  <OtherID Source="NLM">PMC2001961</OtherID>
</MedlineCitation>
<PubmedData>

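I need to extract all the authors' last names from the document. However, there are multiple such files, each with different author names. How can I parse this file and extract only the author last names into a list, in order to build a database?

I have used ElementTree to parse the document. Here is my code: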

tree = ET.parse("file path"+file)
doc = tree.getroot()
for LastName in doc.iter('LastName'):
    file1 = (ET.tostring(LastName, encoding='utf8').decode('utf8'))
    file2 = file1[48:(len(file1))]
    author_name_lastname = file2.split("<")[0]
    print(author_name_lastname)

With this code I can only print the first author's name, "Lerma".

Tags: python, python-3.x, xml-parsing, elementtree

Solution


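import os
from lxml import etree as ET

# Placeholder directory; the raw string keeps the backslash literal
DIR = r"D:\yourfilesdirectory"

for filename in os.listdir(DIR):
    if filename.endswith(".xml"):
        # Let lxml open the file itself so the XML encoding declaration is honoured
        _tree = ET.parse(os.path.join(DIR, filename))
        # './/LastName' matches every LastName element at any depth
        _all_metadata_tags = _tree.xpath('.//LastName')
        for i in _all_metadata_tags:
            # .text holds the element's content, e.g. "Lerma"
            print(i.text)
    else:
        print(f"skipping {filename}")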
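The same result should be reachable with the standard-library ElementTree that the question already uses, without serialising each element and slicing the string. The following is a minimal sketch, not a definitive implementation: the helper name extract_last_names and the shortened SAMPLE document are assumptions made for illustration, with only the Author/LastName nesting taken from the XML in the question. It reads each element's .text attribute and collects the values into a list, which is the shape the question asks for.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample that mirrors the Author/LastName nesting
# of the XML in the question (not the full record).
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y"><LastName>Lerma</LastName><ForeName>E</ForeName></Author>
        <Author ValidYN="Y"><LastName>Hevia</LastName><ForeName>A</ForeName></Author>
        <Author ValidYN="Y"><LastName>Rodrigo</LastName><ForeName>P</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def extract_last_names(root):
    """Return the text of every LastName element below root, in document order."""
    # iter() walks the whole subtree; .text is the element's content,
    # so no serialising or string slicing is needed. Empty elements are skipped.
    return [elem.text for elem in root.iter("LastName") if elem.text]


root = ET.fromstring(SAMPLE)
last_names = extract_last_names(root)
print(last_names)
```

For files on disk, the root passed to the helper would come from ET.parse(path).getroot() instead of ET.fromstring, and the per-file lists can be concatenated before loading them into a database.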
