Records are being dropped while parsing through XML

Problem description

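I'm parsing my XML into a pandas DataFrame, but I'm losing records in the process. Not every record has every attribute. When one is missing, I noticed that the record (the row in the DataFrame) is dropped from the DataFrame instead of the missing value being replaced with None.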

Is there a way to mitigate this? I can't seem to find a solution.

I've pasted my code below for reference:

import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('20191125_DMG_PI.xml')
root = tree.getroot()

df_cols = ["status",
           "priref",
           "full_name",
           "achternaam",
           "geboorteplaats",
           "sterfplaats",
           "detail",
           "adres",
           "zip",
           "note",
           "gender"]
rows = []

for record in root:
    for child in record:
        s_priref = ""
        s_priref = child.get('priref')
    for child in record:
        s_name_note = ""
        s_name_note = child.get('name.note')
    for child in record:
        s_surname = ""
        s_surname = child.find('surname')

        for field in child.findall('Address'):
            s_adress = ""
            s_address = field.find('address').text if field is not None else None
        for field in child.findall('Address'):
            s_zip = ""
            s_zip = field.find('address.postal_code').text if field is not None else None
        for field in child.findall('name'):
            s_full_name = ""
            s_full_name = field.find('value').text if field is not None else None
        for field in child.findall('name.status'):
            s_status = ""
            s_status = field.find('value').text if field is not None else None
        for field in child.findall('level_of_detail'):
            s_detail = ""
            s_detail = field.tag + ": " + field.find('value').text if field is not None else None
        for field in child.findall('gender'):
            s_gender = ""
            s_gender = field.find('value').text

        for field in child.findall('birth.place'):
            s_gbp = ""
            s_gbp = field.find('value').text if field is not None else None
        for field in child.findall('death.place'):
            s_pvo = ""
            if len(field.findall('death.place')) == 0:
                s_pvo = "NaN"
            else:
                s_pvo = field.find('value').text if field is not None else None

            rows.append({"status": s_status,
                         "priref": s_priref,
                         "full_name": s_full_name,
                         "achternaam": s_surname,
                         "geboorteplaats": s_gbp,
                         "sterfplaats": s_pvo,
                         "detail": s_detail,
                         "adres": s_address,
                         "zip": s_zip,
                         "note": s_name_note,
                         "gender": s_gender
                         })

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)

The first three records look like this:

<recordList><record priref="530000001" creation="2014-06-23T11:36:18" modification="2019-09-13T09:07:12">
  <name>
    <value lang="">C.I.A.P.</value>
  </name>
  <name.type>
    <value lang="neutral">ACQUISITIONSOURCE</value>
    <value lang="0">acquisition source</value>
    <value lang="1">verwervingsbron</value>
    <value lang="2">source d'acquisition</value>
    <value lang="3">Erwerbungsquelle</value>
    <value lang="5">fonte di acquisizione</value>
    <value lang="6">πηγή απόκτησης</value>
  </name.type>
  <name.type>
    <value lang="neutral">INST</value>
    <value lang="0">institution</value>
    <value lang="1">instelling</value>
    <value lang="2">institution</value>
    <value lang="3">Institution</value>
    <value lang="4">المؤسسة</value>
    <value lang="5">istituto</value>
    <value lang="6">οργανισμός</value>
  </name.type>
  <name.status>
    <value lang="neutral">1</value>
    <value lang="0">approved preferred term</value>
    <value lang="1">descriptor</value>
    <value lang="2">descripteur</value>
    <value lang="3">Deskriptor</value>
    <value lang="5">termine preferenziale approvato</value>
  </name.status>
  <Address>
    <address>Lombaardstraat 23</address>
    <address.country>
      <value lang="">België</value>
    </address.country>
    <address.place>
      <value lang="">Hasselt</value>
    </address.place>
    <address.postal_code>3500</address.postal_code>
    <address.type />
  </Address>
  <level_of_detail>
    <value lang="neutral">PARTIAL</value>
    <value lang="0">partial</value>
    <value lang="1">partieel</value>
    <value lang="2">partiel</value>
    <value lang="3">partiell</value>
    <value lang="5">parziale</value>
  </level_of_detail>
  <birth.place>
    <value lang="">Hasselt</value>
  </birth.place>
  <id_number>53</id_number>
  <supplier.letter.processing>
    <value lang="neutral">PRINT</value>
    <value lang="0">Print to documents</value>
    <value lang="1">Afdrukken naar documenten</value>
    <value lang="2">Imprimer en documents</value>
    <value lang="3">Ausdruck in Dokumenten</value>
    <value lang="5">Stampa nei documenti</value>
  </supplier.letter.processing>
  <name.note>Centrum voor Informatie en Aktueel Prentenkabinet</name.note>
  <Place_activity>
    <place_activity.institution />
    <place_activity.type />
    <place_activity>
      <value lang="">Hasselt</value>
    </place_activity>
    <place_activity.notes />
    <place_activity.date.end />
    <place_activity.date.start />
  </Place_activity>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-13</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>09:07:12</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>13:15:16</edit.time>
  </Edit>
</record><record priref="530000003" creation="2014-06-23T11:36:18" modification="2019-09-13T09:02:51">
  <name>
    <value lang="">Goossens, K.</value>
  </name>
  <name.type>
    <value lang="neutral">ACQUISITIONSOURCE</value>
    <value lang="0">acquisition source</value>
    <value lang="1">verwervingsbron</value>
    <value lang="2">source d'acquisition</value>
    <value lang="3">Erwerbungsquelle</value>
    <value lang="5">fonte di acquisizione</value>
    <value lang="6">πηγή απόκτησης</value>
  </name.type>
  <name.type>
    <value lang="neutral">PERSON</value>
    <value lang="0">person</value>
    <value lang="1">persoon</value>
    <value lang="2">personne</value>
    <value lang="3">Person</value>
    <value lang="4">إسم شخص</value>
    <value lang="5">persona</value>
    <value lang="6">πρόσωπο</value>
  </name.type>
  <name.status>
    <value lang="neutral">1</value>
    <value lang="0">approved preferred term</value>
    <value lang="1">descriptor</value>
    <value lang="2">descripteur</value>
    <value lang="3">Deskriptor</value>
    <value lang="5">termine preferenziale approvato</value>
  </name.status>
  <surname>Goossens</surname>
  <Address>
    <address>Morckhovelei</address>
    <address.country>
      <value lang="">België</value>
    </address.country>
    <address.place>
      <value lang="">Borgerhout</value>
    </address.place>
    <address.postal_code />
    <address.type />
  </Address>
  <nationality>
    <value lang="">Belgisch</value>
  </nationality>
  <level_of_detail>
    <value lang="neutral">PARTIAL</value>
    <value lang="0">partial</value>
    <value lang="1">partieel</value>
    <value lang="2">partiel</value>
    <value lang="3">partiell</value>
    <value lang="5">parziale</value>
  </level_of_detail>
  <forename>K.</forename>
  <gender>
    <value lang="neutral">FEMALE</value>
    <value lang="0">female</value>
    <value lang="1">vrouw</value>
    <value lang="2">femme</value>
    <value lang="3">weiblich</value>
    <value lang="5">femmina</value>
    <value lang="6">θηλυκό</value>
  </gender>
  <id_number>53</id_number>
  <supplier.letter.processing>
    <value lang="neutral">PRINT</value>
    <value lang="0">Print to documents</value>
    <value lang="1">Afdrukken naar documenten</value>
    <value lang="2">Imprimer en documents</value>
    <value lang="3">Ausdruck in Dokumenten</value>
    <value lang="5">Stampa nei documenti</value>
  </supplier.letter.processing>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-13</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>09:02:51</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>13:21:05</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>13:20:03</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>13:19:45</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-09-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>13:19:16</edit.time>
  </Edit>
</record><record priref="530000004" creation="2014-06-23T11:36:18" modification="2019-07-19T09:55:26">
  <name>
    <value lang="">De Bruyne, Pieter</value>
  </name>
  <name.type>
    <value lang="neutral">MAKER</value>
    <value lang="0">creator</value>
    <value lang="1">vervaardiger</value>
    <value lang="2">créateur</value>
    <value lang="3">Hersteller</value>
    <value lang="4">الصانع</value>
    <value lang="5">creatore</value>
    <value lang="6">δημιουργός</value>
  </name.type>
  <name.type>
    <value lang="neutral">ACQUISITIONSOURCE</value>
    <value lang="0">acquisition source</value>
    <value lang="1">verwervingsbron</value>
    <value lang="2">source d'acquisition</value>
    <value lang="3">Erwerbungsquelle</value>
    <value lang="5">fonte di acquisizione</value>
    <value lang="6">πηγή απόκτησης</value>
  </name.type>
  <name.type>
    <value lang="neutral">PERSON</value>
    <value lang="0">person</value>
    <value lang="1">persoon</value>
    <value lang="2">personne</value>
    <value lang="3">Person</value>
    <value lang="4">إسم شخص</value>
    <value lang="5">persona</value>
    <value lang="6">πρόσωπο</value>
  </name.type>
  <name.type>
    <value lang="neutral">AUTHOR</value>
    <value lang="0">author</value>
    <value lang="1">auteur</value>
    <value lang="2">auteur</value>
    <value lang="3">Verfasser</value>
    <value lang="4">المؤلف</value>
    <value lang="5">autore</value>
    <value lang="6">συντάκτης</value>
  </name.type>
  <birth.date.start>1931</birth.date.start>
  <death.date.start>1987</death.date.start>
  <name.status>
    <value lang="neutral">1</value>
    <value lang="0">approved preferred term</value>
    <value lang="1">descriptor</value>
    <value lang="2">descripteur</value>
    <value lang="3">Deskriptor</value>
    <value lang="5">termine preferenziale approvato</value>
  </name.status>
  <surname>De Bruyne</surname>
  <Address>
    <address>Stationstraat 16</address>
    <address.country>
      <value lang="">België</value>
    </address.country>
    <address.place>
      <value lang="">Aalst</value>
    </address.place>
    <address.postal_code>9300</address.postal_code>
    <address.type>woning Pieter De Bruyne</address.type>
  </Address>
  <biography>Pieter De Bruyne is als pionier binnen het postmodern ontwerpen een internationaal geapprecieerde meubelontwerper. Hij wijdde zijn hele leven aan de vernieuwing van het meubilair. De Bruynes werk sluit aan bij de Memphis-stijl, hoewel hij nooit actief deel wilde uitmaken van dergelijke bewegingen. Elk meubel van zijn hand opent nieuwe perspectieven en is stimulans om andere denkrichtingen in te slaan. 

Bibliotheek Design museum Gent:     
(1) Pieter De Bruyne 1931- 1987. Pionier van het postmoderne.  / Christian Kieckens, Eva Storgaard
(2) 25 jaar Pieter De Bruyne. / Christian Norberg-Schulz</biography>
  <Source>
    <source>http://vocab.getty.edu/page/ulan/</source>
    <source.number>500009402</source.number>
  </Source>
  <Source>
    <source>https://www.wikidata.org/wiki/</source>
    <source.number>Q14101030</source.number>
  </Source>
  <death.date.end>1987</death.date.end>
  <death.place>
    <value lang="">Aalst</value>
  </death.place>
  <nationality>
    <value lang="">Belgisch</value>
  </nationality>
  <level_of_detail>
    <value lang="neutral">FULL</value>
    <value lang="0">full</value>
    <value lang="1">volledig</value>
    <value lang="2">complet</value>
    <value lang="3">vollständig</value>
    <value lang="5">completo</value>
  </level_of_detail>
  <forename>Pieter</forename>
  <birth.date.end>1931</birth.date.end>
  <birth.place>
    <value lang="">Aalst</value>
  </birth.place>
  <gender>
    <value lang="neutral">MALE</value>
    <value lang="0">male</value>
    <value lang="1">man</value>
    <value lang="2">homme</value>
    <value lang="3">männlich</value>
    <value lang="5">maschio</value>
    <value lang="6">αρσενικό</value>
  </gender>
  <occupation>
    <value lang="">ontwerper</value>
  </occupation>
  <Part_of>
    <part_of>
      <value lang="">Pieter De Bruyne N.V.</value>
    </part_of>
    <part_of.notes />
    <part_of.category />
    <part_of.date.end />
    <part_of.date.start />
  </Part_of>
  <Equivalent>
    <equivalent_name>
      <value lang="">Pieter De Bruyne N.V.</value>
    </equivalent_name>
    <equivalent_name.category />
  </Equivalent>
  <id_number>53</id_number>
  <supplier.letter.processing>
    <value lang="neutral">PRINT</value>
    <value lang="0">Print to documents</value>
    <value lang="1">Afdrukken naar documenten</value>
    <value lang="2">Imprimer en documents</value>
    <value lang="3">Ausdruck in Dokumenten</value>
    <value lang="5">Stampa nei documenti</value>
  </supplier.letter.processing>
  <school_style>
    <value lang="">post-modernisme</value>
  </school_style>
  <language>
    <value lang="">Nederlands</value>
  </language>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-07-19</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>09:55:26</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-07-19</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>09:55:24</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-07-17</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>11:24:24</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-06-18</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>11:54:47</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-06-12</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>11:44:02</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-05-28</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>08:20:09</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-05-27</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>10:44:41</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-05-13</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>14:24:58</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-05-13</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>14:23:25</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>people&gt;people</edit.source>
    <edit.date>2019-04-23</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>16:12:25</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>thesau&gt;thesau</edit.source>
    <edit.date>2019-04-18</edit.date>
    <edit.name>ovandhuynslager</edit.name>
    <edit.time>15:19:53</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>COLLECT&gt;intern</edit.source>
    <edit.date>2016-09-26</edit.date>
    <edit.name>rgoris</edit.name>
    <edit.time>10:58:19</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>COLLECT&gt;intern</edit.source>
    <edit.date>2016-09-26</edit.date>
    <edit.name>rgoris</edit.name>
    <edit.time>10:57:40</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>COLLECT&gt;intern</edit.source>
    <edit.date>2016-09-26</edit.date>
    <edit.name>rgoris</edit.name>
    <edit.time>10:50:49</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>COLLECT&gt;intern</edit.source>
    <edit.date>2016-09-26</edit.date>
    <edit.name>rgoris</edit.name>
    <edit.time>10:21:40</edit.time>
  </Edit>
  <Edit>
    <edit.notes />
    <edit.source>COLLECT&gt;intern</edit.source>
    <edit.date>2016-09-26</edit.date>
    <edit.name>rgoris</edit.name>
    <edit.time>10:20:30</edit.time>
  </Edit>

Tags: python, xml, pandas, dataframe, elementtree

Solution


You can greatly simplify the part of your code that deals with the XML by switching to XPath as the way of locating any given node. Consider this:

import xml.etree.ElementTree as et

def node_text(node, default=''):
    return node.text if node is not None and node.text is not None else default

tree = et.parse('20191125_DMG_PI.xml')

rows = []
for record in tree.iterfind('./record'):
    rows.append({
        'status':         node_text(record.find('./name.status/value')),
        'priref':         record.get('priref'),
        'full_name':      node_text(record.find('./name/value')),
        'achternaam':     node_text(record.find('./surname')),
        'geboorteplaats': node_text(record.find('./birth.place/value')),
        'sterfplaats':    node_text(record.find('./death.place/value')),
        'detail':         node_text(record.find('./level_of_detail/value[@lang="neutral"]')),
        'adres':          node_text(record.find('./Address/address')),
        'zip':            node_text(record.find('./Address/address.postal_code')),
        'note':           node_text(record.find('./name.note')),
        'gender':         node_text(record.find('./gender/value'))
    })

print(rows)

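The node_text() helper function at the top handles the "node not found" case. You can use None as the default if you prefer that over the empty string, or pass an individual default for each value.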
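
As a rough illustration of those defaults, here is a minimal sketch. The one-record XML string is a hypothetical sample trimmed down from the records in the question, and the names surname, zip_default, place_none and place_custom exist only for this example.

```python
import xml.etree.ElementTree as et

def node_text(node, default=''):
    # Return the node's text, or the default when the node or its text is missing.
    return node.text if node is not None and node.text is not None else default

# Hypothetical sample: has a surname, an empty postal code, and no death.place.
record = et.fromstring(
    '<record priref="530000003">'
    '<surname>Goossens</surname>'
    '<Address><address.postal_code /></Address>'
    '</record>'
)

# Element present with text: the text is returned.
surname = node_text(record.find('./surname'))
# Element present but empty: .text is None, so the default '' is returned.
zip_default = node_text(record.find('./Address/address.postal_code'))
# Element absent: find() returns None, so the explicit default is returned.
place_none = node_text(record.find('./death.place/value'), None)
place_custom = node_text(record.find('./death.place/value'), 'unknown')

print(surname, repr(zip_default), place_none, place_custom)
```

Either way the record still yields a value for every column, so nothing has to be skipped when a field is missing.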

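XPath in ElementTree is evaluated relative to the element you call it on, which is why the paths start with ./ (an absolute path starting with / is rejected on an element). It is also limited to a subset of what XPath 1.0 can do, but that is more than enough for your use case.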
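
To give a feel for that subset, the sketch below tries a few path expressions on a small hypothetical fragment modelled on the level_of_detail element from the question. It shows only a handful of supported features, not the full list.

```python
import xml.etree.ElementTree as et

# Hypothetical fragment modelled on <level_of_detail> in the question.
record = et.fromstring(
    '<record>'
    '<level_of_detail>'
    '<value lang="neutral">PARTIAL</value>'
    '<value lang="0">partial</value>'
    '</level_of_detail>'
    '</record>'
)

# find() returns the first match of a relative path.
first = record.find('./level_of_detail/value').text

# Attribute predicates are part of the supported subset.
neutral = record.find('./level_of_detail/value[@lang="neutral"]').text
english = record.find('./level_of_detail/value[@lang="0"]').text

# findall() returns every match in document order.
all_values = [v.text for v in record.findall('./level_of_detail/value')]

# An absolute path on an element raises SyntaxError.
try:
    record.find('/level_of_detail')
    absolute_ok = True
except SyntaxError:
    absolute_ok = False

print(first, neutral, english, all_values, absolute_ok)
```

The [@lang="neutral"] predicate is what the 'detail' column above relies on to pick one specific value out of several siblings.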

Getting rows into a DataFrame after that should no longer be a problem.
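
For context on why rows went missing in the first place: in the question's loop, rows.append() is nested inside the loop over death.place elements, so a record without that element never produces a row. The sketch below reproduces both patterns on a hypothetical two-record sample; nested_rows and flat_rows are illustrative names, and this node_text() variant defaults to None rather than the empty string.

```python
import xml.etree.ElementTree as et

def node_text(node, default=None):
    # Same helper as above, but defaulting to None for this illustration.
    return node.text if node is not None and node.text is not None else default

# Hypothetical sample: only the second record has a <death.place>.
root = et.fromstring(
    '<recordList>'
    '<record priref="530000001"><name><value>C.I.A.P.</value></name></record>'
    '<record priref="530000004"><name><value>De Bruyne, Pieter</value></name>'
    '<death.place><value>Aalst</value></death.place></record>'
    '</recordList>'
)

# Pattern from the question: append() sits inside the death.place loop,
# so records without <death.place> are silently skipped.
nested_rows = []
for record in root.iterfind('./record'):
    for field in record.findall('death.place'):
        nested_rows.append({'priref': record.get('priref'),
                            'sterfplaats': node_text(field.find('value'))})

# Pattern from the answer: exactly one dict per record, missing values filled in.
flat_rows = [
    {'priref': record.get('priref'),
     'sterfplaats': node_text(record.find('./death.place/value'))}
    for record in root.iterfind('./record')
]

print(len(nested_rows), len(flat_rows))
```

With one dict per record, the pd.DataFrame(rows, columns=df_cols) call from the question should then produce one row per record, with the missing fields left empty instead of the row disappearing.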

