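python - Records are being dropped when parsing XML
Problem description
I am parsing my XML into a pandas DataFrame, but I lose records along the way. Not all records have all attributes. When an attribute is missing, I notice that the record (a row in the DataFrame) is dropped from the DataFrame instead of the missing value being replaced with None.
Is there a way to mitigate this? I can't seem to find a solution.
I have pasted my code below for reference: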
import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('20191125_DMG_PI.xml')
root = tree.getroot()

df_cols = ["status",
           "priref",
           "full_name",
           "achternaam",
           "geboorteplaats",
           "sterfplaats",
           "detail",
           "adres",
           "zip",
           "note",
           "gender"]
rows = []

for record in root:
    for child in record:
        s_priref = ""
        s_priref = child.get('priref')
    for child in record:
        s_name_note = ""
        s_name_note = child.get('name.note')
    for child in record:
        s_surname = ""
        s_surname = child.find('surname')
        for field in child.findall('Address'):
            s_adress = ""
            s_address = field.find('address').text if field is not None else None
        for field in child.findall('Address'):
            s_zip = ""
            s_zip = field.find('address.postal_code').text if field is not None else None
        for field in child.findall('name'):
            s_full_name = ""
            s_full_name = field.find('value').text if field is not None else None
        for field in child.findall('name.status'):
            s_status = ""
            s_status = field.find('value').text if field is not None else None
        for field in child.findall('level_of_detail'):
            s_detail = ""
            s_detail = field.tag + ": " + field.find('value').text if field is not None else None
        for field in child.findall('gender'):
            s_gender = ""
            s_gender = field.find('value').text
        for field in child.findall('birth.place'):
            s_gbp = ""
            s_gbp = field.find('value').text if field is not None else None
        for field in child.findall('death.place'):
            s_pvo = ""
            if len(field.findall('death.place')) == 0:
                s_pvo = "NaN"
            else:
                s_pvo = field.find('value').text if field is not None else None
            rows.append({"status": s_status,
                         "priref": s_priref,
                         "full_name": s_full_name,
                         "achternaam": s_surname,
                         "geboorteplaats": s_gbp,
                         "sterfplaats": s_pvo,
                         "detail": s_detail,
                         "adres": s_address,
                         "zip": s_zip,
                         "note": s_name_note,
                         "gender": s_gender
                         })

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)
The first three records look like this:
<recordList><record priref="530000001" creation="2014-06-23T11:36:18" modification="2019-09-13T09:07:12">
<name>
<value lang="">C.I.A.P.</value>
</name>
<name.type>
<value lang="neutral">ACQUISITIONSOURCE</value>
<value lang="0">acquisition source</value>
<value lang="1">verwervingsbron</value>
<value lang="2">source d'acquisition</value>
<value lang="3">Erwerbungsquelle</value>
<value lang="5">fonte di acquisizione</value>
<value lang="6">πηγή απόκτησης</value>
</name.type>
<name.type>
<value lang="neutral">INST</value>
<value lang="0">institution</value>
<value lang="1">instelling</value>
<value lang="2">institution</value>
<value lang="3">Institution</value>
<value lang="4">المؤسسة</value>
<value lang="5">istituto</value>
<value lang="6">οργανισμός</value>
</name.type>
<name.status>
<value lang="neutral">1</value>
<value lang="0">approved preferred term</value>
<value lang="1">descriptor</value>
<value lang="2">descripteur</value>
<value lang="3">Deskriptor</value>
<value lang="5">termine preferenziale approvato</value>
</name.status>
<Address>
<address>Lombaardstraat 23</address>
<address.country>
<value lang="">België</value>
</address.country>
<address.place>
<value lang="">Hasselt</value>
</address.place>
<address.postal_code>3500</address.postal_code>
<address.type />
</Address>
<level_of_detail>
<value lang="neutral">PARTIAL</value>
<value lang="0">partial</value>
<value lang="1">partieel</value>
<value lang="2">partiel</value>
<value lang="3">partiell</value>
<value lang="5">parziale</value>
</level_of_detail>
<birth.place>
<value lang="">Hasselt</value>
</birth.place>
<id_number>53</id_number>
<supplier.letter.processing>
<value lang="neutral">PRINT</value>
<value lang="0">Print to documents</value>
<value lang="1">Afdrukken naar documenten</value>
<value lang="2">Imprimer en documents</value>
<value lang="3">Ausdruck in Dokumenten</value>
<value lang="5">Stampa nei documenti</value>
</supplier.letter.processing>
<name.note>Centrum voor Informatie en Aktueel Prentenkabinet</name.note>
<Place_activity>
<place_activity.institution />
<place_activity.type />
<place_activity>
<value lang="">Hasselt</value>
</place_activity>
<place_activity.notes />
<place_activity.date.end />
<place_activity.date.start />
</Place_activity>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-13</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>09:07:12</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>13:15:16</edit.time>
</Edit>
</record><record priref="530000003" creation="2014-06-23T11:36:18" modification="2019-09-13T09:02:51">
<name>
<value lang="">Goossens, K.</value>
</name>
<name.type>
<value lang="neutral">ACQUISITIONSOURCE</value>
<value lang="0">acquisition source</value>
<value lang="1">verwervingsbron</value>
<value lang="2">source d'acquisition</value>
<value lang="3">Erwerbungsquelle</value>
<value lang="5">fonte di acquisizione</value>
<value lang="6">πηγή απόκτησης</value>
</name.type>
<name.type>
<value lang="neutral">PERSON</value>
<value lang="0">person</value>
<value lang="1">persoon</value>
<value lang="2">personne</value>
<value lang="3">Person</value>
<value lang="4">إسم شخص</value>
<value lang="5">persona</value>
<value lang="6">πρόσωπο</value>
</name.type>
<name.status>
<value lang="neutral">1</value>
<value lang="0">approved preferred term</value>
<value lang="1">descriptor</value>
<value lang="2">descripteur</value>
<value lang="3">Deskriptor</value>
<value lang="5">termine preferenziale approvato</value>
</name.status>
<surname>Goossens</surname>
<Address>
<address>Morckhovelei</address>
<address.country>
<value lang="">België</value>
</address.country>
<address.place>
<value lang="">Borgerhout</value>
</address.place>
<address.postal_code />
<address.type />
</Address>
<nationality>
<value lang="">Belgisch</value>
</nationality>
<level_of_detail>
<value lang="neutral">PARTIAL</value>
<value lang="0">partial</value>
<value lang="1">partieel</value>
<value lang="2">partiel</value>
<value lang="3">partiell</value>
<value lang="5">parziale</value>
</level_of_detail>
<forename>K.</forename>
<gender>
<value lang="neutral">FEMALE</value>
<value lang="0">female</value>
<value lang="1">vrouw</value>
<value lang="2">femme</value>
<value lang="3">weiblich</value>
<value lang="5">femmina</value>
<value lang="6">θηλυκό</value>
</gender>
<id_number>53</id_number>
<supplier.letter.processing>
<value lang="neutral">PRINT</value>
<value lang="0">Print to documents</value>
<value lang="1">Afdrukken naar documenten</value>
<value lang="2">Imprimer en documents</value>
<value lang="3">Ausdruck in Dokumenten</value>
<value lang="5">Stampa nei documenti</value>
</supplier.letter.processing>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-13</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>09:02:51</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>13:21:05</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>13:20:03</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>13:19:45</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-09-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>13:19:16</edit.time>
</Edit>
</record><record priref="530000004" creation="2014-06-23T11:36:18" modification="2019-07-19T09:55:26">
<name>
<value lang="">De Bruyne, Pieter</value>
</name>
<name.type>
<value lang="neutral">MAKER</value>
<value lang="0">creator</value>
<value lang="1">vervaardiger</value>
<value lang="2">créateur</value>
<value lang="3">Hersteller</value>
<value lang="4">الصانع</value>
<value lang="5">creatore</value>
<value lang="6">δημιουργός</value>
</name.type>
<name.type>
<value lang="neutral">ACQUISITIONSOURCE</value>
<value lang="0">acquisition source</value>
<value lang="1">verwervingsbron</value>
<value lang="2">source d'acquisition</value>
<value lang="3">Erwerbungsquelle</value>
<value lang="5">fonte di acquisizione</value>
<value lang="6">πηγή απόκτησης</value>
</name.type>
<name.type>
<value lang="neutral">PERSON</value>
<value lang="0">person</value>
<value lang="1">persoon</value>
<value lang="2">personne</value>
<value lang="3">Person</value>
<value lang="4">إسم شخص</value>
<value lang="5">persona</value>
<value lang="6">πρόσωπο</value>
</name.type>
<name.type>
<value lang="neutral">AUTHOR</value>
<value lang="0">author</value>
<value lang="1">auteur</value>
<value lang="2">auteur</value>
<value lang="3">Verfasser</value>
<value lang="4">المؤلف</value>
<value lang="5">autore</value>
<value lang="6">συντάκτης</value>
</name.type>
<birth.date.start>1931</birth.date.start>
<death.date.start>1987</death.date.start>
<name.status>
<value lang="neutral">1</value>
<value lang="0">approved preferred term</value>
<value lang="1">descriptor</value>
<value lang="2">descripteur</value>
<value lang="3">Deskriptor</value>
<value lang="5">termine preferenziale approvato</value>
</name.status>
<surname>De Bruyne</surname>
<Address>
<address>Stationstraat 16</address>
<address.country>
<value lang="">België</value>
</address.country>
<address.place>
<value lang="">Aalst</value>
</address.place>
<address.postal_code>9300</address.postal_code>
<address.type>woning Pieter De Bruyne</address.type>
</Address>
<biography>Pieter De Bruyne is als pionier binnen het postmodern ontwerpen een internationaal geapprecieerde meubelontwerper. Hij wijdde zijn hele leven aan de vernieuwing van het meubilair. De Bruynes werk sluit aan bij de Memphis-stijl, hoewel hij nooit actief deel wilde uitmaken van dergelijke bewegingen. Elk meubel van zijn hand opent nieuwe perspectieven en is stimulans om andere denkrichtingen in te slaan.
Bibliotheek Design museum Gent:
(1) Pieter De Bruyne 1931- 1987. Pionier van het postmoderne. / Christian Kieckens, Eva Storgaard
(2) 25 jaar Pieter De Bruyne. / Christian Norberg-Schulz</biography>
<Source>
<source>http://vocab.getty.edu/page/ulan/</source>
<source.number>500009402</source.number>
</Source>
<Source>
<source>https://www.wikidata.org/wiki/</source>
<source.number>Q14101030</source.number>
</Source>
<death.date.end>1987</death.date.end>
<death.place>
<value lang="">Aalst</value>
</death.place>
<nationality>
<value lang="">Belgisch</value>
</nationality>
<level_of_detail>
<value lang="neutral">FULL</value>
<value lang="0">full</value>
<value lang="1">volledig</value>
<value lang="2">complet</value>
<value lang="3">vollständig</value>
<value lang="5">completo</value>
</level_of_detail>
<forename>Pieter</forename>
<birth.date.end>1931</birth.date.end>
<birth.place>
<value lang="">Aalst</value>
</birth.place>
<gender>
<value lang="neutral">MALE</value>
<value lang="0">male</value>
<value lang="1">man</value>
<value lang="2">homme</value>
<value lang="3">männlich</value>
<value lang="5">maschio</value>
<value lang="6">αρσενικό</value>
</gender>
<occupation>
<value lang="">ontwerper</value>
</occupation>
<Part_of>
<part_of>
<value lang="">Pieter De Bruyne N.V.</value>
</part_of>
<part_of.notes />
<part_of.category />
<part_of.date.end />
<part_of.date.start />
</Part_of>
<Equivalent>
<equivalent_name>
<value lang="">Pieter De Bruyne N.V.</value>
</equivalent_name>
<equivalent_name.category />
</Equivalent>
<id_number>53</id_number>
<supplier.letter.processing>
<value lang="neutral">PRINT</value>
<value lang="0">Print to documents</value>
<value lang="1">Afdrukken naar documenten</value>
<value lang="2">Imprimer en documents</value>
<value lang="3">Ausdruck in Dokumenten</value>
<value lang="5">Stampa nei documenti</value>
</supplier.letter.processing>
<school_style>
<value lang="">post-modernisme</value>
</school_style>
<language>
<value lang="">Nederlands</value>
</language>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-07-19</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>09:55:26</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-07-19</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>09:55:24</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-07-17</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>11:24:24</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-06-18</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>11:54:47</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-06-12</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>11:44:02</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-05-28</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>08:20:09</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-05-27</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>10:44:41</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-05-13</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>14:24:58</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-05-13</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>14:23:25</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>people>people</edit.source>
<edit.date>2019-04-23</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>16:12:25</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>thesau>thesau</edit.source>
<edit.date>2019-04-18</edit.date>
<edit.name>ovandhuynslager</edit.name>
<edit.time>15:19:53</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>COLLECT>intern</edit.source>
<edit.date>2016-09-26</edit.date>
<edit.name>rgoris</edit.name>
<edit.time>10:58:19</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>COLLECT>intern</edit.source>
<edit.date>2016-09-26</edit.date>
<edit.name>rgoris</edit.name>
<edit.time>10:57:40</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>COLLECT>intern</edit.source>
<edit.date>2016-09-26</edit.date>
<edit.name>rgoris</edit.name>
<edit.time>10:50:49</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>COLLECT>intern</edit.source>
<edit.date>2016-09-26</edit.date>
<edit.name>rgoris</edit.name>
<edit.time>10:21:40</edit.time>
</Edit>
<Edit>
<edit.notes />
<edit.source>COLLECT>intern</edit.source>
<edit.date>2016-09-26</edit.date>
<edit.name>rgoris</edit.name>
<edit.time>10:20:30</edit.time>
</Edit>
Solution
You can greatly simplify the XML-handling part of your code by switching to XPath as the way to locate any given node. Consider this:
import xml.etree.ElementTree as et

def node_text(node, default=''):
    return node.text if node is not None and node.text is not None else default

tree = et.parse('20191125_DMG_PI.xml')
rows = []

for record in tree.iterfind('./record'):
    rows.append({
        'status': node_text(record.find('./name.status/value')),
        'priref': record.get('priref'),
        'full_name': node_text(record.find('./name/value')),
        'achternaam': node_text(record.find('./surname')),
        'geboorteplaats': node_text(record.find('./birth.place/value')),
        'sterfplaats': node_text(record.find('./death.place/value')),
        'detail': node_text(record.find('./level_of_detail/value[@lang="neutral"]')),
        'adres': node_text(record.find('./Address/address')),
        'zip': node_text(record.find('./Address/address.postal_code')),
        'note': node_text(record.find('./name.note')),
        'gender': node_text(record.find('./gender/value'))
    })

print(rows)
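The node_text() helper function at the top takes care of the "node not found" case. The default here is the empty string; if you prefer None, you can use that as the default instead, or pass an individual default for each value.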
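As a small illustration of per-value defaults, the sketch below runs node_text() against a hypothetical, trimmed-down record modelled on the sample data in the question. The record contents and the 'unknown' placeholder are assumptions made for the example, not part of the original file.

```python
import xml.etree.ElementTree as et

def node_text(node, default=''):
    # Return the node's text, or the default when the node or its text is missing.
    return node.text if node is not None and node.text is not None else default

# Hypothetical record: it has a surname, an empty postal code and no death.place.
record = et.fromstring(
    '<record priref="530000003">'
    '<surname>Goossens</surname>'
    '<Address><address>Morckhovelei</address><address.postal_code /></Address>'
    '</record>'
)

# Node exists and has text, so the text is returned.
surname = node_text(record.find('./surname'))
# Node exists but is empty, so the per-call default (None) is returned.
zip_code = node_text(record.find('./Address/address.postal_code'), default=None)
# Node does not exist at all, so the per-call default ('unknown') is returned.
death_place = node_text(record.find('./death.place/value'), default='unknown')

print(surname, zip_code, death_place)
```

Because every lookup goes through the helper, a missing or empty element yields a placeholder value rather than an exception or a skipped row.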
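XPath expressions in ElementTree must be relative to the element you call them on (written here with a leading ./), and only a subset of XPath 1.0 is supported, but that is more than enough for your use case.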
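To make the supported subset a little more concrete, here is a minimal sketch using a hypothetical record fragment (an assumption, shortened from the sample data). It shows a relative path with an attribute predicate, what find() and findall() return when nothing matches, and that an absolute path on an element is rejected.

```python
import xml.etree.ElementTree as et

# Hypothetical fragment modelled on one record from the question.
record = et.fromstring(
    '<record priref="530000004">'
    '<level_of_detail>'
    '<value lang="neutral">FULL</value>'
    '<value lang="0">full</value>'
    '<value lang="1">volledig</value>'
    '</level_of_detail>'
    '</record>'
)

# A relative path with an attribute predicate is supported.
neutral = record.find('./level_of_detail/value[@lang="neutral"]')

# findall() returns every match in document order.
all_values = [v.text for v in record.findall('./level_of_detail/value')]

# A path that matches nothing gives None from find() and [] from findall().
missing = record.find('./death.place/value')
missing_all = record.findall('./death.place/value')

# An absolute path on an element is not allowed and raises SyntaxError.
try:
    record.find('/level_of_detail')
    absolute_ok = True
except SyntaxError:
    absolute_ok = False

print(neutral.text, all_values, missing, missing_all, absolute_ok)
```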
Getting rows into a DataFrame afterwards should no longer be a problem.
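For completeness, that last step might look like the sketch below. It assumes pandas is installed and uses a hypothetical two-record document (shortened from the sample data) with only four of the columns; the first record has no death.place. None is used as the default here, an assumption made so that the gap shows up as a missing value in the DataFrame.

```python
import xml.etree.ElementTree as et
import pandas as pd

def node_text(node, default=None):
    # Return the node's text, or the default when the node or its text is missing.
    return node.text if node is not None and node.text is not None else default

# Hypothetical input: the first record lacks death.place, the second has it.
xml_data = """
<recordList>
  <record priref="530000003">
    <name><value lang="">Goossens, K.</value></name>
    <surname>Goossens</surname>
  </record>
  <record priref="530000004">
    <name><value lang="">De Bruyne, Pieter</value></name>
    <surname>De Bruyne</surname>
    <death.place><value lang="">Aalst</value></death.place>
  </record>
</recordList>
""".strip()

root = et.fromstring(xml_data)
df_cols = ['priref', 'full_name', 'achternaam', 'sterfplaats']

# One dict per record, so every record produces exactly one row.
rows = [
    {
        'priref': record.get('priref'),
        'full_name': node_text(record.find('./name/value')),
        'achternaam': node_text(record.find('./surname')),
        'sterfplaats': node_text(record.find('./death.place/value')),
    }
    for record in root.iterfind('./record')
]

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)
```

Both records end up in the DataFrame; the one without death.place simply has a missing value in the sterfplaats column.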