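python - Find and replace XML data inside tags using Python and ElementTree
Problem description
First of all, I'm very new to Python and know very little. However, I've been tasked with building this program, so I'd appreciate your help.
I need to anonymise the data in an XML file. This will involve changing the contents of several tags to NULL.
I'm starting by trying to replace the DateOfBirth data using Python with ElementTree. I need the contents of the DateOfBirth tag replaced with NULL.
Here is a snippet of the XML file containing MOCK data for one learner. It includes 1 learner; there would normally be 1-1000 learners, and the values need to be changed to NULL for every one of them.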
<?xml version="1.0" encoding="UTF-8"?>
<!-- Please note that this file is properly formed, and serves as an example of a file that will load into the ILR DC system. The data is anonymised and does not refer to a real-world provider, learning delivery or learner. Based on the ILR specification, version 2, dated April 2018-->
<Message xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="ESFA/ILR/2018-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ESFA/ILR/2018-19">
<Header>
<CollectionDetails>
<Collection>ILR</Collection>
<Year>1819</Year>
<FilePreparationDate>2018-01-07</FilePreparationDate>
</CollectionDetails>
<Source>
<ProtectiveMarking>OFFICIAL-SENSITIVE-Personal</ProtectiveMarking>
<UKPRN>99999999</UKPRN>
<SoftwareSupplier>SupplierName</SoftwareSupplier>
<SoftwarePackage>SystemName</SoftwarePackage>
<Release>1</Release>
<SerialNo>01</SerialNo>
<DateTime>2018-06-26T11:14:05</DateTime>
<!-- This and the next element only appear in files generated by FIS -->
<ReferenceData>Version5.0, LARS 2017-08-01</ReferenceData>
<ComponentSetVersion>1</ComponentSetVersion>
</Source>
</Header>
<SourceFiles>
<!-- The SourceFiles group only appears in files generated by FIS -->
<SourceFile>
<SourceFileName>ILR-LLLLLLLL1819-20180626-144401-01.xml</SourceFileName>
<FilePreparationDate>2018-06-26</FilePreparationDate>
<SoftwareSupplier>Software Systems Inc.</SoftwareSupplier>
<SoftwarePackage>GreatStuffMIS</SoftwarePackage>
<Release>1</Release>
<SerialNo>01</SerialNo>
<DateTime>2018-06-26T11:14:05</DateTime>
</SourceFile>
</SourceFiles>
<LearningProvider>
<UKPRN>99999999</UKPRN>
</LearningProvider>
<!-- 16 yr old learner undertaking full time 16-19 (excluding apprenticeships) funded programme -->
<Learner>
<LearnRefNumber>16Learner</LearnRefNumber>
<PMUKPRN>87654321</PMUKPRN>
<CampId>1234ABCD</CampId>
<ULN>1061484016</ULN>
<FamilyName>Smith</FamilyName>
<GivenNames>Jane</GivenNames>
<DateOfBirth>1999-02-27</DateOfBirth>
<Ethnicity>31</Ethnicity>
<Sex>F</Sex>
<LLDDHealthProb>2</LLDDHealthProb>
<Accom>5</Accom>
<PlanLearnHours>440</PlanLearnHours>
<PlanEEPHours>100</PlanEEPHours>
<MathGrade>NONE</MathGrade>
<EngGrade>D</EngGrade>
<PostcodePrior>BR1 7SS</PostcodePrior>
<Postcode>BR1 7SS</Postcode>
<AddLine1>The Street</AddLine1>
<AddLine2>ToyTown</AddLine2>
<LearnerFAM>
<LearnFAMType>LSR</LearnFAMType>
<LearnFAMCode>55</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>EDF</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>MCF</LearnFAMType>
<LearnFAMCode>3</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>FME</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>PPE</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
<!-- Employment status record is not required for full time 16-19 (excluding apprenticeships) funded learners -->
<!-- 16-19 (excluding apprenticeships) funded study programme -->
<LearningDelivery>
<LearnAimRef>50022246</LearnAimRef>
<AimType>5</AimType>
<AimSeqNumber>1</AimSeqNumber>
<LearnStartDate>2015-09-14</LearnStartDate>
<LearnPlanEndDate>2016-07-02</LearnPlanEndDate>
<FundModel>25</FundModel>
<DelLocPostCode>BR1 3RL</DelLocPostCode>
<CompStatus>1</CompStatus>
<SWSupAimId>cb5f0d25-cff4-4ea0-92f5-99378cce306d</SWSupAimId>
<LearningDeliveryFAM>
<LearnDelFAMType>SOF</LearnDelFAMType>
<LearnDelFAMCode>107</LearnDelFAMCode>
</LearningDeliveryFAM>
</LearningDelivery>
<LearningDelivery>
<LearnAimRef>50023408</LearnAimRef>
<AimType>4</AimType>
<AimSeqNumber>2</AimSeqNumber>
<LearnStartDate>2015-02-14</LearnStartDate>
<LearnPlanEndDate>2016-07-15</LearnPlanEndDate>
<FundModel>25</FundModel>
<DelLocPostCode>BR2 7UP</DelLocPostCode>
<CompStatus>3</CompStatus>
<LearnActEndDate>2015-04-01</LearnActEndDate>
<WithdrawReason>98</WithdrawReason>
<Outcome>3</Outcome>
<SWSupAimId>c243182a-30af-4879-8f68-3eac708e6bb3</SWSupAimId>
<LearningDeliveryFAM>
<LearnDelFAMType>SOF</LearnDelFAMType>
<LearnDelFAMCode>107</LearnDelFAMCode>
</LearningDeliveryFAM>
</LearningDelivery>
</Learner>
My current code:
import os
from xml.etree import ElementTree as et
base_path = os.path.dirname(os.path.realpath(__file__))
xml_file = os.path.join(base_path, "ILR_mock_data.xml")
tree = et.parse(xml_file)
# root = tree.getroot()
# for child in root:
# print(child.tag, child.attrib)
#for child in root:
# for element in child:
# print(element.tag, ":", element.text)
tree.find('Learner/DateOfBirth').text = 'NULL'
tree.write("ILR_Aoned_output.xml")
Error:
Traceback (most recent call last):
File "C:/Users/jkay/Desktop/Anon Tool RCU/RCU MOCK TOOL (Anonamising).py", line 20, in <module>
tree.find('Learner/DateOfBirth').text = 'NULL'
AttributeError: 'NoneType' object has no attribute 'text'
I want the program to go through the XML file and produce a new file in which every date of birth is replaced with NULL.
Thanks for your help.
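Solution
Beautiful Soup looks like the solution you're after here. It is a library built specifically for parsing HTML and XML files (though you may also need to install a parser, such as lxml, which the 'xml' argument below relies on).
Applied to your use case: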
from bs4 import BeautifulSoup
with open("my_file.xml", "r") as infile:
    xml_text = infile.read()
soup = BeautifulSoup(xml_text, 'xml')
# replace all DateOfBirth tag contents with NULL
for dob_tag in soup.find_all("DateOfBirth"):
    dob_tag.string = "NULL"
# output and save modified file
with open("my_file_edited.xml", "w") as outfile:
    outfile.write(soup.prettify())
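As a bonus, you can also adapt the library to replace other tags easily, or to make more complex or conditional modifications. The library is well documented.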
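If you would rather stay with ElementTree, as in the question: the AttributeError comes from the default namespace declared on the root element (xmlns="ESFA/ILR/2018-19"). ElementTree stores every tag as "{uri}LocalName", so the plain path 'Learner/DateOfBirth' matches nothing and find() returns None. Below is a minimal sketch of a namespace-aware version. The function name anonymise, the TAGS_TO_NULL selection, and the small SAMPLE document are assumptions made for illustration, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns="..." attribute on <Message>.
ILR_NS = "ESFA/ILR/2018-19"

# Hypothetical selection of tags to blank out; adjust to your own rules.
TAGS_TO_NULL = ("DateOfBirth", "FamilyName", "GivenNames")


def anonymise(root, tags=TAGS_TO_NULL, placeholder="NULL"):
    """Set the text of every matching element to placeholder; return the count."""
    changed = 0
    for tag in tags:
        # Elements in a default namespace are stored as "{uri}LocalName".
        for element in root.iter("{%s}%s" % (ILR_NS, tag)):
            element.text = placeholder
            changed += 1
    return changed


# Tiny stand-in for the real file so the sketch runs on its own.
SAMPLE = """<Message xmlns="ESFA/ILR/2018-19">
  <Learner>
    <FamilyName>Smith</FamilyName>
    <DateOfBirth>1999-02-27</DateOfBirth>
    <Sex>F</Sex>
  </Learner>
</Message>"""

root = ET.fromstring(SAMPLE)

# The un-namespaced path from the question matches nothing, hence the None.
assert root.find("Learner/DateOfBirth") is None

replaced = anonymise(root)
print(replaced, "values replaced")

# Keep the default namespace unprefixed when serialising.
ET.register_namespace("", ILR_NS)
print(ET.tostring(root, encoding="unicode"))
```

For the real file, you would load it with ET.parse("ILR_mock_data.xml"), pass tree.getroot() to the function, and save with tree.write("ILR_Aoned_output.xml", encoding="UTF-8", xml_declaration=True). Note that ElementTree's default parser discards XML comments, so the comments in the source file would not appear in the output.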