python - 使用 Python 将 CSV 文件转换为 XML
问题描述
我正在尝试设置一个脚本来从 CSV 文件中提取数据以输出到 XML 文件。我使用此链接中的信息(使用 PYTHON 将 CSV 文件转换为 XML 文件)来创建脚本。它部分有效,但我需要更多指导来获得确切的布局。
我需要的布局是这样的:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ImportAcademicExtract xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<AcademicExtract>
<StudentId>StudentID</StudentId>
<LastName>LastName</LastName>
<FirstName>FirstName</FirstName>
<MiddleName>MiddleName</MiddleName>
<SocialSecurityNumber>SocialSecurityNumber</SocialSecurityNumber>
<BirthDate>BirthDate</BirthDate>
<GradeLevel>GradeLevel</GradeLevel>
<SpecialProgramIndicator>SpecialProgramIndicator</SpecialProgramIndicator>
<CIPCode>CIPCode</CIPCode>
<RegisteredHours>RegisteredHours</RegisteredHours>
<PostalAddresses>
<PostalAddress>
<AddressLine1>Address</AddressLine1>
<AddressLine2>Address2ndLn</AddressLine2>
<City>City</City>
<State>State</State>
<PostalCode>PostalCode</PostalCode>
<CountryCode>Country</CountryCode>
</PostalAddress>
</PostalAddresses>
<EmailAddresses>
<EmailAddress>Email</EmailAddress>
</EmailAddresses>
<PhoneNumbers>
<PhoneNumber>PhoneNumber</PhoneNumber>
</PhoneNumbers>
<AdmissionsTerm>
<AdmissionTerm>
<TermName>TermName</TermName>
<AcademicYear>AcademicYear</AcademicYear>
<AdmittedDate>AdmittedDate</AdmittedDate>
</AdmissionTerm>
</AdmissionsTerm>
<EnrollmentTerms>
<EnrollmentTerm>
<TermName>TermNames</TermName>
<AcademicYear>AcademicYears</AcademicYear>
<CumulativeAttemptedHours>CumulativeAttemptedHours</CumulativeAttemptedHours>
<CumulativeRegisteredHours>CumulativeRegisteredHours</CumulativeRegisteredHours>
<CumulativeEarnedHours>CumulativeEarnedHours</CumulativeEarnedHours>
<CumulativeGPA>CumulativeGPA</CumulativeGPA>
<TermStartDate>TermStartDate</TermStartDate>
<TermEndDate>TermEndDate</TermEndDate>
<EnrollmentType>EnrollmentType</EnrollmentType>
<EnrollmentStatus>EnrollmentStatus</EnrollmentStatus>
<FirstTimeDegreeSeeking>FirstTimeDegreeSeeking</FirstTimeDegreeSeeking>
<IntentToReturn>IntentToReturn</IntentToReturn>
<WithdrawnDate>WithdrawnDate</WithdrawnDate>
<ExtTermName>ExtTermName</ExtTermName>
<AcademicYearStartDate>AcademicYearStartDate</AcademicYearStartDate>
<AcademicYearEndDate>AcademicYearEndDate</AcademicYearEndDate>
<ExtEnrollmentType>ExtEnrollmentType</ExtEnrollmentType>
<DateOfDetermination>DateOfDetermination</DateOfDetermination>
<LastDateOfAttendance>LastDateOfAttendance</LastDateOfAttendance>
<SAPStatus>SAPStatus</SAPStatus>
</EnrollmentTerm>
</EnrollmentTerms>
<AcademicPrograms>
<AcademicProgram>
<ProgramCredentialLevel>ProgramCredentialLevel</ProgramCredentialLevel>
<ProgramName>ProgramName</ProgramName>
<EdMajor1>EdMajor1</EdMajor1>
<SiteName>SiteName</SiteName>
<EffectiveStartDate>EffectiveStartDate</EffectiveStartDate>
<EffectiveEndDate>EffectiveEndDate</EffectiveEndDate>
<GraduationDate>GraduationDate</GraduationDate>
<AnticipatedGraduationDate>AnticipatedGraduationDate</AnticipatedGraduationDate>
<ProgramLengthInWeeks>ProgramLengthInWeeks</ProgramLengthInWeeks>
<ProgramLengthInMonths>ProgramLengthInMonths</ProgramLengthInMonths>
<ProgramLengthInYears>ProgramLengthInYears</ProgramLengthInYears>
<AcademicYearBeginDate>AcademicYearBeginDate2</AcademicYearBeginDate>
<AcademicYearEndDate>AcademicYearEndDate2</AcademicYearEndDate>
<WeeksInProgramAcademicYear>WeeksInProgramAcademicYear</WeeksInProgramAcademicYear>
</AcademicProgram>
</AcademicPrograms>
</AcademicExtract>
我正在使用以下脚本:
import itertools
import csv
import os
csvFile = r'J:\JFAFiles\JFA-BR.csv'
xmlFile = r'J:\JFAFiles\XML-BR.xml'
csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')
xmlData.write('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + "\n" +'<ImportAcademicExtract xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">' + "\n" )
rowNum = 0
for row in csvData:
if rowNum == 0:
tags = row
# replace spaces w/ underscores in tag names
for i in range(len(tags)):
tags[i] = tags[i].replace(' ', '_')
else:
xmlData.write(' '+'<AcademicExtract>' +"\n")
for i in range (len(tags)):
xmlData.write(' ' +'<' + tags[i] + '>' \
+ row[i] + '</' + tags[i] + '>' + "\n")
xmlData.write(' '+'</AcademicExtract>' + "\n")
rowNum +=1
xmlData.write('</ImportAcademicExtract>' + "\n")
xmlData.close()
当我运行脚本时,我无法弄清楚如何引入所看到的插入区域(如邮政地址区域、电子邮件地址、电话号码、录取条款等)。一切都保持一致。
运行脚本的输出如下。下划线是从我因敏感信息而删除的文件中输入文本的位置。文本正确地出现了,我只是不知道如何让那些其他区域插入更多。
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ImportAcademicExtract xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<AcademicExtract>
<StudentID>____</StudentID>
<LastName>____</LastName>
<FirstName>____</FirstName>
<MiddleName>____</MiddleName>
<SocialSecurityNumber>____</SocialSecurityNumber>
<BirthDate>____</BirthDate>
<GradeLevel>____</GradeLevel>
<SpecialProgramIndicator>____</SpecialProgramIndicator>
<CIPCode>____</CIPCode>
<RegisteredHours>____</RegisteredHours>
<Address>____</Address>
<Address2ndLn>____</Address2ndLn>
<City>____</City>
<State>____</State>
<PostalCode>____</PostalCode>
<Country>____</Country>
<Email>____</Email>
<PhoneNumber>____</PhoneNumber>
<TermName>____</TermName>
<AcademicYear>____</AcademicYear>
<AdmittedDate>____</AdmittedDate>
<TermName2>____</TermName2>
<AcademicYear2>____</AcademicYear2>
<CumulativeAttemptedHours>____</CumulativeAttemptedHours>
<CumulativeRegisteredHours>____</CumulativeRegisteredHours>
<CumulativeEarnedHours>____</CumulativeEarnedHours>
<CumulativeGPA>____</CumulativeGPA>
<TermStartDate>____</TermStartDate>
<TermEndDate>____</TermEndDate>
<EnrollmentType>____</EnrollmentType>
<EnrollmentStatus>____</EnrollmentStatus>
<FirstTimeDegreeSeeking>____</FirstTimeDegreeSeeking>
<IntentToReturn>____</IntentToReturn>
<WithdrawnDate>____</WithdrawnDate>
<ExtTermName>____</ExtTermName>
<AcademicYearStartDate>____</AcademicYearStartDate>
<AcademicYearEndDate>____</AcademicYearEndDate>
<ExtEnrollmentType>____</ExtEnrollmentType>
<DateOfDetermination>____</DateOfDetermination>
<LastDateOfAttendance>____</LastDateOfAttendance>
<SAPStatus>____</SAPStatus>
<ProgramCredentialLevel>____</ProgramCredentialLevel>
<ProgramName>____</ProgramName>
<EdMajor1>____</EdMajor1>
<SiteName>____</SiteName>
<EffectiveStartDate>____</EffectiveStartDate>
<EffectiveEndDate>____</EffectiveEndDate>
<GraduationDateDate>____</GraduationDateDate>
<AnticipatedGraduationDate>____</AnticipatedGraduationDate>
<ProgramLengthInWeeks>____</ProgramLengthInWeeks>
<ProgramLengthInMonths>____</ProgramLengthInMonths>
<ProgramLengthInYears>____</ProgramLengthInYears>
<AcademicYearBeginDate2>____</AcademicYearBeginDate2>
<AcademicYearEndDate2>____</AcademicYearEndDate2>
<WeeksInProgramAcademicYear>____</WeeksInProgramAcademicYear>
</AcademicExtract>
</ImportAcademicExtract>
我尝试将此添加到脚本中,但它不起作用:
rowNum = 0
for row in csvData:
if rowNum == 0:
tags = row
# replace spaces w/ underscores in tag names
for i in range(len(tags)):
tags[i] = tags[i].replace(' ', '_')
else:
xmlData.write(' '+'<AcademicExtract>' +"\n")
for i in range (len(tags)):
xmlData.write(' ' +'<' + tags[i] + '>' \
+ row[i] + '</' + tags[i] + '>' + "\n")
if rowNum == 11:
tags = row
# replace spaces w/ underscores in tag names
for i in range(len(tags)):
tags[i] = tags[i].replace(' ', '_')
else:
xmlData.write(' '+'<PostalAddresses>' +"\n"+' '+'<PostalAddress>' +"\n")
for i in range (len(tags)):
xmlData.write(' ' +'<' + tags[i] + '>' \
+ row[i] + '</' + tags[i] + '>' + "\n")
对于如何编辑编码以使布局正确,我们将不胜感激。
解决方案
我认为熊猫数据框更容易做到这一点,您可以使用以下脚本,它可能不是很有效。
您需要做的就是更改 xml_map 以定义您想要的输出。
将此 python 脚本保存在名为 csv2xml.py 的文件中
#!/usr/bin/python3
import pandas as pd, argparse, os
xml_map = {"AcademicExtract" : ["StudentID", "LastName", "FirstName", "MiddleName", "SocialSecurityNumber",
"BirthDate", "GradeLevel", "SpecialProgramIndicator", "CIPCode", "RegisteredHours",
{"PostalAddresses" : { "PostalAddress" : [ "Address", "Address2ndLn", "City", "State", "PostalCode", "Country" ] } },
{"EmailAddresses" : [ "Email" ] },
{"PhoneNumbers" : ["PhoneNumber"] },
{"AdmissionsTerm" : { "AdmissionTerm" : [ "TermName", "AcademicYear", "AdmittedDate" ] } },
{"EnrollmentTerms" : { "EnrollmentTerm" : [ "TermName2", "AcademicYear2", "CumulativeAttemptedHours",
"CumulativeRegisteredHours", "CumulativeEarnedHours", "CumulativeGPA",
"TermStartDate", "TermEndDate", "EnrollmentType", "EnrollmentStatus",
"FirstTimeDegreeSeeking", "IntentToReturn", "WithdrawnDate", "ExtTermName",
"AcademicYearStartDate", "AcademicYearEndDate", "ExtEnrollmentType", "DateOfDetermination",
"LastDateOfAttendance", "SAPStatus" ] } },
{"AcademicPrograms" : { "AcademicProgram" : [ "ProgramCredentialLevel", "ProgramName", "EdMajor1", "SiteName",
"EffectiveStartDate", "EffectiveEndDate", "GraduationDateDate", "AnticipatedGraduationDate",
"ProgramLengthInWeeks", "ProgramLengthInMonths", "ProgramLengthInYears", "AcademicYearBeginDate2",
"AcademicYearEndDate2", "WeeksInProgramAcademicYear" ] } } ] }
# this name map is used to rename terms in the XML output
# the map is of the format "oldname":"newname", "oldname2":"newname2"
name_map = {"TermName2" : "TermName", "AcademicYear2" : "AcademicYear"}
parser = argparse.ArgumentParser(description='convert csv 2 xml')
parser.add_argument('input_file', help='filename to convert')
parser.add_argument('output_file', help='output filename')
args = parser.parse_args()
if not os.path.isfile(args.input_file): raise ValueError('File does not exist: ' + args.input_file)
outfile = open(args.output_file,"w")
def printXML(data, xmlm, level):
if isinstance(xmlm,dict):
for k in xmlm:
node_name = name_map.get(k,k)
print(" "*level*4,f"<{node_name}>",file=outfile)
printXML(data,xmlm[k],level+1)
print(" "*level*4,f"</{node_name}>",file=outfile)
elif isinstance(xmlm,list):
for i in xmlm: printXML(data,i,level+1)
elif data[xmlm].strip():
node_name = name_map.get(xmlm,xmlm)
print(" "*level*4,f"<{node_name}>",end="",file=outfile)
print(data[xmlm],end="",file=outfile)
print(f"</{node_name}>",file=outfile)
df = pd.read_csv(args.input_file,dtype=str,keep_default_na=False,header=0)
print('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',file=outfile)
print('<ImportAcademicExtract xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">',file=outfile)
df.apply(lambda x: printXML(x, xml_map, 1), axis=1)
print("</ImportAcademicExtract>",file=outfile)
outfile.close()
对于 xml_map 变量,我从第二个示例输出中提取了元素列表,因为这看起来像文件中的实际列名。
xml_map 数组中的列名,方括号 [ ... ] 之间的任何内容都必须与输入文件中的列名匹配
在这个例子中,输入文件是:
StudentID,LastName,FirstName,MiddleName,SocialSecurityNumber,BirthDate,GradeLevel,SpecialProgramIndicator,CIPCode,RegisteredHours,Address,Address2ndLn,City,State,PostalCode,Country,Email,PhoneNumber,TermName,AcademicYear,AdmittedDate,TermName2,AcademicYear2,CumulativeAttemptedHours,CumulativeRegisteredHours,CumulativeEarnedHours,CumulativeGPA,TermStartDate,TermEndDate,EnrollmentType,EnrollmentStatus,FirstTimeDegreeSeeking,IntentToReturn,WithdrawnDate,ExtTermName,AcademicYearStartDate,AcademicYearEndDate,ExtEnrollmentType,DateOfDetermination,LastDateOfAttendance,SAPStatus,ProgramCredentialLevel,ProgramName,EdMajor1,SiteName,EffectiveStartDate,EffectiveEndDate,GraduationDateDate,AnticipatedGraduationDate,ProgramLengthInWeeks,ProgramLengthInMonths,ProgramLengthInYears,AcademicYearBeginDate2,AcademicYearEndDate2,WeeksInProgramAcademicYear
1,Brown,Adam,,12345678,01/01/1970,Masters,SPI1,CIP1,20,this is line 1,this is line 2,NY,NY,PC123,USA,adam@example.com,213415441,Summer,2020,Date1,Autmn,2021,Attempted Hours1,Registered Hours1,Earned Hours 1,GPA1,Date1,Date2,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Baker,Andrew,,43219952,02/01/1970,PHD,SPI2,CIP2,30,this is line 1,this is line 2,NY,NY,PC234,USA,andrew@example.com,132141541,Summer,2020,Date2,Autmn,2021,Attempted Hours2,Registered Hours2,Earned Hours 2,GPA2,Date1,Date2,,,,,,,,,,,,,,,,,,,,,,,,,,
注意:CSV 文件中的第一行必须包含列名
我假设你正在使用 powershell 并且你已经安装了 python。运行脚本如下:
python csv2xml.py "J:\JFAFiles\JFA-BR.csv" "J:\JFAFiles\XML-BR.xml"
注意:在文件名周围使用引号,因为 Windows 允许文件名和路径中有空格
如果您收到错误:ModuleNotFoundError: No module named 'pandas' 那么您需要安装 pandas 模块
pip.exe install pandas
如果您在运行时遇到关键错误,例如:
KeyError: 'studentID'
注意:列名区分大小写
然后您可以运行以下 python 代码来检查数据框中的实际列名
import pandas
df = pd.read_csv("J:\JFAFiles\JFA-BR.csv",dtype=str,keep_default_na=False,header=0)
print(df.columns)
在上面的示例输入上运行脚本的结果是:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ImportAcademicExtract xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<AcademicExtract>
<StudentID>1</StudentID>
<LastName>Brown</LastName>
<FirstName>Adam</FirstName>
<MiddleName></MiddleName>
<SocialSecurityNumber>12345678</SocialSecurityNumber>
<BirthDate>01/01/1970</BirthDate>
<GradeLevel>Masters</GradeLevel>
<SpecialProgramIndicator>SPI1</SpecialProgramIndicator>
<CIPCode>CIP1</CIPCode>
<RegisteredHours>20</RegisteredHours>
<PostalAddresses>
<PostalAddress>
<Address>this is line 1</Address>
<Address2ndLn>this is line 2</Address2ndLn>
<City>NY</City>
<State>NY</State>
<PostalCode>PC123</PostalCode>
<Country>USA</Country>
</PostalAddress>
</PostalAddresses>
<EmailAddresses>
<Email>adam@example.com</Email>
</EmailAddresses>
<PhoneNumbers>
<PhoneNumber>213415441</PhoneNumber>
</PhoneNumbers>
<AdmissionsTerm>
<AdmissionTerm>
<TermName>Summer</TermName>
<AcademicYear>2020</AcademicYear>
<AdmittedDate>Date1</AdmittedDate>
</AdmissionTerm>
</AdmissionsTerm>
<EnrollmentTerms>
<EnrollmentTerm>
<TermName2>Autmn</TermName2>
<AcademicYear2>2021</AcademicYear2>
<CumulativeAttemptedHours>Attempted Hours1</CumulativeAttemptedHours>
<CumulativeRegisteredHours>Registered Hours1</CumulativeRegisteredHours>
<CumulativeEarnedHours>Earned Hours 1</CumulativeEarnedHours>
<CumulativeGPA>GPA1</CumulativeGPA>
<TermStartDate>Date1</TermStartDate>
<TermEndDate>Date2</TermEndDate>
<EnrollmentType></EnrollmentType>
<EnrollmentStatus></EnrollmentStatus>
<FirstTimeDegreeSeeking></FirstTimeDegreeSeeking>
<IntentToReturn></IntentToReturn>
<WithdrawnDate></WithdrawnDate>
<ExtTermName></ExtTermName>
<AcademicYearStartDate></AcademicYearStartDate>
<AcademicYearEndDate></AcademicYearEndDate>
<ExtEnrollmentType></ExtEnrollmentType>
<DateOfDetermination></DateOfDetermination>
<LastDateOfAttendance></LastDateOfAttendance>
<SAPStatus></SAPStatus>
</EnrollmentTerm>
</EnrollmentTerms>
<AcademicPrograms>
<AcademicProgram>
<ProgramCredentialLevel></ProgramCredentialLevel>
<ProgramName></ProgramName>
<EdMajor1></EdMajor1>
<SiteName></SiteName>
<EffectiveStartDate></EffectiveStartDate>
<EffectiveEndDate></EffectiveEndDate>
<GraduationDateDate></GraduationDateDate>
<AnticipatedGraduationDate></AnticipatedGraduationDate>
<ProgramLengthInWeeks></ProgramLengthInWeeks>
<ProgramLengthInMonths></ProgramLengthInMonths>
<ProgramLengthInYears></ProgramLengthInYears>
<AcademicYearBeginDate2></AcademicYearBeginDate2>
<AcademicYearEndDate2></AcademicYearEndDate2>
<WeeksInProgramAcademicYear></WeeksInProgramAcademicYear>
</AcademicProgram>
</AcademicPrograms>
</AcademicExtract>
<AcademicExtract>
<StudentID>2</StudentID>
<LastName>Baker</LastName>
<FirstName>Andrew</FirstName>
<MiddleName></MiddleName>
<SocialSecurityNumber>43219952</SocialSecurityNumber>
<BirthDate>02/01/1970</BirthDate>
<GradeLevel>PHD</GradeLevel>
<SpecialProgramIndicator>SPI2</SpecialProgramIndicator>
<CIPCode>CIP2</CIPCode>
<RegisteredHours>30</RegisteredHours>
<PostalAddresses>
<PostalAddress>
<Address>this is line 1</Address>
<Address2ndLn>this is line 2</Address2ndLn>
<City>NY</City>
<State>NY</State>
<PostalCode>PC234</PostalCode>
<Country>USA</Country>
</PostalAddress>
</PostalAddresses>
<EmailAddresses>
<Email>andrew@example.com</Email>
</EmailAddresses>
<PhoneNumbers>
<PhoneNumber>132141541</PhoneNumber>
</PhoneNumbers>
<AdmissionsTerm>
<AdmissionTerm>
<TermName>Summer</TermName>
<AcademicYear>2020</AcademicYear>
<AdmittedDate>Date2</AdmittedDate>
</AdmissionTerm>
</AdmissionsTerm>
<EnrollmentTerms>
<EnrollmentTerm>
<TermName2>Autmn</TermName2>
<AcademicYear2>2021</AcademicYear2>
<CumulativeAttemptedHours>Attempted Hours2</CumulativeAttemptedHours>
<CumulativeRegisteredHours>Registered Hours2</CumulativeRegisteredHours>
<CumulativeEarnedHours>Earned Hours 2</CumulativeEarnedHours>
<CumulativeGPA>GPA2</CumulativeGPA>
<TermStartDate>Date1</TermStartDate>
<TermEndDate>Date2</TermEndDate>
<EnrollmentType></EnrollmentType>
<EnrollmentStatus></EnrollmentStatus>
<FirstTimeDegreeSeeking></FirstTimeDegreeSeeking>
<IntentToReturn></IntentToReturn>
<WithdrawnDate></WithdrawnDate>
<ExtTermName></ExtTermName>
<AcademicYearStartDate></AcademicYearStartDate>
<AcademicYearEndDate></AcademicYearEndDate>
<ExtEnrollmentType></ExtEnrollmentType>
<DateOfDetermination></DateOfDetermination>
<LastDateOfAttendance></LastDateOfAttendance>
<SAPStatus></SAPStatus>
</EnrollmentTerm>
</EnrollmentTerms>
<AcademicPrograms>
<AcademicProgram>
<ProgramCredentialLevel></ProgramCredentialLevel>
<ProgramName></ProgramName>
<EdMajor1></EdMajor1>
<SiteName></SiteName>
<EffectiveStartDate></EffectiveStartDate>
<EffectiveEndDate></EffectiveEndDate>
<GraduationDateDate></GraduationDateDate>
<AnticipatedGraduationDate></AnticipatedGraduationDate>
<ProgramLengthInWeeks></ProgramLengthInWeeks>
<ProgramLengthInMonths></ProgramLengthInMonths>
<ProgramLengthInYears></ProgramLengthInYears>
<AcademicYearBeginDate2></AcademicYearBeginDate2>
<AcademicYearEndDate2></AcademicYearEndDate2>
<WeeksInProgramAcademicYear></WeeksInProgramAcademicYear>
</AcademicProgram>
</AcademicPrograms>
</AcademicExtract>
</ImportAcademicExtract>
推荐阅读
- python - 如何使用 SQLite3 在页面加载/刷新时将数据插入 SQLite?
- javascript - Chrome 通知何时触发“onClosed”?
- matlab - keep matlab view open after the function exits
- angular - Ionic 3 Set date from selected date
- javascript - Vue Router 上一页无需重新加载
- regex - 使用正则表达式和 bash 在 xidel 中为 xpath 表达式创建别名
- apache-flink - 使用 Flink 处理过去 10 分钟内的 kafka 消息?
- java - Can we add spring boot cacheable method inside scheduled cache evict method for application safety?
- perl - 我想知道 seek 的效率
- java - 如何在每次测试之间清除 chrome 浏览器状态?