Converting .dat data to a data frame

Problem description

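I am reading a .dat file and trying to convert it to a data frame so that I can save it in a readable format, but I have no idea how to convert .dat data. I am fairly new to R. Any help would be greatly appreciated.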

Code so far:

data <- readLines("Day8.dat")

print(data)

Output so far:

[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" 
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" 
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\"> 
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country> 
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange> 
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" 
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType> 
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator> 
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier> 
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation> 
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>
....

Thanks

Tags: r, xml, dataframe

Solution


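It all depends on what you want to do with the data, i.e. how you want to process it. For example, if you want to parse all the XML tags into separate strings, you can extract them with a regular expression and the str_extract_all function from stringr: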

library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")

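The following, more general regular expression also works when the XML element names vary. Note that, unlike the pattern above, its second alternative also matches standalone closing tags such as </d2lm:exchange>: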

str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")

The result of the first pattern is a list:

[[1]]
 [1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
 [2] "<d2lm:exchange>"                                                                                                                                                                                                                                                                           
 [3] "<d2lm:supplierIdentification>"                                                                                                                                                                                                                                                             
 [4] "<d2lm:country>gb</d2lm:country>"                                                                                                                                                                                                                                                           
 [5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"                                                                                                                                                                                                                                   
 [6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"                                                                                                                                                    
 [7] "<d2lm:feedType>Event Data</d2lm:feedType>"                                                                                                                                                                                                                                                 
 [8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"                                                                                                                                                                                                                
 [9] "<d2lm:publicationCreator>"                                                                                                                                                                                                                                                                 
[10] "<d2lm:country>gb</d2lm:country>"                                                                                                                                                                                                                                                           
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"                                                                                                                                                                                                                                   
[12] "<d2lm:situation version=\"\" id=\"2922904\">"                                                                                                                                                                                                                                              
[13] "<d2lm:headerInformation>"                                                                                                                                                                                                                                                                  
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"   
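
The list above mixes complete elements with lone opening tags. A minimal sketch of why, using the general pattern on a made-up string (the input `x` is hypothetical, not taken from the question's file): the first alternative needs the opening and closing tag on the same line, because `.` does not match a newline by default in stringr, and the second alternative catches whatever tag is left over.

```r
library(stringr)

# Hypothetical miniature input, not taken from the question's file.
x <- "<root>\n<name>NTIS</name><open>"

# First alternative: a tag, any text on the same line, and the matching closing tag.
# Second alternative: any single tag.
pattern <- "<([^>]*)>.*</\\1>|<[^>]*>"

tags <- str_extract_all(x, pattern)[[1]]
print(tags)
```

Here `<root>` and `<open>` have no closing tag on their own line, so only the second alternative can match them, while `<name>NTIS</name>` is returned as one complete element.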

To convert the list to a data frame:

datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))

Edit

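If you want a data frame that holds the text values found between the XML start and end tags, you can extract those tags and values along the following lines: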

datDF <- data.frame(
  tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
  values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
) 
datDF
                       tags                        values
1            <d2lm:country>                            gb
2 <d2lm:nationalIdentifier>                          NTIS
3           <d2lm:feedType>                    Event Data
4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5            <d2lm:country>                            gb
6 <d2lm:nationalIdentifier>                          NTIS
7     <d2lm:areaOfInterest>                      national
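
Since the question asks for a readable saved format, one possible follow-up is to tidy the tag names and write the data frame to a CSV file. This is only a sketch: the two-row `datDF` below is a hypothetical stand-in for the real one, and the output path is a temporary file chosen for illustration.

```r
# Hypothetical stand-in for the datDF built from the real data.
datDF <- data.frame(
  tags = c("<d2lm:country>", "<d2lm:feedType>"),
  values = c("gb", "Event Data")
)

# Strip the leading "<d2lm:" and the trailing ">" from each tag.
datDF$tags <- gsub("^<d2lm:|>$", "", datDF$tags)

# Write to a CSV file; tempfile() is a placeholder for a real output path.
out_file <- tempfile(fileext = ".csv")
write.csv(datDF, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.csv(out_file)
print(check)
```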

Is this, roughly, what you had in mind?
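
If the regular expressions turn out to be brittle on the full file, an XML parser is a possible alternative to the approach above. The sketch below assumes the xml2 package is installed and that the complete file is well-formed XML. The sample in the question is truncated, so a small hypothetical document is used instead, and the element name `d2lm:root` is made up.

```r
library(xml2)

# Hypothetical, well-formed miniature document (not the question's data).
xml_txt <- '<d2lm:root xmlns:d2lm="http://datex2.eu/schema/2/2_0">
  <d2lm:exchange>
    <d2lm:country>gb</d2lm:country>
    <d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
  </d2lm:exchange>
  <d2lm:feedType>Event Data</d2lm:feedType>
</d2lm:root>'

doc <- read_xml(xml_txt)

# Select leaf elements, i.e. elements that have no child elements.
leaves <- xml_find_all(doc, "//*[not(*)]")

# One row per leaf element: its name and its text content.
leafDF <- data.frame(
  tag = xml_name(leaves),
  value = xml_text(leaves)
)
print(leafDF)
```

The XPath expression `//*[not(*)]` is independent of the element names, so it plays the same role as the general regular expression, but it leaves tag matching to the parser.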

Data:

dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" 
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" 
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\"> 
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country> 
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange> 
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" 
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType> 
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator> 
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier> 
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation> 
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'
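
The literal above stands in for the file contents. In the question the file is read with readLines, which returns one string per line, whereas the patterns in this answer are applied to a single string. A minimal sketch of bridging the two, using a temporary file with made-up contents in place of Day8.dat:

```r
# Hypothetical stand-in for Day8.dat, written to a temporary file.
tmp <- tempfile(fileext = ".dat")
writeLines(c("<a>", "<b>1</b>", "</a>"), tmp)

# readLines() returns a character vector with one element per line.
lines <- readLines(tmp)

# Collapse the lines into one string, keeping the line breaks.
dat <- paste(lines, collapse = "\n")

print(length(lines))
print(length(dat))
```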
