r - 将xml文件转换为R中的数据框并回溯源
问题描述
我有一个 XML 文件,其中包含:
<?xml version="1.0" encoding="UTF-8" ?>
<Repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DECLARE>
<PhysicalColumn name="Department" parentName=""Sample App Lite Data"..."D20 Offices"" parentId="3001:129" parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName=""Sample App Lite Data"..."SAMP_OFFICES_D"."Department""/>
</SourceColumn>
</PhysicalColumn>
<LogicalTable name="D2 Offices" parentName=""SampleApp Lite"" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office""/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office Key""/>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Department""/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Dept Key""/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Organization""/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Org Key""/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Company""/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Company Key""/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office Sequence""/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."LTS1 Offices""/>
</TableSources>
</LogicalTable>
<LogicalTableSource name="LTS1 Offices" parentName=""SampleApp Lite"."D2 Offices"" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName=""Sample App Lite Data"..."D20 Offices""/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>
<PresentationColumn name="Department" parentName=""Sample Targets Lite".."Offices"" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Department""/>
</PresentationColumn>
</DECLARE>
</Repository>
我需要从哪里找到 Presentation Column 的来源,它是物理列名和使用不同 ID 的物理表。例如,我有 PresentationColumn 名称 = RefLogicalColumn id="2006:42565" 的部门。
<**PresentationColumn name="Department"** parentName=""Sample Targets Lite".."Offices"" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<**RefLogicalColumn id="2006:42565"** uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Department""/>
</PresentationColumn>
通过使用 RefLogicalColumn id="2006:42565" 我们将使用 RefLogicalColumn id 在 LogicalTable 中进行搜索。
<LogicalTable name="D2 Offices" parentName=""SampleApp Lite"" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office""/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office Key""/>
<**RefLogicalColumn id="2006:42565"** uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Department""/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Dept Key""/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Organization""/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Org Key""/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Company""/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Company Key""/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."Office Sequence""/>
</Columns>
<TableSources>
<**RefLogicalTableSource id="2037:43058"** uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName=""SampleApp Lite"."D2 Offices"."LTS1 Offices""/>
</TableSources>
</LogicalTable>
然后使用 RefLogicalTableSource id = 2037:43058 我们将使用 id 在 LogicalTableSource 中进行搜索。
<LogicalTableSource name="LTS1 Offices" parentName=""SampleApp Lite"."D2 Offices"" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" **id="2037:43058"** uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<**RefPhysicalTable id="3001:129"** uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName=""Sample App Lite Data"..."D20 Offices""/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>
然后使用 RefPhysicalTable id= 3001:129 我们将使用 parentId 在 PhysicalColumn 中搜索。
<PhysicalColumn name="Department" parentName=""Sample App Lite Data"..."D20 Offices"" **parentId="3001:129"** parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName=""Sample App Lite Data"..."SAMP_OFFICES_D"."Department""/>
</SourceColumn>
</PhysicalColumn>
从这里我们需要 PhysicalColumn name="Department" 和 extName="//Table/SAMP_OFFICES_D/DEPARTMENT"
我的第一个问题是将我的 xml 文件转换为数据框,第二个问题是回溯源。
解决方案
xml2::read_xml
将帮助您阅读它。另一个会更难,因为看起来你有 3 个关系表。看到这个页面,可能还有这个,尽管当我尝试将它组合到一个表中时它变得很乱。
library(xml2)
library(tidyverse)
dfxml <- xml2::read_xml("C:/foo/bar.xml")
mcga <- function(tbl) {
x <- colnames(tbl)
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
x <- make.unique(x, sep = "_")
colnames(tbl) <- x
tbl
}
dfxlm2 <- xml_find_all(dfxml1, ".//*") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga()
或者将它们分成 3 张桌子。
LogicalTable <- xml_find_all(dfxml1, ".//LogicalTable//*") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga()
PhysicalTable <- xml_find_all(dfxml1, ".//PhysicalColumn") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga()
LogTable <- xml_find_all(dfxml1, ".//LogicalTableSource//*") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga()
你想如何跟踪这些?
推荐阅读
- mysql - MySQL“Connector/J 5.1.48”平台独立于 MacOS Mojave 问题
- intellij-idea - 带有 setter 的 Intellij IDEA 自定义构造函数
- haskell - Haskell Crypto.Hash.SHA256 输出
- laravel-5.8 - 重定向回job_seeker.register
- homebrew - 使用 brew 安装 Buck 尝试访问无效的 url 并抛出 404
- math - 如何获得由许多三角形组成的网格表面(或平面)的 4 个角?
- python - 找到图像的中心来定义一个 hitbox
- python - 关于 np.nonzero(arr)[0] 中尾随 [0] 的作用的问题
- linux-kernel - filename_lookup.part.64 vs filename_lookup 符号
- python - 将 NumPy 数据流式传输到 Tensorflow