首页 > 解决方案 > 将xml文件转换为R中的数据框并回溯源

问题描述

我有一个 XML 文件,其中包含:

 <?xml version="1.0" encoding="UTF-8" ?>
<Repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DECLARE>
<PhysicalColumn name="Department" parentName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;" parentId="3001:129" parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;SAMP_OFFICES_D&quot;.&quot;Department&quot;"/>
</SourceColumn>
</PhysicalColumn>

<LogicalTable name="D2 Offices" parentName="&quot;SampleApp Lite&quot;" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office&quot;"/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Key&quot;"/>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Dept Key&quot;"/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Organization&quot;"/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Org Key&quot;"/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company&quot;"/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company Key&quot;"/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Sequence&quot;"/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;LTS1 Offices&quot;"/>
</TableSources>
</LogicalTable>

<LogicalTableSource name="LTS1 Offices" parentName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;"/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>

<PresentationColumn name="Department" parentName="&quot;Sample Targets Lite&quot;..&quot;Offices&quot;" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
</PresentationColumn>
</DECLARE>
</Repository>

我需要从哪里找到 Presentation Column 的来源,它是物理列名和使用不同 ID 的物理表。例如,我有 PresentationColumn 名称 = RefLogicalColumn id="2006:42565" 的部门。

<**PresentationColumn name="Department"** parentName="&quot;Sample Targets Lite&quot;..&quot;Offices&quot;" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<**RefLogicalColumn id="2006:42565"** uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
</PresentationColumn>

通过使用 RefLogicalColumn id="2006:42565" 我们将使用 RefLogicalColumn id 在 LogicalTable 中进行搜索。

<LogicalTable name="D2 Offices" parentName="&quot;SampleApp Lite&quot;" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office&quot;"/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Key&quot;"/>
<**RefLogicalColumn id="2006:42565"** uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Dept Key&quot;"/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Organization&quot;"/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Org Key&quot;"/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company&quot;"/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company Key&quot;"/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Sequence&quot;"/>
</Columns>
<TableSources>
<**RefLogicalTableSource id="2037:43058"** uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;LTS1 Offices&quot;"/>
</TableSources>
</LogicalTable>

然后使用 RefLogicalTableSource id = 2037:43058 我们将使用 id 在 LogicalTableSource 中进行搜索。

<LogicalTableSource name="LTS1 Offices" parentName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" **id="2037:43058"** uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<**RefPhysicalTable id="3001:129"** uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;"/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>

然后使用 RefPhysicalTable id= 3001:129 我们将使用 parentId 在 PhysicalColumn 中搜索。

<PhysicalColumn name="Department" parentName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;" **parentId="3001:129"** parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;SAMP_OFFICES_D&quot;.&quot;Department&quot;"/>
</SourceColumn>
</PhysicalColumn>

从这里我们需要 PhysicalColumn name="Department" 和 extName="//Table/SAMP_OFFICES_D/DEPARTMENT"

我的第一个问题是将我的 xml 文件转换为数据框,第二个问题是回溯源。

标签: rxml

解决方案


xml2::read_xml将帮助您阅读它。另一个会更难,因为看起来你有 3 个关系表。看到这个页面,可能还有这个,尽管当我尝试将它组合到一个表中时它变得很乱。

library(xml2)
library(tidyverse)
dfxml <- xml2::read_xml("C:/foo/bar.xml")

mcga <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
}

dfxlm2 <- xml_find_all(dfxml1, ".//*") %>% 
  map_df(~{
   xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

或者将它们分成 3 张桌子。

LogicalTable <- xml_find_all(dfxml1, ".//LogicalTable//*") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

PhysicalTable <- xml_find_all(dfxml1, ".//PhysicalColumn") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

LogTable <- xml_find_all(dfxml1, ".//LogicalTableSource//*") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

你想如何跟踪这些?


推荐阅读