首页 > 解决方案 > 使用 R 提取 XML 中的属性

问题描述

尝试从 XML 文件提取(从大型 XML 文件)中提取两个属性,即“nmRegime”和“CalendarSystemT”(这是日期)。提取后,这两条记录需要与文件名一起保存为 R 中数据框中的两列。
一个给定的 XML 文件中有多个“事件”节点,并且有近 100 个单独的 XML 文件。

<Event tEV="FirA" clearEV="false" onEV="true" dateOriginEV="Calendar" nYrsFromStEV="" nDaysFromStEV="" tFaqEV="Blank" tAaqEV="Blank" aqStYrEV="0" aqEnYrEV="0" nmEV="Fire_Cool" categoryEV="CatUndef" tEvent="Doc" idSP="105" nmRegime="Wheat, Tilled, stubble cool burn" regimeInstance="1">
      <notesEV></notesEV>
      <dateEV CalendarSystemT="FixedLength">19710331</dateEV>
      <FirA fracAfctFirA="0.6" fracGbfrToAtmsFirA="0.98" fracStlkToAtmsFirA="0.98" fracLeafToAtmsFirA="0.98" fracGbfrToGlitFirA="0.02" fracStlkToSlitFirA="0.02" fracLeafToLlitFirA="0.02" fracCortToCodrFirA="1.0" fracFirtToFidrFirA="1.0" fracDGlitToAtmsFirA="0.931" fracRGlitToAtmsFirA="0.931" fracDSlitToAtmsFirA="0.931" fracRSlitToAtmsFirA="0.931" fracDLlitToAtmsFirA="0.931" fracRLlitToAtmsFirA="0.931" fracDCodrToAtmsFirA="0.0" fracRCodrToAtmsFirA="0.0" fracDFidrToAtmsFirA="0.0" fracRFidrToAtmsFirA="0.0" fracDGlitToInrtFirA="0.019" fracRGlitToInrtFirA="0.019" fracDSlitToInrtFirA="0.019" fracRSlitToInrtFirA="0.019" fracDLlitToInrtFirA="0.019" fracRLlitToInrtFirA="0.019" fracDCodrToInrtFirA="0.0" fracRCodrToInrtFirA="0.0" fracDFidrToInrtFirA="0.0" fracRFidrToInrtFirA="0.0" fracSopmToAtmsFirA="" fracLrpmToAtmsFirA="" fracMrpmToAtmsFirA="" fracSommToAtmsFirA="" fracLrmmToAtmsFirA="" fracMrmmToAtmsFirA="" fracMicrToAtmsFirA="" fracSopmToInrtFirA="" fracLrpmToInrtFirA="" fracMrpmToInrtFirA="" fracSommToInrtFirA="" fracLrmmToInrtFirA="" fracMrmmToInrtFirA="" fracMicrToInrtFirA="" fracMnamNToAtmsFirA="" fracSAmmNToAtmsFirA="" fracSNtrNToAtmsFirA="" fracDAmmNToAtmsFirA="" fracDNtrNToAtmsFirA="" fixFirA="" phaFirA="" />
    </Event>

在提取“nmRegime”方面取得了一些成功,但在“CalendarSystemT”方面没有成功。在下面的代码中用于数据提取。
第二个问题,有没有办法循环XML文件列表并做这个操作?

# get records
library(xml2)
recs <- xml_find_all(xml, "//Event")

#extract the names
labs <- trimws(xml_attr(recs, "nmRegime"))
names <- labs[!is.na(labs)]

# Extract the date 
recs_t <- xml_find_all(xml, "//Event/dateEV")
time <- trimws(xml_attr(recs_t, "CalendarSystemT"))

标签: rxml

解决方案


日历时间值不是属性,而是存储为节点的元素并可以直接访问。

另请注意,如果事件节点缺少“dateEV”,则将“实验室”与“时间”对齐会有问题。最好从每个父节点而不是整个文档中提取时间值。

library(xml2)
library(dplyr)

xml<- read_xml('<Event tEV="FirA" clearEV="false" onEV="true" dateOriginEV="Calendar" nYrsFromStEV="" nDaysFromStEV="" tFaqEV="Blank" tAaqEV="Blank" aqStYrEV="0" aqEnYrEV="0" nmEV="Fire_Cool" categoryEV="CatUndef" tEvent="Doc" idSP="105" nmRegime="Wheat, Tilled, stubble cool burn" regimeInstance="1">
      <notesEV></notesEV>
      <dateEV CalendarSystemT="FixedLength">19710331</dateEV>
      <FirA fracAfctFirA="0.6" fracGbfrToAtmsFirA="0.98" fracStlkToAtmsFirA="0.98" fracLeafToAtmsFirA="0.98" fracGbfrToGlitFirA="0.02" fracStlkToSlitFirA="0.02" fracLeafToLlitFirA="0.02" fracCortToCodrFirA="1.0" fracFirtToFidrFirA="1.0" fracDGlitToAtmsFirA="0.931" fracRGlitToAtmsFirA="0.931" fracDSlitToAtmsFirA="0.931" fracRSlitToAtmsFirA="0.931" fracDLlitToAtmsFirA="0.931" fracRLlitToAtmsFirA="0.931" fracDCodrToAtmsFirA="0.0" fracRCodrToAtmsFirA="0.0" fracDFidrToAtmsFirA="0.0" fracRFidrToAtmsFirA="0.0" fracDGlitToInrtFirA="0.019" fracRGlitToInrtFirA="0.019" fracDSlitToInrtFirA="0.019" fracRSlitToInrtFirA="0.019" fracDLlitToInrtFirA="0.019" fracRLlitToInrtFirA="0.019" fracDCodrToInrtFirA="0.0" fracRCodrToInrtFirA="0.0" fracDFidrToInrtFirA="0.0" fracRFidrToInrtFirA="0.0" fracSopmToAtmsFirA="" fracLrpmToAtmsFirA="" fracMrpmToAtmsFirA="" fracSommToAtmsFirA="" fracLrmmToAtmsFirA="" fracMrmmToAtmsFirA="" fracMicrToAtmsFirA="" fracSopmToInrtFirA="" fracLrpmToInrtFirA="" fracMrpmToInrtFirA="" fracSommToInrtFirA="" fracLrmmToInrtFirA="" fracMrmmToInrtFirA="" fracMicrToInrtFirA="" fracMnamNToAtmsFirA="" fracSAmmNToAtmsFirA="" fracSNtrNToAtmsFirA="" fracDAmmNToAtmsFirA="" fracDNtrNToAtmsFirA="" fixFirA="" phaFirA="" />
    </Event>')

recs <- xml_find_all(xml, "//Event")

#extract the names 
labs <- trimws(xml_attr(recs, "nmRegime")) names <- labs[!is.na(labs)]

# Extract the date  
time <- xml_find_first(recs, ".//dateEV") %>% xml_text() %>% trimws()

要回答您的第二个问题,是的,您可以将上述脚本包装成一个函数,然后使用 lapply 循环遍历整个文件列表。
有关详细信息,请参阅此问题和答案:R XML - 将父节点和子节点(同名)组合到数据框中


推荐阅读