R: removing duplicate siblings in XML data

Problem description

I am working with a bug-report XML dataset:

 `</short_desc>     
  <report id="322231">
<update>
  <when>1136281841</when>
  <what>When uploading a objectice-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
<update>
  <when>1136420901</when>
  <what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
 </report>
</short_desc> `

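I am creating a data frame from the XML above by keeping only the <when> and <what> node data. The <what> nodes contain duplicated content: if the <what> text of two <update> nodes is similar, I want to keep only the last (most recent) node. The comparison should use cosine similarity in R. If the <what> text differs, I want to keep both rows in the resulting data frame. Please advise. In some cases a single <report> has more than two <update> nodes whose text is roughly similar.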

Tags: r, xml, data-science, cosine-similarity

Solution


Try the following...

library(xml2)

Sample data

doc <- read_xml( '<report id="322231">
<update>
                 <when>1136281841</when>
                 <what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
                 </update>
                 <update>
                 <when>1136420901</when>
                 <what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
                 </update>
                 </report>')

Code

#create nodeset with all 'what'-nodes
what.nodes <- xml_find_all( doc, ".//what" )

#now make a data.frame
df <- data.frame( 
  #get report-attribute "id" by retracing the ancestor tree from the what.nodes
  report_id = xml_attr( xml_find_first( what.nodes, "./ancestor::report" ), "id" ),
  #get the sibling 'when' from the what-node
  when = xml_text( xml_find_first( what.nodes, "./preceding-sibling::when" ) ),
  #get 'what'
  what = xml_text( what.nodes ),
  #keep strings as character vectors instead of factors
  stringsAsFactors = FALSE )

#get rows with unique values from the bottom-up
df[ !duplicated( df$what, fromLast = TRUE ), ]

Output

#   report_id       when                                                                                              what
# 2    322231 1136420901 When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream
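
Note that duplicated() only removes rows whose text is exactly identical, so it would keep both updates from the question, where the first one contains the typo "objectice-c++". The following is a minimal base-R sketch of the cosine-similarity comparison the question asks for, not a definitive implementation. The helper names cosine_sim and drop_near_duplicates, the word-count tokenization, and the 0.9 threshold are all assumptions made for illustration, and the third update row is invented to show that dissimilar text is kept. It also assumes the rows of each report are already in chronological order, as in the sample data.

```r
# Hypothetical helper: cosine similarity of two strings based on word counts.
cosine_sim <- function(a, b) {
  tokenize <- function(x) {
    tokens <- strsplit(tolower(x), "[^a-z0-9+]+")[[1]]
    tokens[nzchar(tokens)]
  }
  ta <- tokenize(a)
  tb <- tokenize(b)
  vocab <- union(ta, tb)
  # count how often each vocabulary word occurs in each text
  va <- tabulate(match(ta, vocab), nbins = length(vocab))
  vb <- tabulate(match(tb, vocab), nbins = length(vocab))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) return(0)
  sum(va * vb) / denom
}

# Hypothetical helper: drop a row when a later row of the same report
# has a 'what' text at least as similar as the (assumed) threshold.
drop_near_duplicates <- function(df, threshold = 0.9) {
  keep <- rep(TRUE, nrow(df))
  for (i in seq_len(nrow(df))) {
    later <- which(seq_len(nrow(df)) > i & df$report_id == df$report_id[i])
    for (j in later) {
      if (cosine_sim(df$what[i], df$what[j]) >= threshold) {
        keep[i] <- FALSE
        break
      }
    }
  }
  df[keep, , drop = FALSE]
}

# The two updates from the question plus one invented, dissimilar update.
updates <- data.frame(
  report_id = c("322231", "322231", "322231"),
  when = c("1136281841", "1136420901", "1136500000"),
  what = c(
    "When uploading a objectice-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream",
    "When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream",
    "Attached a patch for the MIME type detection"
  ),
  stringsAsFactors = FALSE
)

result <- drop_near_duplicates(updates)
print(result[, c("report_id", "when")])
```

With these inputs the two near-identical updates differ in one word out of sixteen, so their similarity is well above the threshold and only the later one is kept, together with the invented third row. The threshold would need tuning on real data, and for large reports a package-based term-document matrix may be more practical than this pairwise loop.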
