Reading XML with limited root nodes into a data frame in R

Question

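I have an XML file containing archived text from The New Yorker magazine. I am new to parsing XML, but I would like to convert this file into a data frame in R so that I can do some basic text analysis (e.g., word clouds).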

Here is a subset of the XML file:

<DjVuXML>
<HEAD>file://clustershare/storage/djvu/Conde%20Nast/New%20Yorker/1925_02_21/page0000004.djvu</HEAD>
<BODY>
<OBJECT data="file://clustershare/storage/djvu/Conde%20Nast/New%20Yorker/1925_02_21/page0000004.djvu" type="image/x.djvu" height="7080" width="5040" usemap="page0000004.djvu" >
<PARAM name="DPI" value="600" />
<PARAM name="GAMMA" value="2.200000" />
  <HIDDENTEXT>
    <PAGECOLUMN>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="742,420,776,362">2</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="738,631,934,562">weeks,</WORD>
            <WORD coords="969,638,1081,584">you</WORD>
            <WORD coords="1119,620,1244,564">will</WORD>
            <WORD coords="1281,641,1553,565">probably</WORD>
            <WORD coords="1589,624,1734,567">need</WORD>
            <WORD coords="1771,624,1878,589">one</WORD>
            <WORD coords="1916,642,1988,568">by</WORD>
          </LINE>
          <LINE>
            <WORD coords="738,702,858,646">that</WORD>
            <WORD coords="899,703,1036,647">time</WORD>
            <WORD coords="1076,723,1334,649">anyhow.</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="819,871,966,814">This</WORD>
            <WORD coords="1014,872,1127,828">two</WORD>
            <WORD coords="1175,873,1374,817">week&apos;s</WORD>
            <WORD coords="1422,891,1588,817">delay</WORD>
            <WORD coords="1634,876,1811,819">looms</WORD>
            <WORD coords="1859,876,1911,841">as</WORD>
            <WORD coords="1958,877,1988,842">a</WORD>
          </LINE>
          <LINE>
            <WORD coords="737,956,1099,900">tremendous</WORD>
            <WORD coords="1157,958,1395,900">obstacle</WORD>
            <WORD coords="1455,959,1566,904">and</WORD>
            <WORD coords="1623,977,1735,925">you</WORD>
            <WORD coords="1794,962,1984,904">hasten</WORD>
          </LINE>
          <LINE>
            <WORD coords="736,1060,1101,983">breathlessly</WORD>
            <WORD coords="1128,1042,1186,1000">to</WORD>
            <WORD coords="1214,1042,1310,987">the</WORD>
            <WORD coords="1338,1063,1642,986">telephone</WORD>
            <WORD coords="1670,1065,1988,990">company&apos;s</WORD>
          </LINE>
          <LINE>
            <WORD coords="736,1124,890,1070">office</WORD>
            <WORD coords="949,1125,1137,1070">where</WORD>
            <WORD coords="1194,1145,1307,1091">you</WORD>
            <WORD coords="1365,1128,1594,1071">become</WORD>
            <WORD coords="1650,1148,1774,1087">part</WORD>
            <WORD coords="1832,1128,1902,1073">of</WORD>
            <WORD coords="1958,1128,1988,1094">a</WORD>
          </LINE>
          <LINE>
            <WORD coords="737,1225,946,1153">throng</WORD>
            <WORD coords="973,1229,1352,1155">surrounding</WORD>
            <WORD coords="1380,1212,1407,1178">a</WORD>
            <WORD coords="1436,1213,1669,1171">counter</WORD>
            <WORD coords="1698,1214,1795,1157">for</WORD>
            <WORD coords="1823,1214,1984,1160">about</WORD>
          </LINE>
          <LINE>
            <WORD coords="735,1293,805,1259">an</WORD>
            <WORD coords="838,1296,999,1239">hour.</WORD>
            <WORD coords="1074,1296,1150,1240">At</WORD>
            <WORD coords="1183,1296,1278,1241">the</WORD>
            <WORD coords="1311,1297,1424,1240">end</WORD>
            <WORD coords="1455,1297,1526,1241">of</WORD>
            <WORD coords="1556,1297,1676,1242">that</WORD>
            <WORD coords="1709,1298,1846,1242">time</WORD>
            <WORD coords="1875,1316,1984,1262">you</WORD>
          </LINE>
          <LINE>
            <WORD coords="736,1378,840,1322">tell</WORD>
            <WORD coords="871,1398,1013,1343">your</WORD>
            <WORD coords="1044,1398,1194,1337">story</WORD>
            <WORD coords="1225,1380,1282,1337">to</WORD>
            <WORD coords="1316,1381,1343,1346">a</WORD>
            <WORD coords="1375,1381,1504,1346">man</WORD>
            <WORD coords="1538,1381,1591,1340">at</WORD>
            <WORD coords="1625,1381,1721,1326">the</WORD>
            <WORD coords="1755,1382,1984,1339">counter</WORD>
          </LINE>
          <LINE>
            <WORD coords="735,1463,865,1408">who</WORD>
            <WORD coords="916,1481,1125,1407">dodges</WORD>
            <WORD coords="1172,1464,1231,1421">to</WORD>
            <WORD coords="1279,1464,1308,1429">a</WORD>
            <WORD coords="1358,1465,1488,1409">desk</WORD>
            <WORD coords="1535,1484,1839,1410">telephone</WORD>
            <WORD coords="1890,1467,1984,1411">for</WORD>
          </LINE>
          <LINE>
            <WORD coords="736,1548,1173,1493">conversational</WORD>
            <WORD coords="1203,1570,1465,1514">purposes</WORD>
            <WORD coords="1494,1569,1660,1515">every</WORD>
            <WORD coords="1688,1569,1848,1495">forty</WORD>
            <WORD coords="1875,1550,1986,1516">sec-</WORD>
          </LINE>
          <LINE>
            <WORD coords="743,1632,777,1598">o</WORD>
            <WORD coords="802,1633,838,1599">n</WORD>
            <WORD coords="864,1633,901,1578">d</WORD>
            <WORD coords="926,1647,987,1599">s,</WORD>
            <WORD coords="1044,1652,1336,1577">obviously</WORD>
            <WORD coords="1387,1635,1446,1592">&#10;o</WORD>
          </LINE>
          <LINE>
            <WORD coords="742,1720,1117,1664">demonstrate</WORD>
            <WORD coords="1189,1720,1340,1665">what</WORD>
            <WORD coords="1414,1720,1441,1686">a</WORD>
          </LINE>
          <LINE>
            <WORD coords="743,1823,924,1748">really</WORD>
            <WORD coords="963,1820,1116,1761">great</WORD>
            <WORD coords="1158,1823,1294,1748">help</WORD>
            <WORD coords="1335,1805,1446,1749">this</WORD>
          </LINE>
          <LINE>
            <WORD coords="744,1889,1039,1833">invention</WORD>
            <WORD coords="1077,1889,1119,1834">is</WORD>
            <WORD coords="1154,1889,1212,1848">to</WORD>
            <WORD coords="1247,1890,1274,1855">a</WORD>
            <WORD coords="1310,1907,1446,1833">busy</WORD>
          </LINE>
          <LINE>
            <WORD coords="742,1974,893,1938">man.</WORD>
            <WORD coords="943,1974,1092,1917">This</WORD>
            <WORD coords="1119,1990,1446,1918">gentleman</WORD>
          </LINE>
          <LINE>
            <WORD coords="744,2077,1067,2001">ultimately</WORD>
            <WORD coords="1120,2077,1280,2003">helps</WORD>
            <WORD coords="1333,2077,1446,2024">YOll</WORD>
          </LINE>
          <LINE>
            <WORD coords="745,2142,833,2085">fill</WORD>
            <WORD coords="860,2142,958,2100">out</WORD>
            <WORD coords="985,2143,1054,2108">an</WORD>
            <WORD coords="1081,2163,1446,2087">Application</WORD>
          </LINE>
          <LINE>
            <WORD coords="743,2226,840,2170">for</WORD>
            <WORD coords="870,2226,1088,2171">Service</WORD>
            <WORD coords="1118,2227,1307,2172">which</WORD>
            <WORD coords="1332,2246,1445,2192">you</WORD>
          </LINE>
          <LINE>
            <WORD coords="744,2327,1040,2254">recognize</WORD>
            <WORD coords="1093,2311,1146,2276">as</WORD>
            <WORD coords="1197,2311,1293,2256">the</WORD>
            <WORD coords="1344,2312,1446,2256">old</WORD>
          </LINE>
          <LINE>
            <WORD coords="745,2395,965,2338">income</WORD>
            <WORD coords="1000,2396,1088,2355">tax</WORD>
            <WORD coords="1123,2396,1315,2340">blanks</WORD>
            <WORD coords="1349,2397,1446,2341">the</WORD>
          </LINE>
          <LINE>
            <WORD coords="745,2481,1139,2422">Government</WORD>
            <WORD coords="1196,2481,1330,2426">used</WORD>
            <WORD coords="1385,2482,1444,2425">in</WORD>
          </LINE>
          <LINE>
            <WORD coords="754,2565,928,2511">1919.</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="4003,3841,4153,3784">THE</WORD>
            <WORD coords="4187,3842,4346,3787">NEW</WORD>
            <WORD coords="4377,3843,4636,3787">YORKER</WORD>
          </LINE>
          <LINE>
            <WORD coords="3383,6199,3583,6142">brands</WORD>
            <WORD coords="3612,6200,3681,6144">of</WORD>
            <WORD coords="3711,6202,3938,6144">hokum.</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="3405,4362,3776,4291">&#10;cJ</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="3382,6548,3656,6332">1-&amp;</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="4188,6542,4630,6292">-</WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
      <REGION>
        <PARAGRAPH>
          <LINE>
            <WORD coords="3818,6536,4059,6430">&#10; </WORD>
          </LINE>
        </PARAGRAPH>
      </REGION>
    </PAGECOLUMN>
  </HIDDENTEXT>
</OBJECT>
<MAP name="page0000004.djvu"/>
</BODY>
</DjVuXML>

I am having a lot of trouble extracting the text from this file in a delimited format. For example, using the XML package, I tried the following:

library(XML)

filename <- "~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml"
xmlData <- xmlParse(filename)
xmlDataFrame <- xmlToDataFrame(xmlData)
xmlDataFrame$OBJECT[2]

But this only yields one long string of text with no spaces.

I also tried searching the XML file for a specific node:

rootnode <- xmlRoot(xmlData)
print(rootnode[[2]][[1]])

But this just returns the entire XML file rather than a specific node.

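I know I am probably missing a very simple step for turning this into a manageable, analyzable data frame. I am happy to edit the question or provide more information if that would help. As I mentioned, I am new to this, so any advice is appreciated.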

Tags: r, xml

Solution


Here is a solution using the xml2 package. It assumes your XML file is saved at the path ~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml:

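library(xml2)
doc <- read_xml("~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml")
# Select every WORD element, wherever it sits in the tree
word_nodes <- xml_find_all(doc, xpath = "//WORD")
# Join the text of all WORD nodes with single spaces
extracted_text <- paste(xml_text(word_nodes), collapse = " ")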

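The xml_find_all call uses XPath, which is similar to regular expressions but operates on trees. "//WORD" finds all WORD elements, no matter where they are in the document. For details, see the W3 tutorial on XPath.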
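
Since the question asks for a data frame rather than a single string, the same XPath approach can be extended to keep one row per word. The following is a minimal sketch and not part of the original answer: it assumes the LINE/WORD nesting shown in the question, parses a small inline sample instead of the real file, and the names sample_xml, line_nodes, and words_df are illustrative only.

```r
library(xml2)

# Small inline sample in the same shape as the question's file
# (assumption: the real file uses the same LINE/WORD nesting).
sample_xml <- '<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN>
  <REGION><PARAGRAPH>
    <LINE>
      <WORD coords="738,702,858,646">that</WORD>
      <WORD coords="899,703,1036,647">time</WORD>
    </LINE>
    <LINE>
      <WORD coords="1076,723,1334,649">anyhow.</WORD>
    </LINE>
  </PARAGRAPH></REGION>
</PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>'

doc <- read_xml(sample_xml)
line_nodes <- xml_find_all(doc, "//LINE")

# Build one small data frame per LINE, then stack them row-wise.
rows <- lapply(seq_along(line_nodes), function(i) {
  words <- xml_find_all(line_nodes[[i]], "./WORD")
  data.frame(
    line   = rep(i, length(words)),    # running index of the LINE in the document
    word   = xml_text(words),          # text content of each WORD
    coords = xml_attr(words, "coords"),# raw "coords" attribute, left unparsed
    stringsAsFactors = FALSE
  )
})
words_df <- do.call(rbind, rows)
print(words_df)
```

With the real file, replacing sample_xml with the file path in read_xml() should give one row per WORD; the line column can then be used to regroup words, for example with tapply(words_df$word, words_df$line, paste, collapse = " ").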

The code above returns the following:

> extracted_text 
[1] "2 weeks, you will probably need one by that time anyhow. This two week's delay looms as a tremendous obstacle and you hasten breathlessly to the telephone company's office where you become part of a throng surrounding a counter for about an hour. At the end of that time you tell your story to a man at the counter who dodges to a desk telephone for conversational purposes every forty sec- o n d s, obviously \no demonstrate what a really great help this invention is to a busy man. This gentleman ultimately helps YOll fill out an Application for Service which you recognize as the old income tax blanks the Government used in 1919. THE NEW YORKER brands of hokum. \ncJ 1-& - \n "
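
The output above still contains stray newlines (for example "\no" and "\ncJ") that come from the OCR text inside some WORD elements. As a hedged sketch of one way to tidy this before text analysis, the snippet below trims each word and drops entries that are whitespace only; the words vector is a hand-copied subset of the values shown above, and clean_text is an illustrative name.

```r
# A few word strings as xml_text() would return them, copied from the sample.
words <- c("obviously", "\no", "demonstrate", "hokum.", "\ncJ", "1-&", "-", "\n ")

words <- trimws(words)          # strip leading/trailing whitespace, including "\n"
words <- words[nzchar(words)]   # drop entries that were whitespace only
clean_text <- paste(words, collapse = " ")
print(clean_text)
```

xml_text() also accepts trim = TRUE, which performs the trimming step at extraction time; empty strings would still need to be filtered out afterwards.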
