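r - Reading XML into a data frame in R with limited root nodes
Question
I have an XML file containing archived text from The New Yorker magazine. I am new to parsing XML, but I would like to convert this file into a data frame in R, where I can do some basic text analysis (e.g., word clouds).
Here is a subset of the XML file: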
<DjVuXML>
<HEAD>file://clustershare/storage/djvu/Conde%20Nast/New%20Yorker/1925_02_21/page0000004.djvu</HEAD>
<BODY>
<OBJECT data="file://clustershare/storage/djvu/Conde%20Nast/New%20Yorker/1925_02_21/page0000004.djvu" type="image/x.djvu" height="7080" width="5040" usemap="page0000004.djvu" >
<PARAM name="DPI" value="600" />
<PARAM name="GAMMA" value="2.200000" />
<HIDDENTEXT>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="742,420,776,362">2</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="738,631,934,562">weeks,</WORD>
<WORD coords="969,638,1081,584">you</WORD>
<WORD coords="1119,620,1244,564">will</WORD>
<WORD coords="1281,641,1553,565">probably</WORD>
<WORD coords="1589,624,1734,567">need</WORD>
<WORD coords="1771,624,1878,589">one</WORD>
<WORD coords="1916,642,1988,568">by</WORD>
</LINE>
<LINE>
<WORD coords="738,702,858,646">that</WORD>
<WORD coords="899,703,1036,647">time</WORD>
<WORD coords="1076,723,1334,649">anyhow.</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="819,871,966,814">This</WORD>
<WORD coords="1014,872,1127,828">two</WORD>
<WORD coords="1175,873,1374,817">week's</WORD>
<WORD coords="1422,891,1588,817">delay</WORD>
<WORD coords="1634,876,1811,819">looms</WORD>
<WORD coords="1859,876,1911,841">as</WORD>
<WORD coords="1958,877,1988,842">a</WORD>
</LINE>
<LINE>
<WORD coords="737,956,1099,900">tremendous</WORD>
<WORD coords="1157,958,1395,900">obstacle</WORD>
<WORD coords="1455,959,1566,904">and</WORD>
<WORD coords="1623,977,1735,925">you</WORD>
<WORD coords="1794,962,1984,904">hasten</WORD>
</LINE>
<LINE>
<WORD coords="736,1060,1101,983">breathlessly</WORD>
<WORD coords="1128,1042,1186,1000">to</WORD>
<WORD coords="1214,1042,1310,987">the</WORD>
<WORD coords="1338,1063,1642,986">telephone</WORD>
<WORD coords="1670,1065,1988,990">company's</WORD>
</LINE>
<LINE>
<WORD coords="736,1124,890,1070">office</WORD>
<WORD coords="949,1125,1137,1070">where</WORD>
<WORD coords="1194,1145,1307,1091">you</WORD>
<WORD coords="1365,1128,1594,1071">become</WORD>
<WORD coords="1650,1148,1774,1087">part</WORD>
<WORD coords="1832,1128,1902,1073">of</WORD>
<WORD coords="1958,1128,1988,1094">a</WORD>
</LINE>
<LINE>
<WORD coords="737,1225,946,1153">throng</WORD>
<WORD coords="973,1229,1352,1155">surrounding</WORD>
<WORD coords="1380,1212,1407,1178">a</WORD>
<WORD coords="1436,1213,1669,1171">counter</WORD>
<WORD coords="1698,1214,1795,1157">for</WORD>
<WORD coords="1823,1214,1984,1160">about</WORD>
</LINE>
<LINE>
<WORD coords="735,1293,805,1259">an</WORD>
<WORD coords="838,1296,999,1239">hour.</WORD>
<WORD coords="1074,1296,1150,1240">At</WORD>
<WORD coords="1183,1296,1278,1241">the</WORD>
<WORD coords="1311,1297,1424,1240">end</WORD>
<WORD coords="1455,1297,1526,1241">of</WORD>
<WORD coords="1556,1297,1676,1242">that</WORD>
<WORD coords="1709,1298,1846,1242">time</WORD>
<WORD coords="1875,1316,1984,1262">you</WORD>
</LINE>
<LINE>
<WORD coords="736,1378,840,1322">tell</WORD>
<WORD coords="871,1398,1013,1343">your</WORD>
<WORD coords="1044,1398,1194,1337">story</WORD>
<WORD coords="1225,1380,1282,1337">to</WORD>
<WORD coords="1316,1381,1343,1346">a</WORD>
<WORD coords="1375,1381,1504,1346">man</WORD>
<WORD coords="1538,1381,1591,1340">at</WORD>
<WORD coords="1625,1381,1721,1326">the</WORD>
<WORD coords="1755,1382,1984,1339">counter</WORD>
</LINE>
<LINE>
<WORD coords="735,1463,865,1408">who</WORD>
<WORD coords="916,1481,1125,1407">dodges</WORD>
<WORD coords="1172,1464,1231,1421">to</WORD>
<WORD coords="1279,1464,1308,1429">a</WORD>
<WORD coords="1358,1465,1488,1409">desk</WORD>
<WORD coords="1535,1484,1839,1410">telephone</WORD>
<WORD coords="1890,1467,1984,1411">for</WORD>
</LINE>
<LINE>
<WORD coords="736,1548,1173,1493">conversational</WORD>
<WORD coords="1203,1570,1465,1514">purposes</WORD>
<WORD coords="1494,1569,1660,1515">every</WORD>
<WORD coords="1688,1569,1848,1495">forty</WORD>
<WORD coords="1875,1550,1986,1516">sec-</WORD>
</LINE>
<LINE>
<WORD coords="743,1632,777,1598">o</WORD>
<WORD coords="802,1633,838,1599">n</WORD>
<WORD coords="864,1633,901,1578">d</WORD>
<WORD coords="926,1647,987,1599">s,</WORD>
<WORD coords="1044,1652,1336,1577">obviously</WORD>
<WORD coords="1387,1635,1446,1592"> o</WORD>
</LINE>
<LINE>
<WORD coords="742,1720,1117,1664">demonstrate</WORD>
<WORD coords="1189,1720,1340,1665">what</WORD>
<WORD coords="1414,1720,1441,1686">a</WORD>
</LINE>
<LINE>
<WORD coords="743,1823,924,1748">really</WORD>
<WORD coords="963,1820,1116,1761">great</WORD>
<WORD coords="1158,1823,1294,1748">help</WORD>
<WORD coords="1335,1805,1446,1749">this</WORD>
</LINE>
<LINE>
<WORD coords="744,1889,1039,1833">invention</WORD>
<WORD coords="1077,1889,1119,1834">is</WORD>
<WORD coords="1154,1889,1212,1848">to</WORD>
<WORD coords="1247,1890,1274,1855">a</WORD>
<WORD coords="1310,1907,1446,1833">busy</WORD>
</LINE>
<LINE>
<WORD coords="742,1974,893,1938">man.</WORD>
<WORD coords="943,1974,1092,1917">This</WORD>
<WORD coords="1119,1990,1446,1918">gentleman</WORD>
</LINE>
<LINE>
<WORD coords="744,2077,1067,2001">ultimately</WORD>
<WORD coords="1120,2077,1280,2003">helps</WORD>
<WORD coords="1333,2077,1446,2024">YOll</WORD>
</LINE>
<LINE>
<WORD coords="745,2142,833,2085">fill</WORD>
<WORD coords="860,2142,958,2100">out</WORD>
<WORD coords="985,2143,1054,2108">an</WORD>
<WORD coords="1081,2163,1446,2087">Application</WORD>
</LINE>
<LINE>
<WORD coords="743,2226,840,2170">for</WORD>
<WORD coords="870,2226,1088,2171">Service</WORD>
<WORD coords="1118,2227,1307,2172">which</WORD>
<WORD coords="1332,2246,1445,2192">you</WORD>
</LINE>
<LINE>
<WORD coords="744,2327,1040,2254">recognize</WORD>
<WORD coords="1093,2311,1146,2276">as</WORD>
<WORD coords="1197,2311,1293,2256">the</WORD>
<WORD coords="1344,2312,1446,2256">old</WORD>
</LINE>
<LINE>
<WORD coords="745,2395,965,2338">income</WORD>
<WORD coords="1000,2396,1088,2355">tax</WORD>
<WORD coords="1123,2396,1315,2340">blanks</WORD>
<WORD coords="1349,2397,1446,2341">the</WORD>
</LINE>
<LINE>
<WORD coords="745,2481,1139,2422">Government</WORD>
<WORD coords="1196,2481,1330,2426">used</WORD>
<WORD coords="1385,2482,1444,2425">in</WORD>
</LINE>
<LINE>
<WORD coords="754,2565,928,2511">1919.</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="4003,3841,4153,3784">THE</WORD>
<WORD coords="4187,3842,4346,3787">NEW</WORD>
<WORD coords="4377,3843,4636,3787">YORKER</WORD>
</LINE>
<LINE>
<WORD coords="3383,6199,3583,6142">brands</WORD>
<WORD coords="3612,6200,3681,6144">of</WORD>
<WORD coords="3711,6202,3938,6144">hokum.</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="3405,4362,3776,4291"> cJ</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="3382,6548,3656,6332">1-&amp;</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="4188,6542,4630,6292">-</WORD>
</LINE>
</PARAGRAPH>
</REGION>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="3818,6536,4059,6430"> </WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
<MAP name="page0000004.djvu"/>
</BODY>
</DjVuXML>
I am having a lot of trouble extracting the text from this file in a delimited format. For example, using the XML package, I tried the following:
library(XML)
filename <- "~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml"
xmlData <- xmlParse(filename)
xmlDataFrame <- xmlToDataFrame(xmlData)
xmlDataFrame$OBJECT[2]
But this only produces one long string of text with no spaces.
I also tried searching for a specific node in the XML file:
rootnode <- xmlRoot(xmlData)
print(rootnode[[2]][[1]])
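But this just returns the entire XML file rather than a specific node.
I know I am probably missing a very simple step for turning this into a manageable data frame that I can analyze. I am happy to edit the question or provide more information if that helps. As I mentioned, I am new to this, so any advice is appreciated.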
Answer
Here is a solution using the xml2 package. It assumes your XML file is saved at the path ~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml:
library(xml2)
doc = read_xml("~/Desktop/New Yorker/1925_02_21/xml/page0000004.xml")
word_nodes = xml_find_all(doc, xpath = "//WORD")
extracted_text = paste(xml_text(word_nodes), collapse = " ")
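The xml_find_all call uses XPath, which is similar to regular expressions, but for trees. "//WORD" finds all WORD elements, wherever they are in the document. See the W3 tutorial on XPath for details.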
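As an illustration of how XPath expressions can be scoped, here is a minimal sketch that rebuilds the text one LINE at a time. It is not part of the original answer. It runs on a small inline sample that mimics the nesting of the question's file, and the names xml_string, line_nodes, and line_text are made up for this example.

```r
library(xml2)

# Hypothetical two-line sample with the same LINE/WORD nesting as the question's file.
xml_string <- '<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="738,702,858,646">that</WORD><WORD coords="899,703,1036,647">time</WORD></LINE>
<LINE><WORD coords="1076,723,1334,649">anyhow.</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>'
doc <- read_xml(xml_string)

# "//LINE" matches every LINE element anywhere in the tree.
line_nodes <- xml_find_all(doc, "//LINE")

# "./WORD" is relative: it matches only WORD children of the current LINE node.
line_text <- vapply(
  line_nodes,
  function(line) paste(xml_text(xml_find_all(line, "./WORD")), collapse = " "),
  character(1)
)

print(line_text)
```

The same pattern should work with "//PARAGRAPH" and ".//WORD" if you would rather group the words by paragraph.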
Printing extracted_text from the xml2 code returns the following:
> extracted_text
[1] "2 weeks, you will probably need one by that time anyhow. This two week's delay looms as a tremendous obstacle and you hasten breathlessly to the telephone company's office where you become part of a throng surrounding a counter for about an hour. At the end of that time you tell your story to a man at the counter who dodges to a desk telephone for conversational purposes every forty sec- o n d s, obviously \no demonstrate what a really great help this invention is to a busy man. This gentleman ultimately helps YOll fill out an Application for Service which you recognize as the old income tax blanks the Government used in 1919. THE NEW YORKER brands of hokum. \ncJ 1-& - \n "
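Since the question asks for a data frame rather than a single string, the same WORD nodes could also be arranged with one row per word. The following is a sketch under stated assumptions, not a definitive implementation: it uses a small inline sample instead of the real file, and the column names c1 to c4 are placeholders, because the question does not say what each of the four coords numbers means.

```r
library(xml2)

# Hypothetical sample: three real words plus one blank WORD, similar to the OCR noise in the file.
xml_string <- '<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>
<WORD coords="738,702,858,646">that</WORD>
<WORD coords="899,703,1036,647">time</WORD>
<WORD coords="1556,1297,1676,1242">that</WORD>
<WORD coords="3818,6536,4059,6430"> </WORD>
</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>'
doc <- read_xml(xml_string)
word_nodes <- xml_find_all(doc, "//WORD")

# One row per WORD: the trimmed text plus the raw coords attribute.
words <- data.frame(
  word = trimws(xml_text(word_nodes)),
  coords = xml_attr(word_nodes, "coords"),
  stringsAsFactors = FALSE
)

# Split "a,b,c,d" into four numeric columns (c1..c4 are placeholder names).
coord_matrix <- do.call(rbind, lapply(strsplit(words$coords, ","), as.numeric))
colnames(coord_matrix) <- paste0("c", 1:4)
words <- cbind(words["word"], as.data.frame(coord_matrix))

# Drop entries that are empty after trimming.
words <- words[nzchar(words$word), ]

# Word frequencies, one possible input for a word cloud.
word_counts <- sort(table(tolower(words$word)), decreasing = TRUE)

print(words)
print(word_counts)
```

On the real file you would replace the inline string with the path passed to read_xml. You may also want to filter out further OCR debris (for example, tokens with no letters) before counting.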