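r - How to create a for loop that extracts data from html_nodes and populates a table
Problem description
I have a series of publication identifiers from the RePEc database. I need to fetch the reference list for each one from the database, which I can do like this: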
library(rvest)  # provides read_html(), html_nodes() and the %>% pipe
identifier <- "RePEc:imf:imfwpa:01/191"
url_base <- "http://citec.repec.org/api/amf/"
url <- paste0(url_base, identifier)
get_data <- read_html(url)
references <- html_nodes(get_data, 'references') %>% html_nodes("text")
I get output that looks like this:
print(references)
{xml_nodeset (6)}
[1] <text ref="RePEc:rio:texdis:400"></text>
[2] <text ref="RePEc:fip:fednrp:9608"></text>
[3] <text ref="RePEc:nbr:nberwo:1172"></text>
[4] <text ref="RePEc:bla:ecnote:v:28:y:1999:i:3:p:335-355"></text>
[5] <text ref="RePEc:imf:imfwpa:00/69"></text>
[6] <text ref="RePEc:eee:jbfina:v:24:y:2000:i:1-2:p:203-227"></text>
I only want the individual identifiers. In other words, I just want this:
[1] "RePEc:rio:texdis:400"
[2] "RePEc:fip:fednrp:9608"
[3] "RePEc:nbr:nberwo:1172"
[4] "RePEc:bla:ecnote:v:28:y:1999:i:3:p:335-355"
[5] "RePEc:imf:imfwpa:00/69"
[6] "RePEc:eee:jbfina:v:24:y:2000:i:1-2:p:203-227"
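I tried html_text(references), but it just gave me a series of empty strings.
Once I have this data, I want to create a data frame in which each value sits next to the original identifier. In other words, I need something like this: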
identifier <- c("RePEc:imf:imfwpa:01/191", "RePEc:imf:imfwpa:01/191", "RePEc:imf:imfwpa:01/191", "RePEc:imf:imfwpa:01/191", "RePEc:imf:imfwpa:01/191", "RePEc:imf:imfwpa:01/191")
references <- c("RePEc:rio:texdis:400", "RePEc:fip:fednrp:9608", "RePEc:nbr:nberwo:1172", "RePEc:bla:ecnote:v:28:y:1999:i:3:p:335-355", "RePEc:imf:imfwpa:00/69", "RePEc:eee:jbfina:v:24:y:2000:i:1-2:p:203-227")
df <- data.frame(identifier, references)
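I need to process about 180,000 different documents. I figure that once I know how to do this once, I can write a for loop myself, but if anyone has a clever way to do this, I would really appreciate the advice. Thanks in advance for your help!
Solution
The document is XML, so I think xml2 is a better fit.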
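library(xml2)
library(magrittr)  # for the %>% pipe, which xml2 does not export
identifier <- "RePEc:imf:imfwpa:01/191"
url_base <- "http://citec.repec.org/api/amf/"
url <- paste0(url_base, identifier)
# "d1" is the prefix xml2 assigns to the document's default namespace
references <- read_xml(url) %>%
  xml_find_all("//d1:references/d1:text") %>%
  xml_attr("ref")
Output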
# [1] "RePEc:rio:texdis:400"
# [2] "RePEc:fip:fednrp:9608"
# [3] "RePEc:nbr:nberwo:1172"
# [4] "RePEc:bla:ecnote:v:28:y:1999:i:3:p:335-355"
# [5] "RePEc:imf:imfwpa:00/69"
# [6] "RePEc:eee:jbfina:v:24:y:2000:i:1-2:p:203-227"
You need to install the xml2 package for this to work:
install.packages("xml2")
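To see why html_text() came back empty while reading the ref attribute works, here is a small offline sketch. The inline XML is a made-up miniature of the API response (the element layout and the namespace URI are assumptions, not the real payload); the point is only that the identifiers live in an attribute, not in the element text.

```r
library(xml2)

# Hypothetical miniature of the response; the namespace URI is invented.
doc <- read_xml('<amf xmlns="http://example.org/amf">
  <text ref="RePEc:imf:imfwpa:01/191">
    <references>
      <text ref="RePEc:rio:texdis:400"/>
      <text ref="RePEc:fip:fednrp:9608"/>
    </references>
  </text>
</amf>')

# xml2 exposes the default namespace under the prefix "d1"
nodes <- xml_find_all(doc, "//d1:references/d1:text")

node_text <- xml_text(nodes)         # element content: empty strings
node_refs <- xml_attr(nodes, "ref")  # attribute values: the identifiers
```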
Alternatively, as Ben mentioned, with rvest you just need to add html_attr("ref") to your script:
get_data <- read_html(url)
references <- html_nodes(get_data, 'references') %>%
  html_nodes("text") %>%
  html_attr("ref")
For multiple identifiers, you can wrap the script in a function and then call it with lapply or sapply.
# function: fetch the references for one identifier as a two-column data frame
get_reference <- function(identifier) {
  url_base <- "http://citec.repec.org/api/amf/"
  url <- paste0(url_base, identifier)
  references <- read_xml(url) %>%
    xml_find_all("//d1:references/d1:text") %>%
    xml_attr("ref")
  # return the data frame explicitly instead of relying on the value of an assignment
  data.frame(identifier = identifier, references = references, stringsAsFactors = FALSE)
}
# vector of identifiers as input
identifier <- c("RePEc:imf:imfwpa:01/191","RePEc:imf:imfwpa:02/191")
# scrape and combine
df <- lapply(identifier, get_reference) %>% do.call(rbind, .)
Output
# identifier references
# 1 RePEc:imf:imfwpa:01/191 RePEc:rio:texdis:400
# 2 RePEc:imf:imfwpa:01/191 RePEc:fip:fednrp:9608
# 3 RePEc:imf:imfwpa:01/191 RePEc:nbr:nberwo:1172
# 4 RePEc:imf:imfwpa:01/191 RePEc:bla:ecnote:v:28:y:1999:i:3:p:335-355
# 5 RePEc:imf:imfwpa:01/191 RePEc:imf:imfwpa:00/69
# 6 RePEc:imf:imfwpa:01/191 RePEc:eee:jbfina:v:24:y:2000:i:1-2:p:203-227
# 7 RePEc:imf:imfwpa:02/191 RePEc:wck:wckewp:34/99
# 8 RePEc:imf:imfwpa:02/191 RePEc:nbr:nberwo:7018
# 9 RePEc:imf:imfwpa:02/191 RePEc:wop:wispod:1132-97
# 10 RePEc:imf:imfwpa:02/191 RePEc:aea:aecrev:v:88:y:1998:i:3:p:478-94
# 11 RePEc:imf:imfwpa:02/191 RePEc:mie:wpaper:341
# 12 RePEc:imf:imfwpa:02/191 RePEc:eee:inecon:v:4:y:1974:i:2:p:177-185
# 13 RePEc:imf:imfwpa:02/191 RePEc:imf:imfwpa:97/116
# 14 RePEc:imf:imfwpa:02/191 RePEc:nbr:nberwo:7539
# 15 RePEc:imf:imfwpa:02/191 RePEc:aea:aecrev:v:90:y:2000:i:2:p:161-167
# 16 RePEc:imf:imfwpa:02/191 RePEc:eee:inecon:v:50:y:2000:i:1:p:51-71
# 17 RePEc:imf:imfwpa:02/191 RePEc:nbr:nberwo:5427
# 18 RePEc:imf:imfwpa:02/191 RePEc:eee:ecochp:5-58
# 19 RePEc:imf:imfwpa:02/191 RePEc:nbr:nberwo:6591
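With around 180,000 documents, some requests will probably fail and some documents may list no references at all, in which case data.frame() would complain about differing row counts. The following is a minimal sketch of a more defensive variant, not part of the original answer: refs_to_df and get_reference_safe are hypothetical helper names, and the inline XML used for the offline check is made up.

```r
library(xml2)

# Build the two-column data frame from an already parsed document.
# Yields a zero-row data frame when the document lists no references.
refs_to_df <- function(identifier, doc) {
  refs <- xml_attr(xml_find_all(doc, "//d1:references/d1:text"), "ref")
  data.frame(identifier = rep(identifier, length(refs)),
             references = refs,
             stringsAsFactors = FALSE)
}

# Fetch one identifier; on any error (network failure, malformed XML)
# return NULL so that one bad record does not stop the whole run.
get_reference_safe <- function(identifier,
                               url_base = "http://citec.repec.org/api/amf/") {
  tryCatch(
    refs_to_df(identifier, read_xml(paste0(url_base, identifier))),
    error = function(e) NULL
  )
}

# Offline check with a hypothetical document that has no references.
empty_doc <- read_xml(
  '<amf xmlns="http://example.org/amf"><text ref="x"/></amf>'
)
empty_df <- refs_to_df("x", empty_doc)

# Intended usage (needs network access); rbind() skips the NULL entries:
# df <- do.call(rbind, lapply(identifier, get_reference_safe))
```

Splitting the parsing from the download keeps the parsing step testable without hitting the server. For a run this large it may also be worth pausing between requests and saving intermediate results, but that is left out here.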