Scraping multiple URLs with rvest

Question

How do I scrape multiple URLs when using read_html in rvest? The goal is to obtain a single document consisting of the body text from the various URLs, on which to run various analyses.

I tried concatenating the URLs:

    url <- c("https://www.vox.com/","https://www.cnn.com/")
    page <- read_html(url)
    page
    story <- page %>%
      html_nodes("p") %>%
      html_text()

After read_html I get the error:

 Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
 Expecting a single string value: [type=character; extent=3].

Not surprising, since read_html probably only handles one path at a time. However, is there a different function or transformation I can use to scrape multiple pages at once?

Tags: html, r, screen-scraping, rvest

Solution


You can use map (or lapply in base R) to loop over each element of url; here is an example:

library(rvest)
library(purrr)

url <- c("https://www.vox.com/", "https://www.bbc.com/")
page <- map(url, ~ read_html(.x) %>% html_nodes("p") %>% html_text())
str(page)
#List of 2
# $ : chr [1:22] "But he was acquitted on the two most serious charges he faced." "Health experts say it’s time to prepare for worldwide spread on all continents." "Wall Street is waking up to the threat of coronavirus as fears about the disease and its potential global econo"| __truncated__ "Johnson, who died Monday at age 101, did groundbreaking work in helping return astronauts safely to Earth." ...
# $ : chr [1:19] "" "\n                                                            The ex-movie mogul is handcuffed and led from cou"| __truncated__ "" "27°C" ...
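
If you prefer base R, lapply does the same job without purrr; a minimal equivalent of the map call above (assuming the pipe is available via rvest/magrittr, which the library(rvest) call provides):

# Same loop as the map() version, written with base R's lapply:
# apply the scraper to each url element and collect the results.
page <- lapply(url, function(u) {
  read_html(u) %>%
    html_nodes("p") %>%
    html_text()
})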

The returned object is a list.
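
Since the question asks for a single document of body text, here is a minimal sketch of one way to get there, assuming a plain concatenation of all paragraphs is acceptable for the downstream analysis:

# Flatten the per-page character vectors and join every
# paragraph from every page into one string, one per line.
full_text <- paste(unlist(page), collapse = "\n")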

PS: I have changed the second url element, because "https://www.cnn.com/" returns NULL for html_nodes("p") %>% html_text().
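
If some URLs are flaky, a sketch using purrr::possibly so that one unreachable page yields NA instead of aborting the whole map; note this catches errors such as timeouts or parse failures, and will not help with a page that merely has no p nodes, as in the cnn.com case. The name safe_scrape is a hypothetical helper, not part of any package:

# safe_scrape: on error, return NA_character_ for that URL
# instead of stopping the whole loop.
safe_scrape <- possibly(
  function(u) read_html(u) %>% html_nodes("p") %>% html_text(),
  otherwise = NA_character_
)
page <- map(url, safe_scrape)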

