r - Scrape and Loop with Rvest
Problem description
I have reviewed several answers to similar questions on SO related to this topic, but none of them seems to work for me.
(loop across multiple urls in r with rvest)
(Harvest (rvest) multiple HTML pages from a list of urls)
I have a list of URLs and I wish to grab the table from each and append it to a master dataframe.
## get all urls into one list
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
urls[[i]] <- url
}
### loop over the urls and get the table from each page
table<- data.frame()
for (j in urls) {
tbl<- urls[j] %>%
read_html() %>%
html_node("table") %>%
html_table()
table[[j]] <- tbl
}
The first section works as expected and gets the list of URLs I want to scrape. The second loop, however, fails with the following error:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
Any suggestions on how to correct this error and get the 3 tables combined into a single data frame? I appreciate any tips or pointers.
Solution
Try this:
library(tidyverse)
library(rvest)
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
urls[[i]] <- url
}
### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {   # the for loop advances j itself, so no manual counter is needed
  tbl[[j]] <- urls[[j]] %>%    # [[j]] extracts the URL string; its table becomes element j of the tbl list
    read_html() %>%
    html_node("table") %>%
    html_table()
}
#convert list to data frame
tbl <- do.call(rbind, tbl)
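The same pattern of extracting one table per page and then stacking them can also be written without an explicit counter. The following is only a sketch under assumptions: the two HTML strings are hypothetical stand-ins for the downloaded pages (so the example runs without network access), and the column names Player and Goals are invented for illustration rather than taken from the real site.

```r
library(rvest)

# Hypothetical stand-ins for the scraped pages: literal HTML strings,
# which read_html() parses just like a downloaded document.
pages <- c(
  "<table><tr><th>Player</th><th>Goals</th></tr><tr><td>A</td><td>3</td></tr></table>",
  "<table><tr><th>Player</th><th>Goals</th></tr><tr><td>B</td><td>5</td></tr></table>"
)

# lapply() visits each page and returns a list with one table per page.
tables <- lapply(pages, function(p) {
  p %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
})

# Stack the per-page tables into a single data frame.
master <- do.call(rbind, tables)
print(master)
```

With real pages, the character vector of URLs would take the place of pages. Note that rbind requires every table to have the same columns.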
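The line table[[j]] <- tbl at the end of the for loop in the original code is unnecessary, because each URL's table is already stored as an element of the tbl list by the assignment that begins with tbl[[j]] <- urls[[j]].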
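The error message in the question is consistent with how R indexes lists. Here is a small base R sketch of that behaviour. The example.com URLs are hypothetical placeholders, and nothing is downloaded.

```r
# Hypothetical URLs used only for illustration; nothing is downloaded.
urls <- list("https://example.com/a", "https://example.com/b")

# for (j in urls) hands out the elements themselves, so j is a URL string,
# not a position in the list.
for (j in urls) {
  stopifnot(is.character(j))
}

# Indexing the unnamed list by that string finds no match and yields a
# one-element list holding NULL, which is an object of class "list".
by_value <- urls[j]
print(class(by_value))
print(is.null(by_value[[1]]))

# Single brackets keep the list wrapper even with a numeric index,
# while double brackets return the element itself.
print(class(urls[1]))
print(class(urls[[1]]))
```

This is why the corrected loop iterates over seq_along(urls) and extracts each URL with double brackets before passing it to read_html().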