Scraping script returns errors such as "subscript out of bounds" and "object not found" when run on a different page of the same site

Problem description

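I'm having trouble adapting this working script, which scrapes data from fangraphs, to a different page on the same site in order to get minor league data. I changed the URL and removed the part that replaces percent signs, since they are not an issue on my particular page...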

Here is the original script:

suppressMessages(library(dplyr))
suppressMessages(library(rvest))

### Load data from webpage

url <- "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=30&type=2&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_1000"

l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')

### Extract table from html and remove 'bad' rows
fangraphs <- html_table(l1, fill = TRUE)[[12]] 
fangraphs <- fangraphs[-c(1,3),]

# Extract column names
columnNames <- as.list(fangraphs[1,])
# Take care of symbols in column names
columnNames <- gsub("%", ".p", columnNames)
columnNames <- gsub("/", "per", columnNames)

# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]

fangraphs[] <- sapply(fangraphs, function(x) gsub(" %","",x))
fangraphs[4:19] <- sapply(fangraphs[4:19],as.numeric)

Here is the script I edited to fit this URL ( https://www.fangraphs.com/minorleaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=c,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,30,46,45,44,32,23&season=2018&team=0&players=&page=1_3000 ):

suppressMessages(library(dplyr))
suppressMessages(library(rvest))

### Load data from webpage

url <- "https://www.fangraphs.com/minorleaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=c,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,30,46,45,44,32,23&season=2018&team=0&players=&page=1_3000"

l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')

### Extract table from html and remove 'bad' rows
fangraphs <- html_table(l1, fill = TRUE)[[12]] 
fangraphs <- fangraphs[-c(1,3),]

# Extract column names
columnNames <- as.list(fangraphs[1,])

# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]

fangraphs[3:26] <- sapply(fangraphs[3:26],as.numeric)

I get this error output:

> suppressMessages(library(dplyr))
> suppressMessages(library(rvest))
> 
> ### Load data from webpage
> 
> url <- "https://www.fangraphs.com/minorleaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=c,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,30,46,45,44,32,23&season=2018&team=0&players=&page=1_3000"
> 
> l1 <- read_html(url)
> l1 <- html_nodes(l1, 'table')
> 
> ### Extract table from html and remove 'bad' rows
> fangraphs <- html_table(l1, fill = TRUE)[[12]] 
Error in html_table(l1, fill = TRUE)[[12]] : subscript out of bounds
> fangraphs <- fangraphs[-c(1,3),]
Error: object 'fangraphs' not found
> 
> # Extract column names
> columnNames <- as.list(fangraphs[1,])
Error in as.list(fangraphs[1, ]) : object 'fangraphs' not found
> 
> # Rename data frame and remove row with column names
> colnames(fangraphs) <- columnNames
Error in colnames(fangraphs) <- columnNames : 
  object 'fangraphs' not found
> fangraphs <- fangraphs[-1,]
Error: object 'fangraphs' not found
> 
> fangraphs[3:26] <- sapply(fangraphs[3:26],as.numeric)
Error in lapply(X = X, FUN = FUN, ...) : object 'fangraphs' not found

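It did not get any better when I changed the selector in html_nodes from 'table' to '#MinorBoard1_dg1_ctl00 .rgHeader , .grid_line_regular', which I obtained with SelectorGadget (although that selection also includes the name of each column).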
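The first error means that position 12 does not exist in the list being indexed, so a reasonable first step is to check how many tables were actually parsed before hard-coding an index. The following is only a minimal sketch of that check in base R: `tables` is a hypothetical stand-in for the result of `html_table(l1, fill = TRUE)`, and `pick_table` is an illustrative helper, not part of rvest.

```r
# Hypothetical stand-in for html_table(l1, fill = TRUE): a list of data frames.
tables <- list(
  data.frame(X1 = "menu"),
  data.frame(X1 = c("Name", "A"), X2 = c("Team", "B"), X3 = c("AVG", ".285"))
)

# How many tables were parsed, and how many columns does each one have?
n_tables <- length(tables)
n_cols <- vapply(tables, ncol, integer(1))

# Illustrative helper (an assumption, not an rvest function):
# index into the list only after checking that the position exists.
pick_table <- function(tables, index) {
  if (index > length(tables)) {
    stop("only ", length(tables), " table(s) found; cannot take table ", index)
  }
  tables[[index]]
}

# Position 2 exists in this stand-in list, so this succeeds;
# pick_table(tables, 12) would stop with a descriptive message instead.
stats <- pick_table(tables, 2)
```

Inspecting `n_tables` and `n_cols` on the real page would show whether the stats table sits at a different position than it does on the major league page.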

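A final, separate question is whether I need some code to fix the columns that contain a "." before they are converted to numeric columns (I'm talking about the ISO, BABIP and AVG stats here).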
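On the "." question, base R's `as.numeric` parses text with a leading decimal point directly, so values of that shape should not need cleaning first. The sketch below uses made-up values in the style of those columns (not data from the page) and contrasts them with percent-formatted text, which does need the symbol stripped.

```r
# Made-up values in the style of AVG/ISO/BABIP columns scraped as text.
avg <- c(".285", ".310", "1.000")
avg_num <- as.numeric(avg)  # leading "." is parsed as a decimal point

# Text with a percent sign is not numeric and becomes NA (with a warning),
# which is why the original script strips " %" before converting.
pct <- c("12.5 %", "8.0 %")
pct_raw <- suppressWarnings(as.numeric(pct))
pct_num <- as.numeric(gsub(" ?%", "", pct))
```

If a converted column still comes back as all `NA`, the cause is more likely stray characters or a leftover header row than the decimal point itself.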

Tags: r, dataframe, web-scraping
