Create a function that loops over page numbers

Problem description

I have a script that imports data, shown below:

library(tidyverse)
library(rvest)
library(magrittr)

page_number <- 1:20

# Parse the first page of the composite rankings
base_url <- read_html("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=1")

# Extract the selected nodes' text and reshape it into a 4-column data frame
rankings <- base_url %>%
  html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split("   ") %>%
  unlist() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  as.data.frame()

You'll notice that base_url ends with &Page=1. Well, I'm trying to do this for 20 pages, hence:

page_number <- 1:20

What is the most efficient way to loop these numbers into the URL without writing 20 separate chunks of code?

Tags: r, web-scraping, dplyr, tidyverse, rvest

Solution

You can use paste0() or sprintf() to construct all of the URLs:

all_urls <- paste0("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=", 1:20)
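
For reference, the same vector could be built with sprintf(). One thing to watch for: the URL-encoded path already contains literal percent signs (the %2F sequences), so they have to be doubled to %% inside the format string to keep sprintf() from reading them as format specifiers:

# sprintf() alternative: %d is replaced by each page number;
# the literal %2F sequences must be escaped as %%2F in the format string
all_urls <- sprintf(
  "https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%%2FViews%%2FSkyNet%%2FPlayerSportRanking%%2F_SimpleSetForSeason.ascx&Page=%d",
  1:20
)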

You can then loop over each URL and extract the data you need.

library(tidyverse)
library(rvest)

# Apply the same scraping pipeline to every URL; map() returns a list of
# 20 data frames, one per page
rankings <- map(all_urls, ~ .x %>%
  read_html() %>%
  html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split("   ") %>%
  unlist() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  as.data.frame())
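
The map() call above returns a list of 20 data frames, one per page. If you would rather end up with a single combined data frame, a minimal sketch of one way to do it (not part of the original answer) is shown below; scrape_page() is a hypothetical helper that wraps the same pipeline, and the Sys.sleep(1) is just a polite pause between the 20 requests:

library(tidyverse)
library(rvest)

# Hypothetical helper: scrape one page and return a 4-column data frame
scrape_page <- function(url) {
  Sys.sleep(1)  # brief pause between requests to be polite to the server
  read_html(url) %>%
    html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
    html_text() %>%
    str_trim() %>%
    str_split("   ") %>%
    unlist() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame()
}

# map_dfr() row-binds the per-page data frames into one data frame
rankings <- map_dfr(all_urls, scrape_page)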
