Creating multiple data frames when web scraping in R

Problem description

I'm still very new to coding, and to web scraping in particular, but here is what I'm trying to do:

I want to scrape fbRef.com and create one data frame of match data for each Premier League team.

I know this works for getting the team links:

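library(rvest)

# League overview page that lists every Premier League squad
page <- "https://fbref.com/en/comps/9/Premier-League-Stats"
scraped_page <- read_html(page)

# Collect the href of each squad link in the standard-stats table
teamLinks <- scraped_page %>%
  html_nodes("#stats_squads_standard_for a") %>%
  html_attr("href")

# The hrefs are expected to be root-relative (start with "/"), so no trailing slash here
teamLinks <- paste0("https://fbref.com", teamLinks)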

I can also build a list of the team names from the same page:

Team <- scraped_page%>%
   html_nodes('#stats_squads_standard_for .left')%>%
   html_text()%>%
   as.character()

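Now I want to create a separate data frame for each team by scraping each team's page for specific stats. I have a for loop that gets the stats I need, but I don't know how to keep the results separate or how to name each data frame after its team.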

for (i in 1:length(teamLinks)){
  url <- teamLinks[i]
  scraped_url <- read_html(url)
  Team <- scraped_page%>%
    html_nodes('#stats_squads_standard_for .left')%>%
    html_text()%>%
    as.character()
  df_name <- paste0(Team[i])
  df <- {
    Comp <- scraped_url%>%
      html_nodes(comp)%>%
      html_text()
    Venue <- scraped_url%>%
      html_nodes(venue)%>%
      html_text()
    Result <- scraped_url%>%
      html_nodes(result)%>%
      html_text()
    Goals_For <- scraped_url%>%
      html_nodes(GF)%>%
      html_text()
    Goals_Against <- scraped_url%>%
      html_nodes(GA)%>%
      html_text()
    Opponent <- scraped_url%>%
      html_nodes(Opp)%>%
      html_text()
    xG <- scraped_url%>%
      html_nodes(xg)%>%
      html_text()
    xGA <- scraped_url%>%
      html_nodes(xga)%>%
      html_text()
    Possession <- scraped_url%>%
      html_nodes(poss)%>%
      html_text()
    Formation <- scraped_url%>%
      html_nodes(formation)%>%
      html_text()
    data.frame(Comp,Venue,Goals_For,Goals_Against,
               Opponent,xG,xGA,Possession,Formation)
  }
}

Any help cleaning up that for loop would also be much appreciated.

These are the values of the CSS selector variables used above:

comp <- ".left:nth-child(3) a"
venue <- ".left:nth-child(6)"
result <- "#matchlogs_for .left+ .center"
GF <- "#matchlogs_for .right:nth-child(8)"
GA <- "#matchlogs_for .right:nth-child(9)"
Opp <- ".left:nth-child(10)"
xg <- "#matchlogs_for td.left+ .right"
xga <- "#matchlogs_for .right:nth-child(12)"
poss <- "#matchlogs_for td:nth-child(13)"
formation <- ".left:nth-child(16)"

Thank you in advance!

Tags: html, r, web-scraping, rvest

Solution


You can create a list before the loop and save each data frame into that list, like this:

TeamList <- vector("list", length(teamLinks))  # one empty slot per team

for (i in seq_along(teamLinks)){

  # [...] your scraping code that leads to a "df"

  TeamList[[i]] <- df  # store this team's data frame

}
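
Since the question also asks about tidying the loop body, here is a minimal sketch of one way to do it: keep the selectors in a named character vector and build all columns with a single helper. The names `scrape_columns`, `demo_page` and `demo_selectors` are hypothetical, and the inline HTML is made-up markup standing in for a team page. The sketch assumes every selector matches the same number of nodes, because `data.frame()` needs columns of equal length.

```r
library(rvest)

# Hypothetical helper: one data frame per parsed page, one column per selector.
# `selectors` is a named character vector; the names become the column names.
scrape_columns <- function(page, selectors) {
  cols <- lapply(selectors, function(css) {
    html_text(html_nodes(page, css), trim = TRUE)
  })
  # Columns must have equal length; fail early with a readable message
  lens <- lengths(cols)
  if (length(unique(lens)) > 1) {
    stop("selectors matched different numbers of nodes: ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Tiny inline page standing in for a real team page (made-up markup and values)
demo_page <- read_html(paste0(
  '<table id="matchlogs_for">',
  '<tr><td class="venue">Home</td><td class="result">W</td></tr>',
  '<tr><td class="venue">Away</td><td class="result">D</td></tr>',
  '</table>'
))

demo_selectors <- c(Venue  = "#matchlogs_for .venue",
                    Result = "#matchlogs_for .result")

demo_df <- scrape_columns(demo_page, demo_selectors)
print(demo_df)
```

With real pages, the whole loop could then shrink to something like `TeamList <- lapply(teamLinks, function(u) scrape_columns(read_html(u), selectors))`, where `selectors` would be a named vector built from the selector values in the question. That call has not been tested against fbref.com.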

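Then name each data frame in TeamList after its team and, if you want them as separate objects, turn the list of data frames into individual data frames with list2env():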

names(TeamList) <- Team

list2env(TeamList, envir=.GlobalEnv)
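
The naming step can be tried without any scraping. The sketch below uses made-up toy data frames in place of the scraped ones, and writes into a scratch environment (`demo_env`, a hypothetical name) rather than the global one. It also shows one caveat: a team name containing a space is not a syntactic R name, so the resulting object would have to be accessed with backticks unless the names are cleaned first, for example with `make.names()`.

```r
# Toy stand-ins for the scraped data frames (values are made up)
TeamList <- list(
  data.frame(Venue = c("Home", "Away"), Result = c("W", "D")),
  data.frame(Venue = c("Away", "Home"), Result = c("L", "W"))
)
Team <- c("Arsenal", "Manchester City")

names(TeamList) <- Team

# Keeping the list is often enough: elements can be pulled out by team name
city <- TeamList[["Manchester City"]]

# If separate objects are really wanted, make the names syntactic first
demo_env <- new.env()
list2env(setNames(TeamList, make.names(Team)), envir = demo_env)

# Lists the objects created in the scratch environment
print(ls(demo_env))
```

Keeping everything in the named list also makes later steps such as `lapply(TeamList, nrow)` straightforward, which is harder once the data frames are loose objects in the workspace.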
