Scraping an interactive table with rvest

Question

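I'm trying to scrape the second table shown at the URL below, and I'm running into a problem that may be related to the table's interactive nature.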

div_stats_standard appears to be the id of the table I'm interested in.

The code runs without errors, but it returns an empty list.

library(rvest)

url <- 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'

data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[(@id = "div_stats_standard")]') %>%
  html_table()

Can anyone tell me where I'm going wrong?

Tags: r, web-scraping, rvest

Solution

Look for the table nodes themselves instead of the div that wraps them.

library(rvest)

url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"

page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)

Result:

head(table)
                   Playing Time Playing Time Playing Time Performance Performance
1       Squad # Pl           MP       Starts          Min         Gls         Ast
2     Arsenal   26           27          297        2,430          39          26
3 Aston Villa   28           27          297        2,430          33          27
4 Bournemouth   25           28          308        2,520          27          17
5    Brighton   23           28          308        2,520          28          19
6     Burnley   21           28          308        2,520          32          23
  Performance Performance Performance Performance Per 90 Minutes Per 90 Minutes
1          PK       PKatt        CrdY        CrdR            Gls            Ast
2           2           2          64           3           1.44           0.96
3           1           3          54           1           1.22           1.00
4           1           1          60           3           0.96           0.61
5           1           1          44           2           1.00           0.68
6           2           2          53           0           1.14           0.82
  Per 90 Minutes Per 90 Minutes Per 90 Minutes Expected Expected Expected Per 90 Minutes
1            G+A           G-PK         G+A-PK       xG     npxG       xA             xG
2           2.41           1.37           2.33     35.0     33.5     21.3           1.30
3           2.22           1.19           2.19     30.6     28.2     22.0           1.13
4           1.57           0.93           1.54     31.2     30.5     20.8           1.12
5           1.68           0.96           1.64     33.8     33.1     22.4           1.21
6           1.96           1.07           1.89     30.9     29.4     18.9           1.10
  Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1             xA          xG+xA           npxG        npxG+xA
2           0.79           2.09           1.24           2.03
3           0.81           1.95           1.04           1.86
4           0.74           1.86           1.09           1.83
5           0.80           2.01           1.18           1.98
6           0.68           1.78           1.05           1.73
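
In the output above, the real column names (Squad, # Pl, MP, ...) ended up in the first data row, underneath the group labels (Playing Time, Performance, ...). Here is a minimal sketch of one way to fold the two header levels into one. It uses a small hand-made data frame, raw (a hypothetical stand-in with a few values copied from the rows above), rather than the live page, so the exact shape of what html_table() returns for you may differ.

```r
# Hypothetical stand-in for the scraped table: group labels as column
# names, sub-headers in the first row, everything stored as text.
raw <- data.frame(
  V1 = c("Squad", "Arsenal", "Aston Villa"),
  V2 = c("# Pl", "26", "28"),
  V3 = c("MP", "27", "27"),
  V4 = c("Min", "2,430", "2,430"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Playing Time", "Playing Time")

# Take the first element of every column (the sub-header) and glue it
# onto the group label; trimws() tidies columns that have no group label.
sub_headers <- unname(vapply(raw, function(col) col[1], character(1)))
names(raw) <- trimws(paste(names(raw), sub_headers))

# Drop the sub-header row and reset the row names.
clean <- raw[-1, ]
rownames(clean) <- NULL

# Strip thousands separators and convert everything except Squad to numeric.
clean[-1] <- lapply(clean[-1], function(x) as.numeric(gsub(",", "", x)))

str(clean)
```

Pasting the group label in front keeps names unique when the same sub-header (for example Gls) shows up under more than one group.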

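As for the empty list in the question: one common cause on pages like this is that a table's markup is delivered inside an HTML comment and only turned into real nodes by JavaScript, so read_html() never sees div_stats_standard as an element. Whether that is what this particular page does is an assumption here, not something confirmed above. The sketch below uses a made-up inline HTML string (not the real page; the ids all_stats_standard and stats_standard and the row "Player A" are invented for illustration) to show the symptom and one possible workaround.

```r
library(rvest)

# Made-up page: one ordinary table, and a second one hidden in a comment.
html <- '<html><body>
<table id="visible"><tr><th>Squad</th><th>Gls</th></tr>
<tr><td>Arsenal</td><td>39</td></tr></table>
<div id="all_stats_standard">
<!--
<div id="div_stats_standard">
<table id="stats_standard"><tr><th>Player</th><th>Gls</th></tr>
<tr><td>Player A</td><td>10</td></tr></table>
</div>
-->
</div>
</body></html>'

page <- read_html(html)

# The commented-out div is not an element, so this selector matches nothing.
n_matches <- length(html_nodes(page, xpath = '//*[@id = "div_stats_standard"]'))

# Workaround: collect the text of every comment node, parse that text as
# HTML in its own right, and look for tables in the result.
comments <- html_text(html_nodes(page, xpath = "//comment()"))
hidden <- read_html(paste(comments, collapse = ""))
player_table <- html_table(html_nodes(hidden, "table")[[1]])

print(n_matches)
print(player_table)
```

If the real page works this way, the same three lines (select //comment(), html_text(), read_html()) could be applied to the page object from the solution above. If it doesn't, the table is probably built by JavaScript at load time and a different approach is needed.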