How to extract content from multiple tables on a website whose URLs contain only the month and year

Problem description

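This is a follow-up to my previous question:

How to use rvest to extract content between div tags and then bind rows

The page I am trying to extract data from, between div tags, is on this site: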

http://bigbashboard.com/rankings/batsmen

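This is a different page from the one in my previous question (although it is still the same site). The main difference is that the date in the URL appears only as year/month, like this: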

http://bigbashboard.com/rankings/batsmen/2020/10

The page in my previous question, by contrast, used year/month/day, like this:

http://bigbashboard.com/rankings/bbl/batsmen/2020/01/08

I still want to extract the same data, which appears between div tags on the left-hand side of the page, like this:

Batsmen

1 Lokesh Rahul 167
2 Ravija Sandaruwan 150
3 David Warner 143

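I also need the data from the table on the right-hand side, bound to the data above so that the result looks like this, including the date of the page it came from: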

   Date    Rank   Name               Points  Dates                  I   R       HS  Ave     SR      4s  6s  100s  50s
 Oct-20     1     Lokesh Rahul       167     Nov 2018 - Oct 2020    47  1910    132 50.26   141.38  171 76  2     17
 Oct-20     2     Ravija Sandaruwan  150     Jan 2019 - Feb 2020    15  577     103 44.38   165.80  52  36  1     4
 Oct-20     3     David Warner       143     Jan 2019 - Sep 2020    33  1475    100 61.46   138.89  128 39  2     16

I tried using the code that was provided as the solution to my previous question:

library(rvest)
library(xml2)
library(dplyr)
library(furrr)

batsmen <- function(x) {
  x <- html_nodes(x, "div.cf.rankings-page div div ol li a")
  xml_remove(html_nodes(x, "span.rank small, span[class^='pos'] em"))
  score <- html_text(html_nodes(x, "span.rank"))
  rank <- html_text(html_nodes(x, "span[class^='pos']"), trim = TRUE)
  xml_remove(html_nodes(x, "span"))
  tibble(Rank = rank, Name = html_text(x), Points = score)
}

stats_table <- function(x) {
  as_tibble(html_table(x)[[1L]])
}

read_rankings <- function(url) {
  ymd <- as.Date(paste0(tail(strsplit(url, "/")[[1L]], 3L), collapse = "-"))
  read_html(url) %>% {bind_cols(Date = ymd, batsmen(.), stats_table(.))}
}

mas_url <- "http://bigbashboard.com/rankings/batsmen"

timeline <- 
  read_html(mas_url) %>% 
  html_nodes("div.timeline span a") %>% 
  html_attr("href") %>% 
  url_absolute(mas_url)

# Use parallel processing for speed.
plan(multiprocess)
future_map_dfr(timeline[1:100], read_rankings) # I only scrape a few links for test.

However, this produces an error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

I don't understand why this happens or how to fix it. I assume it may be because the date format is different.

Tags: r, web-scraping, rvest

Solution


The code below works for all three cases.

library(rvest)
library(xml2)
library(dplyr)
library(furrr)

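# Build one tibble of Rank / Name / Points per ranking list on the page.
batsmen <- function(x) {
  # Section titles (e.g. "men") come from the name attribute of the anchors.
  nms <- html_attr(html_nodes(x, "div.cf > a"), "name")
  x <- html_nodes(x, "div.cf.rankings-page")
  # Drop the nested small/em elements so they do not end up in the extracted text.
  xml_remove(html_nodes(x, "li span.rank small, li span[class^='pos'] em"))
  x <- Map(function(i, nm) {
    i <- html_nodes(i, "li a")
    score <- html_text(html_nodes(i, "span.rank"))
    rank <- html_text(html_nodes(i, "span[class^='pos']"), trim = TRUE)
    # With the spans removed, only the player name is left in each link.
    xml_remove(html_nodes(i, "span"))
    tibble(Title = nm, Rank = rank, Name = html_text(i), Points = score)
  }, x, nms)
  bind_rows(x)
}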

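# Read every stats table on the page and stack them; make.unique() renames
# duplicated headers (e.g. a second Ave becomes Ave.1).
stats_table <- function(x) {
  as_tibble(bind_rows(
    lapply(html_table(x), function(df) setNames(df, make.unique(names(df))))
  ))
}

# Return the timeline links as a named vector: values are absolute URLs,
# names are the date labels shown on the page.
timeline <- function(mas_url) {
  links <- read_html(mas_url) %>% html_nodes("div.timeline span a")
  out <- links %>% html_attr("href") %>% url_absolute(mas_url)
  setNames(out, html_text(links))
}

# Scrape one page; the date label is passed in rather than parsed from the URL.
read_rankings <- function(url, time) {
  read_html(url) %>% {bind_cols(Date = time, batsmen(.), stats_table(.))}
}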

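# Use parallel processing for speed.
# plan(multiprocess) is deprecated in recent releases of the future package;
# multisession is the portable replacement.
plan(multisession)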

Case 1: the page has men's rankings only

# men only
future_imap_dfr(timeline("http://bigbashboard.com/rankings/bbl/batsmen")[1:10], ~read_rankings(.x, .y))

Output

# A tibble: 996 x 15
   Date      Title Rank  Name           Points Dates                         I     R    HS   Ave    SR  `4s`  `6s` `100s` `50s`
   <chr>     <chr> <chr> <chr>          <chr>  <chr>                     <int> <int> <int> <dbl> <dbl> <int> <int>  <int> <int>
 1 8 Feb '20 men   1     Matthew Wade   125    22 Dec 2018 - 30 Jan 2020    23   943   130  44.9  155.    78    36      1     9
 2 8 Feb '20 men   2     Marcus Stoinis 120    21 Dec 2018 - 08 Feb 2020    30  1238   147  53.8  134.   111    39      1    10
 3 8 Feb '20 men   3     D'Arcy Short   116    22 Dec 2018 - 30 Jan 2020    24   994   103  49.7  137.    93    36      1     9
 4 8 Feb '20 men   4     Alex Hales     115    17 Dec 2019 - 06 Feb 2020    17   576    85  38.4  147.    59    23      0     6
 5 8 Feb '20 men   5     Aaron Finch    89     07 Jan 2019 - 27 Jan 2020    17   583   109  36.4  130.    41    24      1     4
 6 8 Feb '20 men   6     Josh Inglis    87     26 Dec 2018 - 26 Jan 2020    18   517    73  28.7  149.    53    19      0     5
 7 8 Feb '20 men   7     Travis Head    87     11 Jan 2019 - 01 Feb 2020    10   291    79  29.1  132.    22    13      0     1
 8 8 Feb '20 men   8     Josh Philippe  84     22 Dec 2018 - 08 Feb 2020    31   791    86  34.4  140.    76    23      0     7
 9 8 Feb '20 men   9     Shaun Marsh    82     24 Jan 2019 - 21 Jan 2020    15   547    96  39.1  128.    45    19      0     4
10 8 Feb '20 men   10    Chris Lynn     78     19 Dec 2018 - 27 Jan 2020    27   772    94  32.2  137.    64    44      0     6
# ... with 986 more rows

Case 2: men's and women's rankings on the same page

# men and women
future_imap_dfr(timeline("http://bigbashboard.com/rankings/batsmen")[1:10], ~read_rankings(.x, .y))

# A tibble: 2,000 x 15
   Date    Title Rank  Name              Points Dates                   I     R    HS   Ave    SR  `4s`  `6s` `100s` `50s`
   <chr>   <chr> <chr> <chr>             <chr>  <chr>               <int> <int> <int> <dbl> <dbl> <int> <int>  <int> <int>
 1 Oct '20 men   1     Lokesh Rahul      167    Nov 2018 - Oct 2020    47  1910   132  50.3  141.   171    76      2    17
 2 Oct '20 men   2     Ravija Sandaruwan 150    Jan 2019 - Feb 2020    15   577   103  44.4  166.    52    36      1     4
 3 Oct '20 men   3     David Warner      143    Jan 2019 - Sep 2020    33  1475   100  61.5  139.   128    39      2    16
 4 Oct '20 men   4     Kamran Khan       135    Jan 2019 - Feb 2020    21   630    88  31.5  135.    50    39      0     5
 5 Oct '20 men   5     Devdutt Padikkal  135    Nov 2019 - Sep 2020    15   691   122  57.6  167.    72    35      1     7
 6 Oct '20 men   6     Devon Conway      121    Dec 2018 - Jan 2020    20   906   105  56.6  145.   113    19      2     5
 7 Oct '20 men   7     Jos Buttler       121    Oct 2018 - Oct 2020    23   817    89  37.1  145.    93    32      0     8
 8 Oct '20 men   8     Virat Kohli       119    Nov 2018 - Sep 2020    35  1174   100  40.5  141.    90    43      1     8
 9 Oct '20 men   9     Kevin O'Brien     119    Oct 2018 - Sep 2020    38  1145   124  31.0  158.   107    59      1     5
10 Oct '20 men   10    Eoin Morgan       118    Oct 2018 - Oct 2020    34  1008    91  38.8  165.    69    66      0     8
# ... with 1,990 more rows

Case 3: all-rounders

# all-rounders
future_imap_dfr(timeline("http://bigbashboard.com/rankings/bbl/all-rounders")[1:10], ~read_rankings(.x, .y))

# A tibble: 547 x 13
   Date      Title Rank  Name             Points Dates                         M     R   Ave    SR     W  Econ Ave.1
   <chr>     <chr> <chr> <chr>            <chr>  <chr>                     <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
 1 8 Feb '20 men   1     D'Arcy Short     70     22 Dec 2018 - 30 Jan 2020    24   994  49.7  137.    16  8.61  29.1
 2 8 Feb '20 men   2     Travis Head      49     11 Jan 2019 - 01 Feb 2020    11   291  29.1  132.     4  8.08  24.2
 3 8 Feb '20 men   3     Mohammad Nabi    40     20 Dec 2018 - 27 Jan 2020    20   388  29.8  129.    13  7.9   30.4
 4 8 Feb '20 men   4     Chris Morris     38     21 Dec 2019 - 06 Feb 2020    15   112  12.4  147.    22  8.01  19.4
 5 8 Feb '20 men   5     Glenn Maxwell    37     21 Dec 2018 - 08 Feb 2020    30   729  36.4  146.    13  7.36  31.2
 6 8 Feb '20 men   6     Daniel Sams      35     21 Dec 2018 - 06 Feb 2020    31   230   9.2  119.    45  8.19  17.3
 7 8 Feb '20 men   7     Ben Cutting      33     19 Dec 2018 - 27 Jan 2020    28   466  24.5  137.    23  8.92  27.5
 8 8 Feb '20 men   8     Mitchell Marsh   28     20 Dec 2018 - 26 Jan 2020    21   504  31.5  132.     6  9.56  43  
 9 8 Feb '20 men   9     Daniel Christian 27     20 Dec 2018 - 27 Jan 2020    30   382  21.2  124.    20  8.02  27.2
10 8 Feb '20 men   10    Rashid Khan      26     19 Dec 2018 - 01 Feb 2020    29   217  14.5  158.    38  6.65  19.5
# ... with 537 more rows
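
The Ave.1 column in the output above comes from the make.unique call in stats_table: the all-rounders table appears to have two columns headed Ave, and the second one is renamed so that the names stay distinct. A minimal base R sketch of that renaming, assuming a header vector typed in by hand from the output above rather than scraped:

```r
# Column headers as they appear in the all-rounders output
# (typed by hand: an assumption for illustration, not scraped here).
headers <- c("Dates", "M", "R", "Ave", "SR", "W", "Econ", "Ave")

# make.unique() leaves the first occurrence alone and appends .1, .2, ...
# to later duplicates.
unique_headers <- make.unique(headers)
print(unique_headers)
```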

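Q&A

How do the dates work?

The new code scrapes both the links and the dates from the same timeline on the site. The link is the href attribute; the date is the link text. See the timeline function. That way, I avoid having to derive the date from the URL.

[Image: timeline]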
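
For reference, the original error can be reproduced without any scraping. The sketch below uses base R and the two URL shapes from the question. The helper names last3 and label_to_date are made up for illustration, and label_to_date assumes the labels look like "Oct '20" or "8 Feb '20" with English month abbreviations and years in the 2000s; it is only one possible way to turn the labels into real dates, not part of the code above.

```r
# The original read_rankings() built the date from the last three URL segments.
last3 <- function(url) {
  paste0(tail(strsplit(url, "/")[[1L]], 3L), collapse = "-")
}

daily   <- "http://bigbashboard.com/rankings/bbl/batsmen/2020/01/08"
monthly <- "http://bigbashboard.com/rankings/batsmen/2020/10"

last3(daily)    # "2020-01-08": a valid date string
last3(monthly)  # "batsmen-2020-10": not a date, so as.Date() fails

# Capture the error message instead of stopping.
parsed <- tryCatch(as.Date(last3(monthly)),
                   error = function(e) conditionMessage(e))

# Hypothetical helper: convert a timeline label to a Date.
# Assumes English month abbreviations and two-digit years in the 2000s.
label_to_date <- function(label) {
  parts <- strsplit(label, " ", fixed = TRUE)[[1L]]
  if (length(parts) == 2L) parts <- c("1", parts)  # month-only label: use day 1
  day   <- as.integer(parts[1L])
  month <- match(parts[2L], month.abb)
  year  <- 2000L + as.integer(sub("'", "", parts[3L], fixed = TRUE))
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

label_to_date("Oct '20")    # 2020-10-01
label_to_date("8 Feb '20")  # 2020-02-08
```

Using match against month.abb rather than the %b format avoids depending on the session locale.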

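Why am I getting this error: Can't recycle `Date` (size 200) to match `..3` (size 190)?

Because there are tables like the one below (see also this link):

[Image: mismatch]

This contradicts your description, according to which the rankings list and the stats table always have the same number of rows.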
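
One way to keep a single mismatched page from aborting the whole run is to compare row counts before binding. The sketch below is not part of the answer's code: it uses base R data frames with made-up values in place of the scraped tables, and bind_if_aligned is a hypothetical helper name.

```r
# Bind two tables column-wise only when their row counts agree;
# otherwise warn and return NULL so the caller can skip that page.
bind_if_aligned <- function(rankings, stats, label) {
  if (nrow(rankings) != nrow(stats)) {
    warning(sprintf("Skipping %s: %d ranking rows vs %d stats rows",
                    label, nrow(rankings), nrow(stats)))
    return(NULL)
  }
  cbind(Date = label, rankings, stats)
}

# Made-up stand-ins for the output of batsmen(.) and stats_table(.).
rankings    <- data.frame(Rank = 1:3, Name = c("A", "B", "C"))
stats_ok    <- data.frame(I = c(47, 15, 33))
stats_short <- data.frame(I = c(47, 15))

good <- bind_if_aligned(rankings, stats_ok, "Oct '20")
bad  <- suppressWarnings(bind_if_aligned(rankings, stats_short, "Sep '20"))
```

Here good has three rows with columns Date, Rank, Name and I, while bad is NULL. Whether skipping a page is acceptable, or whether the missing rows should be matched up by player name instead, depends on what the data is needed for.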

