Web scraping with rvest: problem with a basketball players table

Problem description

I can't read the data properly from the URL https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts. This is my code:

library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts"
pagina <- read_html(url, as.data.frame=T, stringsAsFactors = TRUE, 
                encoding = "utf-8")
pagina %>%  
  html_nodes("table") %>% 
  .[[1]] %>% 
  html_table(fill=T) -> x

This reads the table, but I don't know why it inserts several rows like these:

    Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
54  Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
77  Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
102 Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
133 Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
162 Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
189 Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS
218 Rk  Player  Pos Age Tm  G   GS  MP  FG  FGA FG% 3P  3PA 3P% 2P  2PA 2P% eFG%    FT  FTA FT% ORB DRB TRB AST STL BLK TOV PF  PTS

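I get the player rows, but I also get those rows. I don't know whether they are players that were not read correctly or just rows inserted at random because of a mistake in my code. I would like either to delete these rows (as you can see, they appear at irregular positions) or to change the reading code so that I don't get them at all.

Thanks in advance.

Alberto

Tags: html, r, rvest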

Solution


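Those rows are not misread players. They are copies of the table's header row, which the page repeats at intervals inside the table body, and html_table() returns each copy as an ordinary row. Drop them and keep only the player rows. Because the header copies are text, every column is read as character, so convert the column types afterwards.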

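library(rvest)
library(dplyr)

# The "#totals_stats::pts" fragment is not needed to fetch the page
url <- "https://www.basketball-reference.com/leagues/NBA_2020_totals.html"
webpage <- read_html(url)

webpage %>%
  html_table() %>%                # every table on the page, as a list
  .[[1]] %>%                      # the first table holds the totals
  filter(!grepl("Rk", Rk)) %>%    # drop the repeated header rows
  type.convert(as.is = TRUE)      # turn the text columns back into numbers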


#   Rk                   Player Pos Age  Tm  G GS   MP  FG  FGA   FG% ...
#1   1             Steven Adams   C  26 OKC 58 58 1564 262  443 0.591 ...
#2   2              Bam Adebayo  PF  22 MIA 65 65 2235 408  719 0.567 ...
#3   3        LaMarcus Aldridge   C  34 SAS 53 53 1754 391  793 0.493 ...
#4   4 Nickeil Alexander-Walker  SG  21 NOP 41  0  501  77  227 0.339 ...
#5   5            Grayson Allen  SG  24 MEM 30  0  498  79  176 0.449 ...
#6   6            Jarrett Allen   C  21 BRK 64 58 1647 267  413 0.646 ...
#7   7             Kadeem Allen  SG  27 NYK 10  0  117  19   44 0.432 ...
#8   8          Al-Farouq Aminu  PF  29 ORL 18  2  380  25   86 0.291 ...
#9   9          Justin Anderson  SF  26 BRK  3  0   17   1    6 0.167 ...
#10 10            Kyle Anderson  PF  26 MEM 59 20 1140 138  280 0.493 ...
#11 11            Ryan Anderson  PF  31 HOU  2  0   14   2    7 0.286 ...
#...
#...
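
If the table has already been scraped into `x` as in the question, the same cleanup can also be done in base R without dplyr. The sketch below is only an illustration: the small data frame is a made-up stand-in for the scraped table (the player names and numbers are hypothetical), and it assumes the repeated header rows are the only rows whose `Rk` value is the literal string "Rk".

```r
# Hypothetical stand-in for the scraped table `x`: all columns are text
# because a copy of the header row sits among the data rows.
x <- data.frame(
  Rk     = c("1", "2", "Rk", "3"),
  Player = c("Player A", "Player B", "Player", "Player C"),
  PTS    = c("10", "20", "PTS", "30"),
  stringsAsFactors = FALSE
)

# Keep only the rows that are not a copy of the header
clean <- x[x$Rk != "Rk", ]

# Re-detect column types now that the stray text rows are gone
clean <- type.convert(clean, as.is = TRUE)

# Reset the row names so they run 1..n again
rownames(clean) <- NULL

str(clean)
```

After this, `Rk` and `PTS` are integer columns and `Player` stays character, so the table can be sorted or summarised numerically.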
