Scraping Transfermarkt with rvest

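Question

I'm having trouble scraping Transfermarkt. I want to collect data for the top 5 European leagues (Premier League, La Liga, Serie A, Ligue 1, Bundesliga) over the past 20 seasons. For each transfer I'd like a range of details: player name, age, position, the club joined, the club left, and the fee. But even with this very basic code, which covers only a single page of 18/19 Premier League transfers and collects just the team and the name of each addition, I get an error I don't understand. I have also been using SelectorGadget.

My code: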

require(rvest)

page = "https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2012&s_w=&leihe=0&leihe=1&intern=0"

scraped_page <- read_html(page)

Team_html  = html_nodes(page, ".tooltipstered+ .tooltipstered") 
Team = html_text(Team_html)
Addition_html = html_nodes(page, ".table-header+ .responsive-table .spielprofil_tooltip")
Addition = html_text(Addition_html)


df <- data.frame(Team, Addition)

head(df)

What R returns:

> page = "https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2012&s_w=&leihe=0&leihe=1&intern=0"
> 
> scraped_page <- read_html(page)
> 
> Team_html  = html_nodes(page, ".tooltipstered+ .tooltipstered") 
Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"
> Team = html_text(Team_html)
> Addition_html = html_nodes(page, ".table-header+ .responsive-table .spielprofil_tooltip")
Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"
> Addition = html_text(Addition_html)
> 
> 
> df <- data.frame(Team, Addition)
Error in data.frame(Team, Addition) : 
  arguments imply differing number of rows: 0, 922
> 
> head(df)

1 function (x, df1, df2, ncp, log = FALSE)    
2 {                                           
3     if (missing(ncp))                       
4         .Call(C_df, x, df1, df2, log)       
5     else .Call(C_dnf, x, df1, df2, ncp, log)
6 }                     

I was thinking of starting from here and then using gsub and a few other commands to loop over the years and leagues...

Tags: r, web-scraping, rvest

Answer

The main problem you are running into is that

 Team_html  = html_nodes(page, ".tooltipstered+ .tooltipstered") 

should be

 Team_html  = html_nodes(scraped_page, ".tooltipstered+ .tooltipstered") 
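The error message points at the same thing: `html_nodes()` ends up calling `xml_find_all()`, which has no method for a character string, and `page` is still just the URL. Here is a minimal offline sketch of the difference, using a small inline HTML snippet. The markup and the `.club` class are made up for illustration and are not Transfermarkt's.

```r
library(rvest)

# A tiny stand-in page; the markup and class name are illustrative only
html_string <- "<html><body><a class='club'>Arsenal</a><a class='club'>Fulham</a></body></html>"

# read_html() turns the string (or a URL) into a parsed document
parsed <- read_html(html_string)

# Selecting from the parsed document works
clubs <- html_text(html_nodes(parsed, ".club"))

# Selecting from the raw character string fails, as in the question;
# tryCatch() captures the error message instead of stopping the script
raw_attempt <- tryCatch(
  html_nodes(html_string, ".club"),
  error = function(e) conditionMessage(e)
)

print(clubs)
print(raw_attempt)
```

The follow-on `data.frame` error in the question has the same root cause: the selections never ran against the parsed page.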

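Also, I don't think you have specified the selectors correctly. I think you may want to do something like this...

Update

Potential solution 1:

Scrape each team's table separately, then add the team name manually before stacking the data. In the code below I do this for the first 5 teams.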

in1<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(4) > div:nth-child(2) > table') %>% html_table()
in2<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(5) > div:nth-child(2) > table') %>% html_table()
in3<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(6) > div:nth-child(2) > table') %>% html_table()
in4<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(7) > div:nth-child(2) > table') %>% html_table()
in5<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(8) > div:nth-child(2) > table') %>% html_table()

out1<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(4) > div:nth-child(4) > table') %>% html_table()
out2<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(5) > div:nth-child(4) > table') %>% html_table()
out3<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(6) > div:nth-child(4) > table') %>% html_table()
out4<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(7) > div:nth-child(4) > table') %>% html_table()
out5<-html_nodes(scraped_page, '#main > div:nth-child(13) > div.large-8.columns > div:nth-child(8) > div:nth-child(4) > table') %>% html_table()

in1<-in1[[1]]
in2<-in2[[1]]
in3<-in3[[1]]
in4<-in4[[1]]
in5<-in5[[1]]

out1<-out1[[1]]
out2<-out2[[1]]
out3<-out3[[1]]
out4<-out4[[1]]
out5<-out5[[1]]

in1$team<-"Arsenal"
in2$team<-"Man U"
in3$team<-"West Brom"
in4$team<-"Fulham"
in5$team<-"Newcastle"


out1$team<-"Arsenal"
out2$team<-"Man U"
out3$team<-"West Brom"
out4$team<-"Fulham"
out5$team<-"Newcastle"


ins<-rbind(in1,in2,in3,in4,in5)
outs<-rbind(out1,out2,out3,out4,out5)
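The repeated selector lines above can also be generated in a loop. The sketch below shows the pattern on a small stand-in document so it runs offline. `scrape_team_table()` is a hypothetical helper, and the markup, team names and selector template are made up for illustration.

```r
library(rvest)

# Stand-in document: two team blocks, each holding one table (illustrative markup)
doc <- read_html(paste0(
  "<div id='main'>",
  "<div><table><tr><th>Player</th><th>Fee</th></tr><tr><td>Player A</td><td>1m</td></tr></table></div>",
  "<div><table><tr><th>Player</th><th>Fee</th></tr><tr><td>Player B</td><td>2m</td></tr></table></div>",
  "</div>"
))

# Hypothetical helper: fill the block position into a selector template,
# read the matching table, and tag every row with the team name
scrape_team_table <- function(doc, block, team, template) {
  selector <- sprintf(template, block)
  tbl <- html_table(html_nodes(doc, selector))[[1]]
  tbl$team <- team
  tbl
}

template <- "#main > div:nth-child(%d) > table"
blocks <- 1:2
teams <- c("Team A", "Team B")

# Scrape each block with its team name, then stack the results
ins <- do.call(rbind, Map(
  function(b, t) scrape_team_table(doc, b, t, template),
  blocks, teams
))

print(ins)
```

On the real page, and assuming the layout still matches the selectors in this answer, the template for arrivals would be `"#main > div:nth-child(13) > div.large-8.columns > div:nth-child(%d) > div:nth-child(2) > table"` with `blocks <- 4:8`. For departures, the inner `div:nth-child(2)` becomes `div:nth-child(4)`.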

Potential solution 2:

This solution does not keep the team names, but it stitches the data together more efficiently.

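# Pull the text of every cell in every transfer table
tab <- html_nodes(scraped_page, ".responsive-table td") %>% html_text()
# Each table row has 9 cells, so the column position cycles 1..9
# (assumption: the undefined `index` was meant to be this cycle)
index <- rep(1:9, length.out = length(tab))
temp <- data.frame(value = tab, index = index, stringsAsFactors = FALSE)
# Number of rows: 981 for this page when the answer was written
n_rows <- length(tab) / 9
df <- as.data.frame(matrix(NA_character_, nrow = n_rows, ncol = 9), stringsAsFactors = FALSE)
names(df) <- paste0("x", 1:9)
for (i in 1:9) {
  df[, i] <- temp$value[temp$index == i]
}
head(df)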
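The same reshaping can be done without the index column and the loop, because filling a matrix row by row groups every 9 consecutive cells into one row. This is a minimal sketch on a made-up vector of cell texts. The values are placeholders, and 9 cells per row is taken from the code above.

```r
# Placeholder cell texts standing in for html_text() output: 2 rows of 9 cells
tab <- as.character(1:18)

n_col <- 9  # cells per table row, as in the loop above

# Guard against a page whose cell count is not a multiple of 9
stopifnot(length(tab) %% n_col == 0)

# byrow = TRUE fills row by row, so cells 1..9 form row 1 and cells 10..18 row 2
df <- as.data.frame(matrix(tab, ncol = n_col, byrow = TRUE), stringsAsFactors = FALSE)
names(df) <- paste0("x", seq_len(n_col))

print(df)
```

The guard matters on a live page: if some tables have a different number of columns, the cells would silently shift between columns.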
