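r - Scraping from transfermarkt with the R package rvest
Problem description
I am learning to scrape data and I am working with transfermarkt, but today I ran into two problems. I used SelectorGadget to find the selectors. My code looks like this: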
library(rvest)
url <- "https://www.transfermarkt.es/fc-granada/startseite/verein/16795"
webpage <- read_html(url)
players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
players <- html_text(players_html)
players
valores_html <- html_nodes(webpage,'.rechts.hauptlink')
valores <- html_text(valores_html)
valores
valores <- gsub(" miles €","000", valores)
valores <- gsub(" mill. €","0000", valores)
valores
valores <- gsub(",","",valores)
valores <- gsub(" ","", valores)
valores
The first problem comes up when selecting the players. This is the output.
> players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
> players <- html_text(players_html)
> players
character(0)
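I think the problem is in the CSS selector, but it is the one SelectorGadget showed me when I selected the players, so I don't know how to fix it.
The other problem is with selecting their market values. gsub does not remove the trailing whitespace, which has to go before the strings can be converted to numbers. This is the output: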
> valores_html <- html_nodes(webpage,'.rechts.hauptlink')
> valores <- html_text(valores_html)
> valores
 [1] "700 miles € " "300 miles € " "800 miles € " "500 miles € " "300 miles € "
 [6] "300 miles € " "1,00 mill. € " "300 miles € " "1,20 mill. € " "500 miles € "
[11] "1,70 mill. € " "1,50 mill. € " "1,00 mill. € " "800 miles € " "800 miles € "
[16] "300 miles € " "2,00 mill. € " "800 miles € " "700 miles € " "400 miles € "
[21] "700 miles € " "1,00 mill. € " "800 miles € "
> valores <- gsub(" miles €","000", valores)
> valores <- gsub(" mill. €","0000", valores)
> valores
 [1] "700000 " "300000 " "800000 " "500000 " "300000 " "300000 " "1,000000 "
 [8] "300000 " "1,200000 " "500000 " "1,700000 " "1,500000 " "1,000000 " "800000 "
[15] "800000 " "300000 " "2,000000 " "800000 " "700000 " "400000 " "700000 "
[22] "1,000000 " "800000 "
> valores <- gsub(",","",valores)
> valores <- gsub(" ","", valores)
> valores
 [1] "700000 " "300000 " "800000 " "500000 " "300000 " "300000 " "1000000 " "300000 "
 [9] "1200000 " "500000 " "1700000 " "1500000 " "1000000 " "800000 " "800000 " "300000 "
[17] "2000000 " "800000 " "700000 " "400000 " "700000 " "1000000 " "800000 "
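Basically, the last gsub, which is meant to remove the trailing whitespace, has no effect here. Can anyone help me with these two problems?
PS: I am using transfermarkt in Spanish.
Solution
As for gsub, we can use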
valores <- html_text(valores_html)
valores <- gsub(" miles €", "000", valores)
valores <- gsub(" mill. €", "0000", valores)
valores <- gsub("\\D", "", valores)
valores
# [1] "700000" "300000" "800000" "500000" "300000" "300000" "1000000" "300000" "1200000"
# [10] "500000" "1700000" "1500000" "1000000" "800000" "800000" "300000" "2000000" "800000"
# [19] "700000" "400000" "700000" "1000000" "800000"
where \\D matches anything other than a digit.
For the player names, we can write
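Why the original gsub(" ", "", valores) does nothing is not stated above. A likely explanation, and it is only an assumption here, is that the character after the euro sign is a non-breaking space (U+00A0) rather than an ordinary space. The sketch below uses two made-up sample strings built on that assumption to show the difference, and then converts the cleaned strings to numbers.

```r
# Sample values. "\u20ac" is the euro sign; "\u00a0" is a non-breaking space,
# which is an ASSUMPTION about what the page puts after the euro sign.
valores <- c("700 miles \u20ac\u00a0", "1,20 mill. \u20ac\u00a0")

# A plain space pattern only matches U+0020, so U+00A0 survives this call.
still_dirty <- gsub(" ", "", valores)
grepl("\u00a0", still_dirty)

# fixed = TRUE treats the pattern literally, so the "." in "mill." is not
# interpreted as the regex wildcard.
valores <- gsub(" miles \u20ac", "000", valores, fixed = TRUE)
valores <- gsub(" mill. \u20ac", "0000", valores, fixed = TRUE)

# \\D removes every remaining non-digit: the comma and the trailing character.
valores <- gsub("\\D", "", valores)

valores_num <- as.numeric(valores)
valores_num
```

Note that appending "0000" for millions only gives the right result when the value is written with exactly two decimals, as in the output shown in the question.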
players_html <- html_nodes(webpage,"#yw1 span.hide-for-small a.spielprofil_tooltip")
players <- html_text(players_html)
players
# [1] "Rui Silva" "Aarón Escandell" "Bernardo Cruz"
# [4] "José Antonio Martínez" "Germán Sánchez" "Pablo Vázquez"
# [7] "Álex Martínez" "Adrián Castellano" "Víctor Díaz"
# [10] "Quini" "Nicolás Aguirre" "Fede San Emeterio"
# [13] "Ángel Montoro" "Fran Rico" "Alberto Martín"
# [16] "José Antonio González" "Alejandro Pozo" "Antonio Puertas"
# [19] "Fede Vico" "Daniel Ojeda" "Álvaro Vadillo"
# [22] "Adrián Ramos" "Rodri"
This way we also get only one set of (full) names. Using "#yw1 a.spielprofil_tooltip", for example, would also return their short versions.
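The difference between the selectors can be tried out on a small inline document. The markup below is hypothetical and heavily simplified (the show-for-small class and the short names are assumptions, not copied from the site); it only mimics the idea that each player link appears twice, once in full and once shortened. The tooltipstered class is most likely added by a JavaScript tooltip plugin in the browser, which would explain why it shows up in SelectorGadget but not in the HTML that read_html() downloads; that too is an assumption.

```r
library(rvest)

# HYPOTHETICAL markup: each player is linked twice, once with the full name
# and once with a short name. The real page may be structured differently.
html <- '
<div id="yw1">
  <span class="hide-for-small"><a class="spielprofil_tooltip" href="#">Rui Silva</a></span>
  <span class="show-for-small"><a class="spielprofil_tooltip" href="#">R. Silva</a></span>
  <span class="hide-for-small"><a class="spielprofil_tooltip" href="#">Fran Rico</a></span>
  <span class="show-for-small"><a class="spielprofil_tooltip" href="#">F. Rico</a></span>
</div>'
page <- read_html(html)

# The broad selector returns both versions of every name.
all_names <- html_text(html_nodes(page, "#yw1 a.spielprofil_tooltip"))

# Restricting to span.hide-for-small keeps only the full names.
full_names <- html_text(html_nodes(page, "#yw1 span.hide-for-small a.spielprofil_tooltip"))

# A class that exists only after JavaScript runs matches nothing in static HTML.
js_only <- html_text(html_nodes(page, "#yw1 .tooltipstered"))

all_names
full_names
js_only
```

When a selector from SelectorGadget returns character(0), comparing it against the raw source of the page (rather than the rendered page) is a quick way to check whether the class is present before any script runs.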