R: Problem getting the selector when web scraping multiple pages

Problem description

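I am trying to scrape wine ratings (points) across multiple pages, but unfortunately I am having trouble with the selector (I tried SelectorGadget without success).

So far I have only managed to scrape a single page: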

library(rvest)
points <- read_html("https://www.winemag.com/buying-guide/lagar-de-bezana-2014-aluvion-ensamblaje-red-cachapoal-valley/")

points %>% 
  html_node(".rating") %>%
  html_text() 

[1] "93points"

For multiple pages, the results are not the real values (every page returns the same ratings):

library(rvest)

points <- lapply(paste0('https://www.winemag.com/?s=chile&search_type=all', 1:5),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".rating") %>% 
                        html_text()
                })
points

[[1]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[2]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[3]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[4]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[5]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

Tags: r, web-scraping

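Solution

This solution seems to work. I changed the way the URLs are built, so that the page number is passed in its own `page` parameter: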

library(rvest)

points <- lapply(paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all'),
                 function(url){
                   url %>% read_html() %>% 
                     html_nodes(".rating") %>% 
                     html_text()
                 })
points
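
To see why the loop in the question returned the same six ratings for every page, it can help to print the URLs before requesting them. The following is a minimal sketch in base R with no network access. The names `base_url`, `question_urls` and `paged_urls` are illustrative and not from the original post, and it assumes that `page` is the parameter the site uses to select a results page.

```r
# URLs as built in the question: the page number is glued onto the
# search_type value, so no page parameter is ever sent.
base_url <- "https://www.winemag.com/"
question_urls <- paste0(base_url, "?s=chile&search_type=all", 1:5)

# URLs with an explicit page parameter, as in the answer.
paged_urls <- paste0(base_url, "?s=chile&drink_type=wine&page=", 1:5)

print(question_urls[2])
print(paged_urls[2])

# TRUE if any of the question's URLs carries a page parameter.
has_page_param <- any(grepl("page=", question_urls, fixed = TRUE))
print(has_page_param)
```

If the question's URLs differ only in an unrecognised `search_type` value, the site presumably serves the first results page each time, which would explain the identical output.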

Personally I would write it like this, although that is certainly a matter of preference:

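library(rvest)
library(dplyr)  # tibble(), rowwise(), mutate()
library(tidyr)  # unnest()

# One row per results page; the page number goes in the `page` parameter
df <- tibble(url = paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all')) %>%
  rowwise() %>%
  mutate(
    # Scrape all ratings on the page and store them as a list-column
    rating = read_html(url) %>% 
      html_nodes(".rating") %>%
      html_text() %>%
      list()
  ) %>%
  # Expand the list-column to one row per rating
  unnest(cols = c(rating))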
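
The scraped ratings are still text such as "93 Points". If numeric scores are needed, one possible follow-up step is to strip the non-digit characters. This is only a sketch in base R: the sample strings are copied from the output shown in the question, and `ratings_text` and `ratings_num` are illustrative names.

```r
# Example strings copied from the output shown in the question.
ratings_text <- c("93 Points", "92 Points", "93points")

# Drop every non-digit character, then convert to integer.
ratings_num <- as.integer(gsub("[^0-9]", "", ratings_text))

print(ratings_num)
```

The same expression could be applied to the `rating` column inside a `mutate()` call.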
