R: Web scraping returns the wrong product prices

Problem description

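Could someone take a look at why my code returns the wrong product prices?

For example, take this TV:

TELEVISOR HISENSE LED ULTRA HD 4K 55" SMART TV 55A6GSV

- Precio antes (NORMAL) on the web page: S/ 2,299
- Precio antes (NORMAL) in my result: S/ 2,999

- Precio actual (INTERNET) on the web page: S/ 1,699
- Precio actual (INTERNET) in my result: S/ 2,199

- Precio tarjeta on the web page: not available
- Precio tarjeta in my result: S/ 1,999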


Code:

library(rvest)
library(purrr)
library(tidyverse)

urls <- list("https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
             "https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2")



h <- urls %>% map(read_html)    # scrape once, parse as necessary

m <- h %>% map_df(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text 
  
  
  tibble(
    periodo = lubridate::year(Sys.Date()),
    fecha = Sys.Date(),
    ecommerce = "ripley",
    producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
    precio.antes = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual) == 0, NA,  r.precio.actual),
    precio.tarjeta = ifelse(length(r.precio.tarjeta) == 0, NA,  r.precio.tarjeta)
  )})

Tags: r, purrr, rvest

Solution


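The issue seems to be that ifelse returns a result with the same length as its test argument. Here the test, length(r.precio.antes) == 0, has length 1, while the 'no' value has length greater than 1, so only the first scraped price is kept and then recycled onto every row of the tibble. That is why every product shows the prices of the first TV on the page (S/ 2,999, S/ 2,199, S/ 1,999). It is better to use if/else, and to return a list, because data.frame/tibble requires all columns to have the same length.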
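The difference between ifelse and if/else can be seen in isolation. This is a minimal sketch using a made-up price vector (the name prices and its values are illustrative, not scraped data):

```r
# Hypothetical scraped prices for three products
prices <- c("S/ 2,999", "S/ 4,099", "S/ 3,199")

# ifelse() is vectorised over its test: a length-1 test gives a length-1 result
with_ifelse <- ifelse(length(prices) == 0, NA, prices)

# if/else evaluates the condition once and returns the chosen value whole
with_if <- if (length(prices) == 0) NA else prices

print(with_ifelse)      # only the first price survives
print(length(with_if))  # all three prices are kept
```

Inside tibble(), the single value from ifelse is then recycled to match the producto column, which would explain why every row ends up with the same price.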

m <- h %>% map(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text 
  
  r.precio.antes <- if(length(r.precio.antes) == 0) NA else r.precio.antes
  r.precio.actual <- if(length(r.precio.actual) == 0) NA else r.precio.actual
  r.precio.tarjeta <- if(length(r.precio.tarjeta) == 0) NA  else r.precio.tarjeta
 
 list(
      periodo = lubridate::year(Sys.Date()),
      fecha = Sys.Date(),
      ecommerce = "ripley",
      producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
      precio.antes =r.precio.antes, precio.actual = r.precio.actual, precio.tarjeta = r.precio.tarjeta)
  })

- Check the length of each element of the nested list

map(m, lengths)
[[1]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta 
             1              1              1             48             44             48             18 

[[2]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta 
             1              1              1             46             45             46              2 
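These counts show why the vectors cannot simply be placed side by side: there are fewer prices than products. A small sketch with made-up vectors (the names and values below are assumptions for illustration) shows what tibble() does in that situation:

```r
library(tibble)

# Hypothetical stand-ins for the scraped vectors: 4 products, only 3 list prices
producto <- c("TV A", "TV B", "TV C", "TV D")
precio_antes <- c("S/ 2,999", "S/ 4,099", "S/ 3,199")

# tibble() only recycles length-1 vectors, so mismatched lengths raise an error
result <- tryCatch(
  tibble(producto = producto, precio.antes = precio_antes),
  error = function(e) "error: incompatible column sizes"
)

print(result)
```

Padding the shorter vector with NA would not help either, because a missing price in the middle of the page shifts every later price onto the wrong product.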

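One option could be to select the name and price nodes together, so that they stay in page order, and start a new group at each product name: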

library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

out <- h %>%
  map_dfr(~ html_nodes(.x, ".catalog-product-details__name, .catalog-prices__list-price, .catalog-prices__offer-price, .catalog-prices__card-price") %>%
    {
      # price nodes carry a title attribute; product-name nodes do not
      tibble(col1 = html_attr(., "title"), col2 = html_text(.)) %>%
        mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
        # each product name starts a new group that its prices belong to
        mutate(grp = cumsum(col1 == "product")) %>%
        pivot_wider(names_from = col1, values_from = col2) %>%
        select(-grp)
    })

- Output

> out
# A tibble: 94 x 4
   product                                                                   `Precio Normal` `Precio Internet` `Precio Ripley`
   <chr>                                                                     <chr>           <chr>             <chr>          
 1 "TELEVISOR LG LED ULTRA HD 4K 50\" SMART TV THINQ AI 50UP7750PSB (2021)"  S/ 2,999        S/ 2,199          "S/ 1,999 "    
 2 "TELEVISOR SAMSUNG LED CRYSTAL ULTRA HD 4K SMART TV 65\" UN65AU7000GXPE"  S/ 4,099        S/ 2,699          "S/ 2,499 "    
 3 "TELEVISOR SAMSUNG CRYSTAL ULTRA HD 4K 58'' SMART TV UN58AU7000GXPE"      S/ 3,199        S/ 2,399          "S/ 2,299 "    
 4 "TELEVISOR LG OLED ULTRA HD 4K 48\" SMART TV THINQ AI OLED48A1PSA (2021)" S/ 4,799        S/ 3,699          "S/ 3,499 "    
 5 "TELEVISOR SAMSUNG QLED LIFESTYLE THE FRAME 55\" LS03A QLED 4K"           S/ 4,899        S/ 3,999           <NA>          
 6 "TELEVISOR TCL QLED ULTRA HD 4K 65\" SMART TV 65C715"                     S/ 3,499        S/ 3,199          "S/ 2,999 "    
 7 "TELEVISOR LG LED ULTRA HD 4K 43\" SMART TV THINQ AI 43UP7700PSB (2021)"  S/ 2,299        S/ 1,899          "S/ 1,799 "    
 8 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV"                 S/ 2,299        S/ 1,699           <NA>          
 9 "TELEVISOR AOC LED ULTRA HD 4K 50\" SMART TV LE50U6305"                   S/ 2,299        S/ 1,749          "S/ 1,649 "    
10 "TELEVISOR LG LED ULTRA HD 4K 60\" SMART TV THINQ AI 60UP7750PSB (2021)"  S/ 3,899        S/ 3,199          "S/ 3,099 "    
# … with 84 more rows
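The grouping step can be followed on a toy input. This is a sketch with hand-written rows standing in for the scraped nodes (the product names "TV A" and "TV B" are invented); it assumes, as above, that name nodes have no title attribute:

```r
library(dplyr)
library(tidyr)

# Hypothetical nodes in page order: each product name is followed by its prices
nodes <- tibble(
  col1 = c(NA, "Precio Normal", "Precio Internet", "Precio Ripley",
           NA, "Precio Normal", "Precio Internet"),
  col2 = c("TV A", "S/ 2,999", "S/ 2,199", "S/ 1,999",
           "TV B", "S/ 2,299", "S/ 1,699")
)

toy <- nodes %>%
  # rows without a title are product names
  mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
  # the running count of product rows gives every row its product's group id
  mutate(grp = cumsum(col1 == "product")) %>%
  # one row per group; a price type absent from a group becomes NA
  pivot_wider(names_from = col1, values_from = col2) %>%
  select(-grp)

print(toy)
```

Because "TV B" has no card price node, its Precio Ripley cell is filled with NA instead of borrowing a value from another product.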

- Checking the product from the OP's comment

> out %>% 
   filter(product == "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)")
# A tibble: 1 x 4
  product                                                              `Precio Normal` `Precio Internet` `Precio Ripley`
  <chr>                                                                <chr>           <chr>             <chr>          
1 "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)" S/ 24,999       S/ 8,999          <NA>           

This is the same as on the web page.


Or the second product shown in the OP's post

> out %>% 
   filter(str_detect(product, "55A6GSV"))
# A tibble: 1 x 4
  product                                                   `Precio Normal` `Precio Internet` `Precio Ripley`
  <chr>                                                     <chr>           <chr>             <chr>          
1 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299        S/ 1,699          <NA>         
