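r - R: web scraping returns wrong product prices
Problem description
Can someone take a look at why my code returns the wrong product prices?
For example, consider this TV:
TELEVISOR HISENSE LED ULTRA HD 4K 55" SMART TV 55A6GSV
- Precio antes (NORMAL) on the web page: S/ 2,299
- Precio antes (NORMAL) in my results: S/ 2,999
- Precio actual (INTERNET) on the web page: S/ 1,699
- Precio actual (INTERNET) in my results: S/ 2,199
- Precio tarjeta on the web page: N/A
- Precio tarjeta in my results: S/ 1,999
Code: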
library(rvest)
library(purrr)
library(tidyverse)
urls <- list("https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
             "https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2")
h <- urls %>% map(read_html) # scrape once, parse as necessary
m <- h %>% map_df(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text
  tibble(
    periodo = lubridate::year(Sys.Date()),
    fecha = Sys.Date(),
    ecommerce = "ripley",
    producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
    precio.antes = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual) == 0, NA, r.precio.actual),
    precio.tarjeta = ifelse(length(r.precio.tarjeta) == 0, NA, r.precio.tarjeta)
  )
})
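Solution
The problem seems to be that ifelse requires all arguments to be of the same length. Here the test has length 1 while the 'no' case has length greater than 1, so ifelse returns only the first element of 'no', and tibble then recycles that single value across every row. It is better to use if/else and return a list, because data.frame/tibble requires all columns to have the same length.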
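The truncation can be seen in isolation with base R only. This is a minimal sketch: the `prices` vector, the product labels, and the variable names are made up for illustration and are not scraped data.

```r
# Toy vector standing in for the scraped prices (illustrative values only)
prices <- c("S/ 2,999", "S/ 4,099", "S/ 3,199")

# ifelse() returns a result shaped like its test. The test here has
# length 1, so only the first element of the 'no' argument comes back.
res_ifelse <- ifelse(length(prices) == 0, NA, prices)
print(res_ifelse)
# [1] "S/ 2,999"

# if/else evaluates a single condition and returns the whole vector.
res_if <- if (length(prices) == 0) NA else prices
print(res_if)
# [1] "S/ 2,999" "S/ 4,099" "S/ 3,199"

# A length-1 column is silently recycled, so every row shows the same price.
recycled <- data.frame(producto = c("TV A", "TV B", "TV C"),
                       precio.antes = res_ifelse)
print(recycled)
```

Every row of `recycled` carries the first price, which matches the symptom in the question: each product was reported with the prices of the first product on the page.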
m <- h %>% map(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text
  r.precio.antes <- if(length(r.precio.antes) == 0) NA else r.precio.antes
  r.precio.actual <- if(length(r.precio.actual) == 0) NA else r.precio.actual
  r.precio.tarjeta <- if(length(r.precio.tarjeta) == 0) NA else r.precio.tarjeta
  list(
    periodo = lubridate::year(Sys.Date()),
    fecha = Sys.Date(),
    ecommerce = "ripley",
    producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
    precio.antes = r.precio.antes,
    precio.actual = r.precio.actual,
    precio.tarjeta = r.precio.tarjeta)
})
- Check the length of each element of the nested list:
map(m, lengths)
[[1]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta
             1              1              1             48             44             48             18

[[2]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta
             1              1              1             46             45             46              2
One option may be:
library(dplyr)
library(purrr)
library(tidyr)
library(data.table)
out <- h %>%
  map_dfr(~ html_nodes(.x, ".catalog-product-details__name, .catalog-prices__list-price, .catalog-prices__offer-price, .catalog-prices__card-price") %>%
    {
      # price nodes carry a title attribute; product-name nodes do not
      tibble(col1 = html_attr(., "title"), col2 = html_text(.)) %>%
        mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
        # each product name starts a new group, so its prices stay attached to it
        mutate(grp = cumsum(col1 == "product")) %>%
        pivot_wider(names_from = col1, values_from = col2) %>%
        select(-grp)
    })
- Output
> out
# A tibble: 94 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR LG LED ULTRA HD 4K 50\" SMART TV THINQ AI 50UP7750PSB (2021)" S/ 2,999 S/ 2,199 "S/ 1,999 "
2 "TELEVISOR SAMSUNG LED CRYSTAL ULTRA HD 4K SMART TV 65\" UN65AU7000GXPE" S/ 4,099 S/ 2,699 "S/ 2,499 "
3 "TELEVISOR SAMSUNG CRYSTAL ULTRA HD 4K 58'' SMART TV UN58AU7000GXPE" S/ 3,199 S/ 2,399 "S/ 2,299 "
4 "TELEVISOR LG OLED ULTRA HD 4K 48\" SMART TV THINQ AI OLED48A1PSA (2021)" S/ 4,799 S/ 3,699 "S/ 3,499 "
5 "TELEVISOR SAMSUNG QLED LIFESTYLE THE FRAME 55\" LS03A QLED 4K" S/ 4,899 S/ 3,999 <NA>
6 "TELEVISOR TCL QLED ULTRA HD 4K 65\" SMART TV 65C715" S/ 3,499 S/ 3,199 "S/ 2,999 "
7 "TELEVISOR LG LED ULTRA HD 4K 43\" SMART TV THINQ AI 43UP7700PSB (2021)" S/ 2,299 S/ 1,899 "S/ 1,799 "
8 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299 S/ 1,699 <NA>
9 "TELEVISOR AOC LED ULTRA HD 4K 50\" SMART TV LE50U6305" S/ 2,299 S/ 1,749 "S/ 1,649 "
10 "TELEVISOR LG LED ULTRA HD 4K 60\" SMART TV THINQ AI 60UP7750PSB (2021)" S/ 3,899 S/ 3,199 "S/ 3,099 "
# … with 84 more rows
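The grouping step behind this output can be tried on a toy table. This is a minimal sketch, assuming the nodes arrive in document order with each product name ahead of its own prices; the rows in `toy` are invented, and `if_else` is used here in place of `case_when` purely for brevity.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the scraped nodes, in document order (made-up values).
# Product names have no title attribute, so col1 is NA for them.
toy <- tibble(
  col1 = c(NA, "Precio Normal", "Precio Internet", "Precio Ripley",
           NA, "Precio Normal", "Precio Internet"),
  col2 = c("TV A", "S/ 2,999", "S/ 2,199", "S/ 1,999",
           "TV B", "S/ 2,299", "S/ 1,699")
)

toy_wide <- toy %>%
  mutate(col1 = if_else(is.na(col1), "product", col1),
         # the counter increases at every product row, tagging its prices
         grp = cumsum(col1 == "product")) %>%
  pivot_wider(names_from = col1, values_from = col2) %>%
  select(-grp)

print(toy_wide)
```

`toy_wide` has one row per product, and "TV B" gets `NA` under `Precio Ripley` because no card-price row falls inside its group. That is how a missing price stays missing instead of shifting the later values up.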
- Checking the product mentioned in the OP's comment
> out %>%
filter(product == "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)")
# A tibble: 1 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)" S/ 24,999 S/ 8,999 <NA>
which is the same as on the web page.
Or the second product shown in the OP's post:
> out %>%
filter(str_detect(product, "55A6GSV"))
# A tibble: 1 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299 S/ 1,699 <NA>
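A different technique that also keeps prices aligned is to select one node per product first and then look up each field inside it, since `html_element()` returns exactly one result per input node and a missing match becomes `NA`. The sketch below is not part of the answer above and runs on invented markup: the `card` wrapper class and the `field()` helper are hypothetical, and the real page's wrapper selector would have to be found by inspecting its HTML. It needs rvest 1.0.0 or later.

```r
library(rvest)
library(tibble)

# Hypothetical markup: the "card" wrapper class is invented for illustration.
# Only the inner class names are taken from the question.
page <- minimal_html('
  <div class="card">
    <div class="catalog-product-details__name">TV A</div>
    <div class="catalog-prices__list-price">S/ 2,999</div>
    <div class="catalog-prices__offer-price">S/ 2,199</div>
    <div class="catalog-prices__card-price">S/ 1,999</div>
  </div>
  <div class="card">
    <div class="catalog-product-details__name">TV B</div>
    <div class="catalog-prices__list-price">S/ 2,299</div>
    <div class="catalog-prices__offer-price">S/ 1,699</div>
  </div>
')

# One node per product
cards <- html_elements(page, ".card")

# Hypothetical helper: one value per card, NA where the field is absent
field <- function(css) html_text(html_element(cards, css), trim = TRUE)

per_card <- tibble(
  producto       = field(".catalog-product-details__name"),
  precio.antes   = field(".catalog-prices__list-price"),
  precio.actual  = field(".catalog-prices__offer-price"),
  precio.tarjeta = field(".catalog-prices__card-price")
)

print(per_card)
```

All four columns have the same length by construction, so no recycling or regrouping is needed.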