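r - R + rvest + polite for web scraping etsy.com
Question
I am completely new to web scraping, and I may be drowning in a teacup. I would like to automate the following:
- run the following query on etsy.com
https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery
i.e. simply look for "Christmas candle" on Etsy
- then retrieve the title and the description of each product separately, ideally passing the number of result pages to include as an input to my function or pipeline.
I had a look at the basic example at
https://github.com/dmi3kno/polite
but when I tried to adapt it to my needs (see the reprex at the end of the post), it returned exactly... nothing!
Can anyone point me in the right direction? Many thanks!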
library(polite)
library(rvest)
session <- bow("https://www.cheese.com/by_type", force = TRUE)
result <- scrape(session, query=list(t="semi-soft", per_page=100)) %>%
html_node("#main-body") %>%
html_nodes("h3") %>%
html_text()
result
#> [1] "3-Cheese Italian Blend" "Abbaye de Citeaux"
#> [3] "Abbaye du Mont des Cats" "Adelost"
#> [5] "ADL Brick Cheese" "Ailsa Craig"
#> [7] "Airedale" "Aisy Cendre"
#> [9] "Alpe di Frabosa" "Alpine Gold"
#> [11] "Alta Badia" "Amablu Blue cheese"
#> [13] "Ameribella" "American Cheese"
#> [15] "Ami du Chambertin" "Amsterdammer (British Columbia)"
#> [17] "Amul Pizza Mozzarella Cheese" "Anthotyro Fresco"
#> [19] "Aphrodite Haloumi " "Appalachian"
#> [21] "Applewood Smoked Chevre" "Ardrahan"
#> [23] "Armenian String Cheese" "Aromes au Gene de Marc"
#> [25] "Asher Blue" "Asiago Pressato DOP"
#> [27] "Aura" "Azeitao"
#> [29] "Baby Swiss" "Baluchon"
#> [31] "Bandal" "Basajo"
#> [33] "Basils Original Rauchkäse" "Baskeriu"
#> [35] "Basket Cheese" "Bassigny au porto"
#> [37] "Beaumont" "Beemster 2% Milk"
#> [39] "Bel Paese" "Bergere Bleue"
#> [41] "Bermuda Triangle" "Beyaz Peynir"
#> [43] "Bica de Queijo" "Bierkase"
#> [45] "Bijou" "Blarney Castle"
#> [47] "Bleu Bénédictin" "Bleu d'Auvergne"
#> [49] "Bleu Des Causses" "Bleu L'Ermite"
#> [51] "Blue Benedictine" "Blue Lupine"
#> [53] "Blue Rathgore" "Blue Vein (Australian)"
#> [55] "Blue Vein Cheese" "Blue Yonder"
#> [57] "Bocconcini" "Boivin Marbled Cheddar"
#> [59] "Bossa" "Boulder Chevre"
#> [61] "Brewer's Gold" "Brie de Melun"
#> [63] "Brillat-Savarin" "Brin"
#> [65] "Brin d'Amour" "Bruder Basil"
#> [67] "Brunost" "Brutal Blue"
#> [69] "Burwash Rose" "Buttercup"
#> [71] "Butterkase" "Buttermilk Blue Affinee"
#> [73] "Buttermilk Gorgonzola" "Caciobarricato"
#> [75] "Cacio De Roma®" "Caciotta"
#> [77] "Caciotta Al Tartufo" "Cacow Belle"
#> [79] "Calenzana (Calinzanincu)" "Cambozola Grand Noir"
#> [81] "Cameo" "Cana de Cabra"
#> [83] "Cape Vessey" "Capra al Fieno"
#> [85] "Capra Nouveau" "Cardo "
#> [87] "Carr Valley Glacier Wildfire Blue" "Casatica"
#> [89] "Casciotta di Urbino" "Cashel Blue"
#> [91] "Castelo Branco" "Castle Blue"
#> [93] "Celtic Promise" "Chabichou du Poitou"
#> [95] "Charolais" "Chaumes"
#> [97] "Chevre" "Chevre en Marinade"
#> [99] "Chile Caciotta" "Chile Jack"
## My naive attempt to adapt the code to etsy.com fails miserably
session_etsy <- bow("https://www.etsy.com", force = TRUE)
result_etsy <- scrape(session_etsy, query=list(t="Christmas candle", per_page=100)) %>%
html_node("#main-body") %>%
html_nodes("h3") %>%
html_text()
result_etsy
#> character(0)
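Created on 2021-09-30 by the reprex package (v2.0.1)
Answer
I was able to extract the product descriptions from the first page with the following code (note: I used Windows for this example):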
library(RDCOMClient)
library(stringr)
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery")
Sys.sleep(5)
doc <- IEApp$Document()
Sys.sleep(5)
inner_Text <- doc$documentElement()$innerText()
Sys.sleep(5)
inner_Text_Splitted <- strsplit(inner_Text, "\n")[[1]]
bool_Backslach_R <- inner_Text_Splitted %in% c("\r", " \r", " \r")
inner_Text_Splitted <- inner_Text_Splitted[!bool_Backslach_R]
inner_Text_Splitted <- inner_Text_Splitted[!(nchar(inner_Text_Splitted) > 2000)]
index_Price_Candle <- which(str_detect(string = inner_Text_Splitted, pattern = "\\d{1,3}\\,\\d{0,2}[:space:]CA\\$"))
inner_Text_Splitted_Price <- inner_Text_Splitted[(min(index_Price_Candle) - 10) : (max(index_Price_Candle) + 10)]
inner_Text_Splitted_Price <- inner_Text_Splitted_Price[-(1 : 7)]
bool_To_Remove <- inner_Text_Splitted_Price %in% c("Chargement...\r", " Ajouter aux favoris \r",
"Publicité d'un créateur Etsy \r", " Ajouter aux favoris \r",
"Populaire \r", "Top vendeur\r")
inner_Text_Splitted_Price <- inner_Text_Splitted_Price[!bool_To_Remove]
bool_To_Remove <- str_detect(inner_Text_Splitted_Price, pattern = "[:space:]{5}\\(\\d{1,3}\\)[:space:]{1,5}\\r")
inner_Text_Splitted_Price <- inner_Text_Splitted_Price[!bool_To_Remove]
# More cleaning required but the information is there...
# Etsy was displayed in French for me.
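The line-filtering idea above can be tried offline on a small made-up vector. This is only a sketch in base R: the `raw` lines are invented stand-ins for the split `innerText`, and the patterns are simplified versions of the ones in the answer, not a drop-in replacement for them.

```r
# Hypothetical sample standing in for strsplit(inner_Text, "\n")[[1]]
raw <- c("Christmas candle\r", "\r", " \r", "12,50 CA$\r", "     (123) \r")

# Strip surrounding whitespace (including the trailing "\r") and drop empty lines
cleaned <- trimws(raw)
cleaned <- cleaned[nzchar(cleaned)]

# Drop lines that hold only a review count such as "(123)"
cleaned <- cleaned[!grepl("^\\(\\d{1,3}\\)$", cleaned, perl = TRUE)]

# Flag the lines that look like a price in the "12,50 CA$" format
is_price <- grepl("\\d{1,3},\\d{0,2}\\s*CA\\$", cleaned, perl = TRUE)
```

Trimming first means the later patterns no longer need to account for the trailing `\r` on each line.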
Also, if you want to go to another page, you can use:
# Here, we go to the second page, we added &ref=pagination&page=2 to the original link to go on the second page
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery&ref=pagination&page=2")
# Here, we go to the third page, we added &ref=pagination&page=3 to the original link to go on the third page
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery&ref=pagination&page=3")
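Since the question asks for the number of pages as an input, the page URLs can be generated rather than typed by hand. The sketch below is a minimal example; `etsy_search_urls` is a hypothetical helper name, and it assumes the `&ref=pagination&page=N` pattern shown above holds for every page after the first.

```r
# Hypothetical helper: build the search URL for each of the first n_pages pages
etsy_search_urls <- function(query, n_pages) {
  base <- "https://www.etsy.com/search"
  # Etsy's search URL separates the words of the query with "+"
  q <- gsub(" ", "+", query, fixed = TRUE)
  first <- paste0(base, "?q=", q, "&order=most_relevant&view_type=gallery")
  if (n_pages < 2) {
    return(first)
  }
  # Pages 2..n_pages reuse the first URL plus the pagination parameters
  c(first, paste0(first, "&ref=pagination&page=", 2:n_pages))
}

urls <- etsy_search_urls("Christmas candle", 3)
```

Each element of `urls` can then be passed to `IEApp$Navigate()` in turn, with a `Sys.sleep()` pause between pages as in the code above.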