R + rvest + polite for web scraping etsy.com

Problem description

I am completely new to web scraping, and I may be drowning in a teacup. I would like to automate the following:

  1. Run the following query on etsy.com:

https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery

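that is, simply search for "Christmas candle" on Etsy.

  2. Then retrieve the title and description of each product, ideally taking the number of result pages to include as an input to my function or pipeline.

I looked at the basic example at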

https://github.com/dmi3kno/polite

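but when I try to adapt it to my needs (see the reprex at the end of this post), it returns precisely... nothing!

Can anyone point me in the right direction? Many thanks!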

library(polite)
library(rvest)

session <- bow("https://www.cheese.com/by_type", force = TRUE)

result <- scrape(session, query=list(t="semi-soft", per_page=100)) %>%
    html_node("#main-body") %>% 
    html_nodes("h3") %>% 
    html_text()

result
#>   [1] "3-Cheese Italian Blend"            "Abbaye de Citeaux"                
#>   [3] "Abbaye du Mont des Cats"           "Adelost"                          
#>   [5] "ADL Brick Cheese"                  "Ailsa Craig"                      
#>   [7] "Airedale"                          "Aisy Cendre"                      
#>   [9] "Alpe di Frabosa"                   "Alpine Gold"                      
#>  [11] "Alta Badia"                        "Amablu Blue cheese"               
#>  [13] "Ameribella"                        "American Cheese"                  
#>  [15] "Ami du Chambertin"                 "Amsterdammer (British Columbia)"  
#>  [17] "Amul Pizza Mozzarella Cheese"      "Anthotyro Fresco"                 
#>  [19] "Aphrodite Haloumi "                "Appalachian"                      
#>  [21] "Applewood Smoked Chevre"           "Ardrahan"                         
#>  [23] "Armenian String Cheese"            "Aromes au Gene de Marc"           
#>  [25] "Asher Blue"                        "Asiago Pressato DOP"              
#>  [27] "Aura"                              "Azeitao"                          
#>  [29] "Baby Swiss"                        "Baluchon"                         
#>  [31] "Bandal"                            "Basajo"                           
#>  [33] "Basils Original Rauchkäse"         "Baskeriu"                         
#>  [35] "Basket Cheese"                     "Bassigny au porto"                
#>  [37] "Beaumont"                          "Beemster 2% Milk"                 
#>  [39] "Bel Paese"                         "Bergere Bleue"                    
#>  [41] "Bermuda Triangle"                  "Beyaz Peynir"                     
#>  [43] "Bica de Queijo"                    "Bierkase"                         
#>  [45] "Bijou"                             "Blarney Castle"                   
#>  [47] "Bleu Bénédictin"                   "Bleu d'Auvergne"                  
#>  [49] "Bleu Des Causses"                  "Bleu L'Ermite"                    
#>  [51] "Blue Benedictine"                  "Blue Lupine"                      
#>  [53] "Blue Rathgore"                     "Blue Vein (Australian)"           
#>  [55] "Blue Vein Cheese"                  "Blue Yonder"                      
#>  [57] "Bocconcini"                        "Boivin Marbled Cheddar"           
#>  [59] "Bossa"                             "Boulder Chevre"                   
#>  [61] "Brewer's Gold"                     "Brie de Melun"                    
#>  [63] "Brillat-Savarin"                   "Brin"                             
#>  [65] "Brin d'Amour"                      "Bruder Basil"                     
#>  [67] "Brunost"                           "Brutal Blue"                      
#>  [69] "Burwash Rose"                      "Buttercup"                        
#>  [71] "Butterkase"                        "Buttermilk Blue Affinee"          
#>  [73] "Buttermilk Gorgonzola"             "Caciobarricato"                   
#>  [75] "Cacio De Roma®"                    "Caciotta"                         
#>  [77] "Caciotta Al Tartufo"               "Cacow Belle"                      
#>  [79] "Calenzana (Calinzanincu)"          "Cambozola Grand Noir"             
#>  [81] "Cameo"                             "Cana de Cabra"                    
#>  [83] "Cape Vessey"                       "Capra al Fieno"                   
#>  [85] "Capra Nouveau"                     "Cardo "                           
#>  [87] "Carr Valley Glacier Wildfire Blue" "Casatica"                         
#>  [89] "Casciotta di Urbino"               "Cashel Blue"                      
#>  [91] "Castelo Branco"                    "Castle Blue"                      
#>  [93] "Celtic Promise"                    "Chabichou du Poitou"              
#>  [95] "Charolais"                         "Chaumes"                          
#>  [97] "Chevre"                            "Chevre en Marinade"               
#>  [99] "Chile Caciotta"                    "Chile Jack"

## My naive attempt to adapt the code to etsy.com fails miserably

session_etsy <- bow("https://www.etsy.com", force = TRUE)

result_etsy <- scrape(session_etsy, query=list(t="Christmas candle", per_page=100)) %>% html_node("#main-body") %>% 
    html_nodes("h3") %>%
    html_text()


result_etsy
#> character(0)

Created on 2021-09-30 by the reprex package (v2.0.1)

Tags: r, web-scraping, rvest

Solution


I was able to extract the product descriptions from the first page with the following code (note: I used Windows for this example):

library(RDCOMClient)
library(stringr)
# Drive Internet Explorer through COM (Windows only)
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery")
Sys.sleep(5)  # give the page time to load
doc <- IEApp$Document()
Sys.sleep(5)
# Grab the rendered text of the whole page
inner_Text <- doc$documentElement()$innerText()
Sys.sleep(5)

# Split the page text into lines, then drop blank lines and very long lines
inner_Text_Splitted <- strsplit(inner_Text, "\n")[[1]]
bool_Backslach_R <- inner_Text_Splitted %in% c("\r", " \r", "  \r")
inner_Text_Splitted <- inner_Text_Splitted[!bool_Backslach_R]
inner_Text_Splitted <- inner_Text_Splitted[!(nchar(inner_Text_Splitted) > 2000)]
# Find the lines that contain a price such as "12,34 CA$"
index_Price_Candle <- which(str_detect(string = inner_Text_Splitted, pattern = "\\d{1,3}\\,\\d{0,2}[:space:]CA\\$"))
# Keep the lines from 10 before the first price to 10 after the last one, then drop the first 7 of those
inner_Text_Splitted_Price <- inner_Text_Splitted[(min(index_Price_Candle) - 10) : (max(index_Price_Candle) + 10)]
inner_Text_Splitted_Price <- inner_Text_Splitted_Price[-(1 : 7)]

bool_To_Remove <- inner_Text_Splitted_Price %in% c("Chargement...\r", "    Ajouter aux favoris   \r",
                                                   "Publicité d'un créateur Etsy \r", "    Ajouter aux favoris   \r",
                                                   "Populaire  \r", "Top vendeur\r")

inner_Text_Splitted_Price <- inner_Text_Splitted_Price[!bool_To_Remove]
bool_To_Remove <- str_detect(inner_Text_Splitted_Price, pattern = "[:space:]{5}\\(\\d{1,3}\\)[:space:]{1,5}\\r")
inner_Text_Splitted_Price <- inner_Text_Splitted_Price[!bool_To_Remove]

# More cleaning is required, but the information is there...
# Etsy was displayed in French for me.
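
As for the remaining cleaning mentioned in the comment above, a minimal base R sketch might look like the following. It assumes the leftover lines still carry trailing `\r` characters and padding, as the strings in `bool_To_Remove` suggest. The helper `clean_lines` and the sample vector are hypothetical and not part of the original answer.

```r
# Hypothetical helper: strip carriage returns and padding, then drop empty lines.
clean_lines <- function(x) {
  x <- trimws(gsub("\r", "", x, fixed = TRUE))
  x[nzchar(x)]
}

# Made-up sample lines shaped like the page text above
sample_lines <- c("  Christmas candle  \r", " \r", "12,34 CA$\r")
cleaned <- clean_lines(sample_lines)
print(cleaned)
```

With the objects from the code above, this would presumably be applied as `inner_Text_Splitted_Price <- clean_lines(inner_Text_Splitted_Price)`.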

Also, if you want to go to another page, you can use

# Here, we go to the second page, we added &ref=pagination&page=2 to the original link to go on the second page
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery&ref=pagination&page=2")

# Here, we go to the third page, we added &ref=pagination&page=3 to the original link to go on the third page
IEApp$Navigate("https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery&ref=pagination&page=3")
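
Since the question asks for the number of pages as an input, the URL pattern above can be wrapped in a small function. The following is only a sketch under stated assumptions: `etsy_search_urls` and `scrape_titles` are hypothetical names, and the `ref=pagination&page=N` pattern is taken from the two URLs above rather than from any Etsy documentation. The second function shows how the attempt in the question might be pointed at the `/search` path with a `q` parameter, since the `t` and `per_page` parameters and the `#main-body` selector come from the cheese.com example. It is not run here, the `h3` selector is a guess carried over from the question, and the site may return nothing to a non-browser client.

```r
# Hypothetical helper: build the search URL for each result page.
etsy_search_urls <- function(query, n_pages = 1) {
  stopifnot(n_pages >= 1)
  # Encode the query, then use "+" for spaces as in the URL above
  q <- gsub("%20", "+", utils::URLencode(query, reserved = TRUE), fixed = TRUE)
  first <- paste0("https://www.etsy.com/search?q=", q,
                  "&order=most_relevant&view_type=gallery")
  if (n_pages == 1) return(first)
  c(first, paste0(first, "&ref=pagination&page=", 2:n_pages))
}

urls <- etsy_search_urls("Christmas candle", n_pages = 3)
print(urls)

# Hypothetical polite/rvest variant (defined only, not called here).
# Assumptions: the "h3" selector is a placeholder that probably needs replacing
# after inspecting the page, and the request may come back empty.
scrape_titles <- function(query, page = 1) {
  session <- polite::bow("https://www.etsy.com/search")
  html <- polite::scrape(session, query = list(q = query, page = page))
  if (is.null(html)) return(character(0))  # guard against an empty result
  rvest::html_text(rvest::html_nodes(html, "h3"))
}
```

With the browser session from the answer, each page could then be visited in turn, for example `for (u in urls) { IEApp$Navigate(u); Sys.sleep(5) }`, repeating the text extraction after each navigation.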
