How do I scrape data from this specific web page and save the output in a data frame?

Problem description

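I am new to web scraping in R and I need help with this task. I am trying to scrape data from this specific web page, but I am stuck at one particular point in the process.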

Here is the URL: web page (the full address appears in the code below)

Basically, I am trying to capture 3 elements from the web page:

(1) Room type (CSS selector: .room h3)

(2) Meal plan (CSS selector: .meal-plan-title)

(3) Price (CSS selector: .price)

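I have been able to extract these values from the web page. However, I am struggling to pair them up the way they are displayed on the page.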

Here is where my R code currently stands:

library(rvest)
library(dplyr)
library(stringr)
library(tables)

MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
#html_nodes(".meal-plan-text") %>%
html_nodes(".meal-plan-title") %>%
html_text()

MealPlan

Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
  html_nodes(".price") %>%
  html_text()

Price


RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
  html_nodes(".room h3") %>%
  html_text()

RoomType

I would like to have the output in a data frame, as follows:

   RoomType               MealPlan         Price

Chambre Standard     Petit Dej.+Diner    584 € / pers
Chambre Standard     All inclusive       864 € / pers
Chambre Confort      Petit Dej.+Diner    715 € / pers
Chambre Confort      All inclusive       995 € / pers
Bungalow             Petit Dej.+Diner    781 € / pers
Bungalow             All inclusive       1061 € / pers
Chambre Deluxe       Petit Dej.+Diner    847 € / pers
Chambre Deluxe       All inclusive       1127 € / pers

Any help would be greatly appreciated.

Tags: r, web-scraping, rvest

Solution


This is a somewhat slow way to do it. I added the trim = TRUE argument to html_text() to strip the extra whitespace.
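
To illustrate what trim = TRUE does, here is a minimal sketch on a made-up HTML fragment (an assumption for illustration, not markup taken from the real page):

```r
library(rvest)

# Made-up fragment: real pages often pad text nodes with newlines and spaces.
node <- read_html("<h3>\n      Chambre Standard\n   </h3>") %>%
  html_nodes("h3")

html_text(node)               # keeps the surrounding newlines and spaces
html_text(node, trim = TRUE)  # leading and trailing whitespace removed
```

Only leading and trailing whitespace is removed; spaces inside the text are left alone.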

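One problem with MealPlan is that several entries carry the class .noprice. One way to exclude them would be to use xpath in html_nodes instead of a CSS selector. I don't know whether there is a way to do this with CSS selectors. What I did below is extract both sets and then take their set difference.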
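
The XPath route mentioned above could look roughly like the sketch below. It runs against a tiny hypothetical fragment, not the real page: the markup, and in particular the assumption that noprice sits on an ancestor of .meal-plan-text, are guesses based on the selector .noprice .meal-plan-text. The helper has_class is also introduced here purely for illustration.

```r
library(rvest)

# Hypothetical markup, NOT copied from the real page; the structure is assumed
# only to demonstrate the selector.
page <- read_html('
<div class="room">
  <div class="meal-plan"><span class="meal-plan-text"> Petit Dej.+Diner </span></div>
  <div class="meal-plan noprice"><span class="meal-plan-text"> Pension complete </span></div>
  <div class="meal-plan"><span class="meal-plan-text"> All inclusive </span></div>
</div>')

# Illustrative helper: XPath test that matches a whole class token,
# not just a substring of the class attribute.
has_class <- function(cls) {
  sprintf("contains(concat(' ', normalize-space(@class), ' '), ' %s ')", cls)
}

# Keep .meal-plan-text nodes that have no ancestor carrying the noprice class.
xp <- sprintf("//*[%s][not(ancestor::*[%s])]",
              has_class("meal-plan-text"), has_class("noprice"))

priced_plans <- page %>%
  html_nodes(xpath = xp) %>%
  html_text(trim = TRUE)

priced_plans
```

Unlike setdiff(), this filter keeps duplicates and document order, so a plan that is unavailable for one room but priced for another is not dropped everywhere.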

For the price, I use a regular expression to remove the extra space inside the number.
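
As a small worked example of that regular expression, the sketch below applies it to sample strings modelled on the prices in the question. The non-breaking space in the third string is an assumption about how such pages sometimes separate thousands, and the numeric conversion at the end is an optional extra that is not part of the original answer.

```r
library(stringr)

# Sample strings modelled on the question; "\u00a0" is a non-breaking space
# (assumed here as a possible thousands separator).
raw_prices <- c("584 € / pers", "1 061 € / pers", "1\u00a0127 € / pers")

# Remove the first whitespace character that sits between two digits.
clean_prices <- str_replace(raw_prices, "(\\d)\\s(\\d)", "\\1\\2")
clean_prices

# Optional: pull out the leading number for arithmetic or sorting.
price_eur <- as.numeric(str_extract(clean_prices, "\\d+"))
price_eur
```

str_replace() only touches the first match, which is enough for prices below one million; str_replace_all() would be needed for numbers with several separator groups.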

library(rvest)
library(dplyr)
library(stringr)

url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D="

Price <- read_html(url) %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>% 
  str_replace("(\\d)\\s(\\d)", "\\1\\2")

RoomType <- read_html(url) %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE)

AllMealPlans <- read_html(url) %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)

MealPlansNoPrice <- read_html(url) %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)

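# setdiff() drops every meal plan that also appears under .noprice
# and de-duplicates the remaining ones
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice)

NumberMealPlans <- length(MealPlan)
NumberRoomTypes <- length(RoomType)

# Cycle through the full list of meal plans once per room type
MealPlanColumn <- MealPlan %>% rep(times = NumberRoomTypes)

# Repeat each room type once per meal plan
RoomTypeColumn <- RoomType %>%
  rep(each = NumberMealPlans)

# Assumes Price holds one entry per room/meal-plan pair, in page order
bind_cols(RoomType = RoomTypeColumn, MealPlan = MealPlanColumn, Price = Price)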
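
To see how the rep() calls line the columns up without hitting the website, here is an offline sketch that uses the values from the desired output in the question. The length check is an addition for illustration: the approach only works if every room type offers every meal plan.

```r
library(dplyr)

# Values taken from the desired output in the question.
RoomType <- c("Chambre Standard", "Chambre Confort", "Bungalow", "Chambre Deluxe")
MealPlan <- c("Petit Dej.+Diner", "All inclusive")
Price <- c("584 € / pers", "864 € / pers", "715 € / pers", "995 € / pers",
           "781 € / pers", "1061 € / pers", "847 € / pers", "1127 € / pers")

# The rep() trick only lines up if there is one price per room/meal-plan pair.
stopifnot(length(Price) == length(RoomType) * length(MealPlan))

result <- tibble(
  RoomType = rep(RoomType, each = length(MealPlan)),   # A A B B C C D D
  MealPlan = rep(MealPlan, times = length(RoomType)),  # x y x y x y x y
  Price    = Price
)
result
```

If the scraped vectors ever fail the length check, the page probably has a room with a missing or extra meal plan, and the columns would need to be extracted room by room instead.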
