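r - How do I scrape data from this specific web page and save the output in a data frame?
Problem description
I am new to web scraping and to R, and I need help with this task. I am trying to scrape data from this particular web page, but I am stuck at one specific point in the process.
Here is the URL: Webpage
Basically, I am trying to capture 3 elements from the page:
(1) Room type (CSS selector: .room h3)
(2) Meal plan (CSS selector: .meal-plan-title)
(3) Price (CSS selector: .price)
I have been able to extract these values from the page. However, I am struggling to line them up with each other the way they are displayed on the page.
Here is where my R code stands: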
library(rvest)
library(dplyr)
library(stringr)
library(tables)
MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
  #html_nodes(".meal-plan-text") %>%
  html_nodes(".meal-plan-title") %>%
  html_text()
MealPlan
Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
  html_nodes(".price") %>%
  html_text()
Price
RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
  html_nodes(".room h3") %>%
  html_text()
RoomType
I would like the output in a data frame like this:
RoomType          MealPlan          Price
Chambre Standard  Petit Dej.+Diner  584 € / pers
Chambre Standard  All inclusive     864 € / pers
Chambre Confort   Petit Dej.+Diner  715 € / pers
Chambre Confort   All inclusive     995 € / pers
Bungalow          Petit Dej.+Diner  781 € / pers
Bungalow          All inclusive     1061 € / pers
Chambre Deluxe    Petit Dej.+Diner  847 € / pers
Chambre Deluxe    All inclusive     1127 € / pers
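Any help would be greatly appreciated.
Solution
This answer takes a somewhat slow approach. I added the trim = TRUE argument to html_text to remove the extra whitespace.
One problem with MealPlan is that several of the meal plans sit inside elements with the class .noprice. One way to exclude them is to use xpath in html_nodes instead of a CSS selector. I don't know whether there is a way to do this with a CSS selector. What I did below is extract both sets (all meal plans, and the ones under .noprice) and then take the set difference between them.
For the price, I use a regular expression to remove the extra space inside the number.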
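The XPath route mentioned above is not what the script that follows uses, so here is a minimal sketch of it. The HTML string is hypothetical (the real page's markup is an assumption, in particular which element carries the noprice class); only the class names come from the question. The last step also shows that a CSS :not() selector appears to work when you know which element carries the class.

```r
library(rvest)

# Hypothetical markup: the structure of the real page is assumed, not known.
html <- '<div class="room">
  <h3>Chambre Standard</h3>
  <ul>
    <li class="meal-plan"><span class="meal-plan-text"> Petit Dej.+Diner </span></li>
    <li class="meal-plan noprice"><span class="meal-plan-text"> Pension complete </span></li>
    <li class="meal-plan"><span class="meal-plan-text"> All inclusive </span></li>
  </ul>
</div>'

page <- read_html(html)

# Keep .meal-plan-text nodes that have no ancestor with the noprice class.
# The concat/normalize-space idiom matches a class name as a whole token.
priced_xpath <- paste0(
  "//*[contains(concat(' ', normalize-space(@class), ' '), ' meal-plan-text ')]",
  "[not(ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' noprice ')])]"
)

priced_plans <- page %>%
  html_nodes(xpath = priced_xpath) %>%
  html_text(trim = TRUE)

# CSS alternative: only valid if noprice sits on the direct parent (assumed here).
priced_plans_css <- page %>%
  html_nodes("li:not(.noprice) > .meal-plan-text") %>%
  html_text(trim = TRUE)

priced_plans
```

The XPath version does not depend on how deep .meal-plan-text is nested under .noprice, which is why it may be the safer choice when the markup is not fully known.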
library(rvest)
library(dplyr)
library(stringr)

url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D="

# Download and parse the page once, then reuse it for every selector
page <- read_html(url)

# Prices: trim whitespace and drop the space inside the number ("1 061" -> "1061")
Price <- page %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>%
  str_replace("(\\d)\\s(\\d)", "\\1\\2")

RoomType <- page %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE)

# All meal plans, including the ones listed without a price
AllMealPlans <- page %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)

# Meal plans nested inside a .noprice element
MealPlansNoPrice <- page %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)

# Keep only the meal plans that have a price (setdiff also drops duplicates)
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice)

NumberMealPlans <- length(MealPlan)
NumberRoomTypes <- length(RoomType)

# Pair every room type with every meal plan; this assumes each room offers
# the same priced meal plans, in the same order as the prices on the page
MealPlanColumn <- rep(MealPlan, times = NumberRoomTypes)
RoomTypeColumn <- rep(RoomType, each = NumberMealPlans)

bind_cols(RoomType = RoomTypeColumn, MealPlan = MealPlanColumn, Price = Price)
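The rep() pairing above relies on every room offering the same priced meal plans. If that ever stops being true, one possible alternative is to walk the page room by room so each price stays attached to the room and meal plan it was found with. The sketch below is not the answer's method; it runs on a hypothetical HTML string, and the .meal-plan wrapper class, the placement of noprice on that wrapper, and the names rooms, per_room and result are all assumptions for illustration.

```r
library(rvest)

# Hypothetical markup standing in for the real page (structure is assumed).
html <- '<div class="room">
  <h3> Chambre Standard </h3>
  <div class="meal-plan"><span class="meal-plan-text">Petit Dej.+Diner</span><span class="price">584 € / pers</span></div>
  <div class="meal-plan noprice"><span class="meal-plan-text">Pension complete</span></div>
  <div class="meal-plan"><span class="meal-plan-text">All inclusive</span><span class="price">864 € / pers</span></div>
</div>
<div class="room">
  <h3> Bungalow </h3>
  <div class="meal-plan"><span class="meal-plan-text">All inclusive</span><span class="price">1 061 € / pers</span></div>
</div>'

page  <- read_html(html)
rooms <- html_nodes(page, ".room")

# Build one small data frame per room.
per_room <- lapply(rooms, function(room) {
  # Meal-plan rows that do not carry the noprice class (assumed placement).
  rows  <- html_nodes(room, ".meal-plan:not(.noprice)")
  name  <- html_text(html_node(room, "h3"), trim = TRUE)
  price <- html_text(html_node(rows, ".price"), trim = TRUE)
  data.frame(
    RoomType = rep(name, length(rows)),
    MealPlan = html_text(html_node(rows, ".meal-plan-text"), trim = TRUE),
    # Remove whitespace between digits, e.g. "1 061" -> "1061".
    Price = gsub("(\\d)\\s+(\\d)", "\\1\\2", price),
    stringsAsFactors = FALSE
  )
})

# Stack the per-room pieces into a single data frame.
result <- do.call(rbind, per_room)
result
```

Because each row is assembled inside its own room node, a room with fewer meal plans simply contributes fewer rows instead of shifting the prices of the rooms after it.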