Unable to extract the complete data using rvest

Problem description

I am trying to scrape flight prices from the Expedia website using rvest, with SelectorGadget to find the CSS selectors. Here is my code:


library(rvest)
library(lubridate)  

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

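# Download and parse the search results page
webpage <- read_html(url)

# Select the departure-time nodes and extract their text
departure_time_data_html <- html_nodes(webpage, '.medium-bold span:nth-child(1)')
departure_time_data <- html_text(departure_time_data_html)
departure_time_data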

[1] "11:40am" "7:45am" "6:29am" "6:00am" "5:55am"

On the actual website, a single page has 42 entries, but the code extracts only 5 values. Here is the link to the site:

https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A6%2F10%2F2018TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com

I would be glad to hear from any of you. Thank you.

Tags: r, web-scraping, rvest

Solution


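What happens is that the website stores the data as a JSON string, which the browser then parses to render the results. So you can extract the information directly from that JSON string. (The page source is shown below.)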

[Image: page source]
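To illustrate the explanation above without hitting the live site, here is a minimal offline sketch. The miniature page, the leg keys (`a1`, `b2`, `c3`), and the `departureTime` layout are all made up for illustration; only the `#cachedResultsJson` id and the nested `content` field are taken from this answer. It shows how a CSS selector can find only the few entries present as HTML, while the embedded JSON string carries the full set.

```r
library(rvest)
library(jsonlite)

# Hypothetical inner payload: three flights, keyed by made-up leg IDs.
inner <- toJSON(list(legs = list(
  a1 = list(departureTime = "7:05am"),
  b2 = list(departureTime = "5:30pm"),
  c3 = list(departureTime = "7:50pm")
)), auto_unbox = TRUE)

# The outer JSON stores the inner JSON as a *string* in `content`,
# which is why it has to be parsed twice.
outer <- toJSON(list(content = as.character(inner)), auto_unbox = TRUE)

# Hypothetical page: only one flight exists as HTML, the rest is in the script.
page <- read_html(paste0(
  '<html><body>',
  '<div class="medium-bold"><span>7:05am</span></div>',
  '<script id="cachedResultsJson" type="application/json">', outer, '</script>',
  '</body></html>'
))

# The CSS selector sees only what is present as HTML nodes.
rendered <- html_text(html_nodes(page, ".medium-bold span:nth-child(1)"))

# The embedded JSON holds every flight: parse the outer string, then `content`.
embedded <- fromJSON(fromJSON(html_text(html_node(page, "#cachedResultsJson")))$content)
times <- vapply(embedded$legs, function(leg) leg$departureTime, character(1))

length(rendered)  # number of entries found via the selector
length(times)     # number of entries found in the JSON
```

The real page is much larger and its JSON is nested differently, so treat this only as a picture of the mechanism.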

library(rvest)
library(jsonlite)
library(purrr)

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

webpage <- read_html(url)

departure_time_data_html <- html_node(webpage, '#cachedResultsJson') # node that holds the JSON string
json_text <- departure_time_data_html %>% html_text() # get the JSON string as text

result <- fromJSON(json_text) # parse the outer JSON string into a list
result1 <- fromJSON(result$content) # `content` is itself a JSON string, so parse it again

# A sample of how to extract the info for one flight, addressed by its leg ID
result1$legs$`0c46a88d484464ad78b9a0985e80ab4e`$timeline$departureTime

map(result1$legs, ~ .x$timeline$departureTime) # extract the info for all flights using map
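The original question asked for a flat vector of departure times, one per flight. A possible way to get there from the per-leg structure is sketched below. The `legs` list here is a hand-built stand-in that only mimics the `timeline$departureTime$time` path seen in this answer, and it assumes that the first row of each timeline is the flight's initial departure, which the source does not confirm.

```r
library(purrr)

# Hypothetical stand-in for result1$legs: made-up leg names, same nesting
# as the path used above (leg -> timeline -> departureTime -> time).
legs <- list(
  leg_a = list(timeline = list(departureTime = data.frame(
    time = c("7:05am", NA, "9:02am"), stringsAsFactors = FALSE))),
  leg_b = list(timeline = list(departureTime = data.frame(
    time = "5:30pm", stringsAsFactors = FALSE)))
)

# Take the first departure time of every leg (assumed to be the initial departure).
first_departure <- map_chr(legs, ~ .x$timeline$departureTime$time[1])

# One row per flight, keyed by leg ID.
flights <- data.frame(
  leg = names(first_departure),
  departure = unname(first_departure),
  stringsAsFactors = FALSE
)
flights
```

With the real data, `legs` would be replaced by `result1$legs`; `map_chr()` stops with an error if a leg does not yield exactly one string, which makes unexpected structures easy to notice.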

Sample output:

> map(result1$legs,~ .x$timeline$departureTime)
$`0c46a88d484464ad78b9a0985e80ab4e`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:05am 1.528632e+12   06/10/18 2018-06-10T07:05:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:02am 1.528639e+12   06/10/18 2018-06-10T09:02:00.000-05:00   NA

$`90341ad9782711784a797ffeb22a5e44`
       date dateLongStr   time    dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:30pm 1.52867e+12   06/10/18 2018-06-10T17:30:00.000-05:00   NA

$c40e4d757819356926cc693ca1820827
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:50pm 1.528678e+12   06/10/18 2018-06-10T19:50:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:42pm 1.528685e+12   06/10/18 2018-06-10T21:42:00.000-05:00   NA

$`83d7b1595e668e9c4fa886b164202f37`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:54pm 1.528671e+12   06/10/18 2018-06-10T17:54:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 7:45pm 1.528678e+12   06/10/18 2018-06-10T19:45:00.000-05:00   NA
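The sample output above may need a little cleanup before use. The sketch below rests on two assumptions that the source does not state: that the all-NA rows are placeholders between flight segments (for example layovers), and that `dateTime` is a Unix timestamp in milliseconds. The second assumption is at least consistent with the printed values: 1.528632e+12 ms falls on 2018-06-10. The full-precision numbers below are reconstructed from the printed times and are illustrative only.

```r
# Hypothetical copy of one leg's departureTime table (values reconstructed
# from the rounded sample output, not taken from the live site).
dep <- data.frame(
  time = c("7:05am", NA, "9:02am"),
  dateTime = c(1528632300000, NA, 1528639320000),
  stringsAsFactors = FALSE
)

# Drop the all-NA placeholder rows, keeping only actual departures.
segments <- dep[!is.na(dep$time), ]

# Convert milliseconds since the Unix epoch to POSIXct (shown in UTC).
segments$departure_utc <- as.POSIXct(
  segments$dateTime / 1000, origin = "1970-01-01", tz = "UTC"
)

format(segments$departure_utc, "%Y-%m-%d %H:%M")
```

The times come out in UTC, five hours ahead of the `-05:00` offset shown in the `isoStr` column, so 7:05am local appears as 12:05.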
