r - How to download multiple files with the same name from an HTML page?
Question
I want to download all the files named "listings.csv.gz" from http://insideairbnb.com/get-the-data.html; these files refer to US cities. I could do it by writing out every link by hand, but is it possible to do it in a loop?
In the end, I will keep only a few columns from each file and merge them all into one file.
Since @CodeNoob solved the problem, I'd like to share how it was solved:
library(rvest)
library(dplyr)
library(purrr)

page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for listings.csv.gz, USA cities, data for March 2019
wanted <- grep('listings.csv.gz', links)
USA <- grep('united-states', links)
wanted.USA <- wanted[wanted %in% USA]
wanted.links <- links[wanted.USA]
wanted.links <- grep('2019-03', wanted.links, value = TRUE)

wanted.cols <- c("host_is_superhost", "summary", "host_identity_verified", "street",
                 "city", "property_type", "room_type", "bathrooms",
                 "bedrooms", "beds", "price", "security_deposit", "cleaning_fee",
                 "guests_included", "number_of_reviews", "instant_bookable",
                 "host_response_rate", "host_neighbourhood",
                 "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness",
                 "review_scores_checkin", "review_scores_communication",
                 "review_scores_location", "review_scores_value", "space",
                 "description", "host_id", "state", "latitude", "longitude")

# Download one gzipped CSV, keep only the wanted columns,
# and record which URL each row came from
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(all_of(wanted.cols)) %>%
    mutate(source.url = link)
  df
}

all.df <- list()
for (i in seq_along(wanted.links)) {
  all.df[[i]] <- read.gz.url(wanted.links[i])
}
all.df <- map(all.df, as_tibble)
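The code above ends with a list of tibbles, while the stated goal was a single merged file. A minimal sketch of that last step, assuming all.df holds the tibbles produced above (the output file name is my own choice):

```r
library(dplyr)

# Combine the per-city tibbles into one data frame;
# bind_rows() matches columns by name
merged <- bind_rows(all.df)

# Write the merged result to a single CSV file
write.csv(merged, "listings_usa_2019-03.csv", row.names = FALSE)
```

Because every tibble was built with the same wanted.cols plus source.url, the columns line up and bind_rows() needs no further arguments.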
Solution
You can actually extract all the links, filter for the ones containing listings.csv.gz, and then download them in a loop:
library(rvest)
library(dplyr)

# Get all download links
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for listings.csv.gz
wanted <- grep('listings.csv.gz', links)
wanted.links <- links[wanted]

for (link in wanted.links) {
  con <- gzcon(url(link))
  txt <- readLines(con)
  close(con)
  df <- read.csv(textConnection(txt))
  # Do what you want
}
Example: download and combine the files
To get the result you want, I'd suggest writing a download function that filters for the columns you want, then combining the results into one data frame, for example:
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(c('calculated_host_listings_count_shared_rooms', 'cancellation_policy')) %>% # random columns I chose
    mutate(source.url = link) # You may need to remember the origin of each row
  df
}
all.df <- do.call('rbind', lapply(head(wanted.links,2), read.gz.url))
Note: I only tested the first two files because they are quite large.
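Since the files are large, it may also be worth caching them on disk so an interrupted run does not re-download everything. A sketch using base R's download.file(); the cache directory name is my own choice:

```r
dir.create("gz_cache", showWarnings = FALSE)

for (link in wanted.links) {
  # Build a unique local file name from the URL path
  dest <- file.path("gz_cache", gsub("[/:]", "_", link))
  if (!file.exists(dest)) {
    download.file(link, dest, mode = "wb")
  }
}

# read.csv() decompresses .gz files transparently when reading from disk,
# so the cached copies can be read back directly, e.g.:
# dfs <- lapply(list.files("gz_cache", full.names = TRUE), read.csv)
```

This separates fetching from parsing, so column selection and merging can be re-run without touching the network.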