R: Web-scraping multiple zipped files

Problem Description

This has been asked before, but I haven't found a solution. I want to scrape some zipped .dat files from a website. However, this is where I'm at:

library(XML)

url <- "http://blablabla"
zipped <- htmlParse(url)

# grab every <a> node, then keep only the hrefs that end in .zip
nodes_a <- getNodeSet(zipped, "//a")
files <- grep("\\.zip$",
              sapply(nodes_a, function(node) xmlGetAttr(node, "href")),
              value = TRUE)

# build absolute URLs for the matched files
urls <- paste(url, files, sep = "")

Then I use this:

mapply(function(x, y) download.file(x, y), urls, files)

This is the error message I get:

Error in mapply(function(x, y) download.file(x, y), urls, files) : 
 zero-length inputs cannot be mixed with those of non-zero length

Any hints?

Tags: r, web-scraping, zip

Solution


(As an aside, the reason the original code errors out: that page almost certainly exposes no direct .zip links, so files comes back empty; paste() then recycles the zero-length vector to "", leaving urls at length 1 while files stays at length 0, which is exactly the mixed-length input mapply() refuses.)

The (pretty much useless) "please give us your email" page introduces a wrinkle: we have to maintain session state for any further navigation or downloading, so we first go to the page with the registration form and scrape a value from it (ostensibly there for security purposes) that has to be passed along with the next request:

library(curlconverter)
library(xml2)
library(httr)
library(rvest)

pg <- read_html("https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration")

# pull the hidden _authenticator token out of the registration form
html_nodes(pg, "input[name='_authenticator']") %>% 
  html_attr("value") -> authenticator

I looked at the POST request the form makes using curlconverter (search SO to see how to use it, or read the GitLab project site) and came up with:

httr::POST(
  url = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration",
  httr::add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0",
    Referer = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"
  ),
  httr::set_cookies(`restriction-/projects/china/data/datasets/data_downloads` = "/projects/china/data/datasets/data_downloads"),
  body = list(
    `first-name` = "Steve",
    `last-name` = "Rogers",
    `email-address` = "example@me.com",
    `interest` = "a researcher",
    `org` = "The Avengers",
    `department` = "Operations",
    `postal-address` = "1 Avengers Drive",
    `city-name` = "Undisclosed",
    `state-province` = "Virginia",
    `postal-code` = "09911",
    `country-name` = "US",
    `opt-in:boolean:default` = "",
    `fieldset` = "default",
    `form.submitted` = "1",
    `add_reference.field:record` = "",
    `add_reference.type:record` = "",
    `add_reference.destination:record` = "",
    `last_referer` = "https://www.cpc.unc.edu/projects/china/data/datasets",
    `_authenticator` = authenticator,
    `form_submit` = "Submit"
  ), 
  encode = "multipart"
) -> res

(curlconverter makes ^^ for you from a simple "Copy as cURL" of the particular request in your browser's Developer Tools.)

Hopefully you can see where authenticator comes in.
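One small addition of my own: before poking around the response, fail fast if the server didn't accept the form:

# added check (not in the original): raises an R error on any HTTP 4xx/5xx
httr::stop_for_status(res)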

Now we've got what we need to grab the files.

First we need to get to the downloads page:

# find the link inside the "You may now ..." paragraph of the response page
read_html(httr::content(res, as = "text")) %>% 
  html_nodes(xpath=".//p[contains(., 'You may now')]/strong/a") %>% 
  html_attr("href") -> dl_pg_link

dl_pg <- httr::GET(url = dl_pg_link)

Then we need to get to the real download page:

httr::content(dl_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes(xpath=".//a[contains(@class, 'contenttype-folder state-published url')]") %>% 
  html_attr("href") -> dls

Then we need to get all the downloadable bits from that page:

zip_pg <- httr::GET(url = dls)

httr::content(zip_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes("td > a") %>% 
  html_attr("href") %>% 
  # rewrite the trailing "view" into the site's direct-download path
  gsub("view$", "at_download/file", .) -> dl_links

Here's how to get the first one:

(fil1 <- httr::GET(dl_links[1]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/at_download/file]
##   Date: 2018-10-14 03:03
##   Status: 200
##   Content-Type: application/pdf
##   Size: 197 kB
## <BINARY BODY>

fil1$headers[["content-disposition"]]
## [1] "attachment; filename=\"weights-chns.pdf\""

writeBin(
  httr::content(fil1, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil1$headers[["content-disposition"]], "=")[[1]][2]))
)

(That one is a PDF.) And here's how to get the second one, which is a ZIP:

(fil2 <- httr::GET(dl_links[2]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/at_download/file]
##   Date: 2018-10-14 03:06
##   Status: 200
##   Content-Type: application/zip
##   Size: 2.37 MB
## <BINARY BODY>

fil2$headers[["content-disposition"]]
## [1] "attachment; filename=\"Biomarker_2012Dec.zip\""

writeBin(
  httr::content(fil2, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil2$headers[["content-disposition"]], "=")[[1]][2]))
)

You can turn ^^ into an iterative operation.
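Here's a minimal sketch of that iteration, assuming the same ~/Data target directory as above; save_dir and dl_one are hypothetical names of mine, not from the original answer:

save_dir <- "~/Data"

# hypothetical helper: fetch one link and save it under the name the
# server supplies in its content-disposition header
dl_one <- function(link) {
  res <- httr::GET(link)
  httr::stop_for_status(res)
  cd <- res$headers[["content-disposition"]]  # e.g. attachment; filename="Biomarker_2012Dec.zip"
  fname <- gsub('"', '', strsplit(cd, "=")[[1]][2])
  writeBin(httr::content(res, as = "raw"), file.path(save_dir, fname))
  invisible(fname)
}

# be polite to the server: pause a couple of seconds between requests
invisible(lapply(dl_links, function(l) { Sys.sleep(2); dl_one(l) }))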

Note that you'll have to start from the top (i.e. the enter-your-email form page) every time you start a new R session, since curl, the underlying package (it powers httr and rvest), maintains session state for you (in cookies).

