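html - Fill out an html/xml form with R and download the report file
Problem description
Goal
I am trying to automatically fill out this form and download commercial fisheries landings data (landing page = https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200::::::, page source = view-source:https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::) from R.
Variables (I think these are all of them?)
Below are some example inputs I need to select on the site. I tried to work out both what the user sees ([User Selection Box Name] = [Example User Selection]) and what is read behind the scenes, based on devtools and the page source ([HTML ID] = [Example HTML Selection Code]). Hopefully I got these right.
Year = "2011" (P200_YEAR = "2011")
Area Type = "State" (P200_GEO_LOV = "1025")
Landing State = "Alabama" (P200_GEOGRAPHY = "01")
Species = "All Species" (P200_SPECIES = "ALL_SP")
Run Report = [button] (P200_go = "")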
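To repeat the same query for several years or states, the selections listed above can be generated by a small helper. This is only a sketch: `make_landings_fields` is a hypothetical name, the defaults are the example values from the list, and nothing here confirms that the server accepts these fields in this form. Every value is kept as a character string so that codes such as "01" keep their leading zero.

```r
# Hypothetical helper: build the form fields for one query as a named list.
# Field names and example codes are taken from the list above; the spelling
# of the button field (P200_go vs. p200_go) is uncertain and kept as listed.
make_landings_fields <- function(year, geography,
                                 geo_lov = "1025",    # area type: state
                                 species = "ALL_SP") { # all species
  # Codes like "01" must stay character, or the leading zero is lost.
  stopifnot(is.character(geography), length(geography) == 1)
  list(
    P200_YEAR      = as.character(year),
    P200_GEO_LOV   = geo_lov,
    P200_GEOGRAPHY = geography,
    P200_SPECIES   = species,
    P200_go        = ""
  )
}

# One query (Alabama, 2011) and a batch of queries for several years.
fields  <- make_landings_fields(2011, "01")
queries <- lapply(2010:2012, make_landings_fields, geography = "01")
str(fields)
```

Each element of `queries` could then be passed as the `body` of a request once the submission mechanism itself is working.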
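Clicking the Run Report button displays the data in an interactive table. All rows of the data (or at least 100000 of them) can be retrieved by selecting "All" rows (p200_interactive_report_row_select = "100000"). I don't need to download the data per se; it would be fine if the data could simply be scraped from the page and saved as an object.
If the best practice is to download the data each time, I would need to select "1. Detailed Report" (p200_interactive_report_saved_reports), then click Actions > Download, which opens a pop-up where I would choose "CSV".
Research
I have been searching the internet for ideas on how to do this, but I have not gotten anything to work well yet. I made some progress following this StackOverflow post (Download a file after filling out a form using R), but it is still not quite right, and I think the site they were querying was much simpler.
Environment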
library(httr)
# Warning message:
# package ‘httr’ was built under R version 3.4.3
library(tidyverse)
# -- Attaching packages ----------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
# v ggplot2 2.2.1 v purrr 0.2.4
# v tibble 2.0.1 v dplyr 0.8.0.1
# v tidyr 0.7.2 v stringr 1.3.1
# v readr 1.1.1 v forcats 0.2.0
# -- Conflicts -------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
# Warning messages:
# 1: package ‘tidyverse’ was built under R version 3.4.4
# 2: package ‘ggplot2’ was built under R version 3.4.4
# 3: package ‘tibble’ was built under R version 3.4.4
# 4: package ‘tidyr’ was built under R version 3.4.3
# 5: package ‘purrr’ was built under R version 3.4.3
# 6: package ‘dplyr’ was built under R version 3.4.4
# 7: package ‘stringr’ was built under R version 3.4.4
R.version.string
#[1] "R version 3.4.1 (2017-06-30)"
Attempted code
POST(
url = "https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::",
encode = "multipart",
body = list(
P200_YEAR = "2011",
P200_GEO_LOV = "1025", #state
P200_GEOGRAPHY = "01", #Alabama
P200_SPECIES = "ALL_SP",
p200_go = "",
p200_interactive_report_saved_reports = "301319959260891033", # 1. Detailed Report
p200_interactive_report_row_select = 100000 #"All"
), verbose()
) -> res
# -> POST /apexfoss/f?p=215:200:9856386363065::NO::: HTTP/1.1
# -> Host: foss.nmfs.noaa.gov
# -> User-Agent: libcurl/7.56.0 r-curl/3.0 httr/1.3.1
# -> Accept-Encoding: gzip, deflate
# -> Accept: application/json, text/xml, application/xml, */*
# -> Content-Length: 833
# -> Content-Type: multipart/form-data; boundary=------------------------eb5b05624e7c8110
# ->
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_YEAR"
# >>
# >> 2011
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_GEO_LOV"
# >>
# >> 1025
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_GEOGRAPHY"
# >>
# >> 01
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_SPECIES"
# >>
# >> ALL_SP
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_go"
# >>
# >>
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_interactive_report_saved_reports"
# >>
# >> 301319959260891033
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_interactive_report_row_select"
# >>
# >> 1e+05
# >> --------------------------eb5b05624e7c8110--
#
# <- HTTP/1.1 200
# <- Date: Mon, 12 Aug 2019 03:37:12 GMT
# <- Server: Apache
# <- Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
# <- Access-Control-Allow-Origin: floatplan.noaa.gov
# <- Cache-Control: no-store
# <- Pragma: no-cache
# <- Expires: Sun, 27 Jul 1997 13:00:00 GMT
# <- Content-Type: application/json
# <- Transfer-Encoding: chunked
res
# Response [https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::]
# Date: 2019-08-12 03:37
# Status: 200
# Content-Type: application/json
# Size: 39 B
# {
# "error":"Your session has expired"
# }
out <- content(res)
out
# $error
# [1] "Your session has expired"
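A side note on the verbose log above: the `p200_interactive_report_row_select` field went out as `1e+05` rather than `100000`, because the value was passed as a number and converted to text in scientific notation. Whether the server cares is not known, and this is unrelated to the session error, but the field can be kept in plain digits by passing a string. A minimal sketch using base R only:

```r
# The numeric value used in the POST body.
row_select_num <- 100000

# Two ways to get the plain-digit string "100000" for the form field:
# format() with scientific notation switched off, or an integer literal.
row_select_chr <- format(row_select_num, scientific = FALSE, trim = TRUE)
row_select_int <- as.character(100000L)

print(row_select_chr)
print(row_select_int)
```

Writing the value directly as `"100000"` in the `body` list has the same effect.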
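What I need...
To be able to fill out the html/xml form from R, submit it, and get the resulting data into my environment as a data.frame. It is fine whether that is done by simply scraping the page or by having me download the data. Also, I am not sure what the "Your session has expired" error message means.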
Any help is greatly appreciated! Thank you very much for your time!
Solution
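One possible reading of the "Your session has expired" error, offered as an assumption rather than a confirmed diagnosis: Oracle APEX page URLs follow the pattern `f?p=App:Page:Session:Request:Debug:ClearCache:itemNames:itemValues:PrinterFriendly`, so the third component of the URL used in the POST call would be a session id copied from an earlier browser visit. A request from R that reuses that id, without the cookies of the browser session it belonged to, could plausibly be rejected as expired. The sketch below only splits the URL into its components to make the hardcoded session id visible; `parse_apex_url` is a hypothetical helper, not part of httr.

```r
# Hypothetical helper: split an APEX "f?p=" URL into its named components.
parse_apex_url <- function(url) {
  fields <- c("app_id", "page_id", "session", "request", "debug",
              "clear_cache", "item_names", "item_values", "printer_friendly")
  # Drop everything up to and including "?p=".
  p <- sub("^[^?]*\\?p=", "", url)
  parts <- strsplit(p, ":", fixed = TRUE)[[1]]
  # strsplit() drops trailing empty pieces, so pad to the full field count.
  parts <- c(parts, rep("", length(fields)))[seq_along(fields)]
  setNames(parts, fields)
}

parts <- parse_apex_url(
  "https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::"
)
print(parts[c("app_id", "page_id", "session", "debug")])
```

If this reading is right, a working approach would probably need to request the page first without a session id, keep the cookies from that response, and reuse whatever session id and hidden form fields the server hands back, rather than the values copied from the browser. The exact submission endpoint and fields would still have to be read from the browser's network tab.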