Fill out an html/xml form with R and download the report file

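Question

Goal

I am trying to automatically fill out this form and download commercial fisheries landings data (landings = https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200::::::, view source = view-source:https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::) through R.

Variables (I think these are all of them?)

Below are some example inputs I need to select on the site. I tried to identify what the user sees, [User Selection Box Name] = [Example User Selection], and what is read behind the scenes (from devtools and the page source), ([HTML ID] = [Example HTML selection code]). Hopefully I got these right.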

Year = "2011" (P200_YEAR = "2011")

Region Type = "State" (P200_GEO_LOV = "1025")

State Landed = "Alabama" (P200_GEOGRAPHY = "01")

Species = "All Species" (P200_SPECIES = "ALL_SP")

Run Report = [button] (P200_go = "")
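One way to double-check the mapping above is to list every form control together with both its `id` and its `name`. Browsers submit form values under the `name` attribute, not the `id`, so the two are worth comparing. The following is a minimal sketch that assumes the xml2 and rvest packages are installed; the HTML snippet, including the `p_t01`-style names and the hidden field, is a made-up stand-in and not the real page source.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the page source; the real markup may differ.
page_html <- '<form id="demoForm">
  <select id="P200_YEAR" name="p_t01">
    <option value="2011" selected>2011</option>
  </select>
  <select id="P200_GEO_LOV" name="p_t02">
    <option value="1025" selected>States</option>
  </select>
  <input type="hidden" id="pInstance" name="p_instance" value="0">
</form>'

page <- read_html(page_html)

# Collect every input and select element inside the form, in document order.
fields <- html_nodes(page, "form input, form select")

# Put tag, id, and name side by side so mismatches are easy to spot.
field_info <- data.frame(
  tag  = html_name(fields),
  id   = html_attr(fields, "id"),
  name = html_attr(fields, "name"),
  stringsAsFactors = FALSE
)
print(field_info)
```

Against the real site, `page_html` would be replaced by the page fetched in the same R session. Whether the ids and names actually differ there is not something this sketch can tell.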

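Clicking the Run Report button displays the data in an interactive table. We can get all rows of data (or at least 100000 rows) by selecting "All" rows (p200_interactive_report_row_select = "100000"). I don't need to download the data per se. If the data can be scraped from the page and saved as an object, that would be great.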
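If the report table is present in the HTML that comes back, turning it into a data.frame could look like the sketch below. It assumes the xml2 and rvest packages; the table markup, the `report` id, and the rows are invented for illustration. An interactive report may also be filled in by JavaScript after the page loads, in which case the table would not be in the initial HTML at all.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the rendered report; ids and values are made up.
report_html <- '<table id="report">
  <tr><th>Year</th><th>State</th><th>Species</th><th>Pounds</th></tr>
  <tr><td>2011</td><td>ALABAMA</td><td>SPECIES A</td><td>100</td></tr>
  <tr><td>2011</td><td>ALABAMA</td><td>SPECIES B</td><td>250</td></tr>
</table>'

report_page <- read_html(report_html)

# Pick the table by id and convert it; header cells become column names.
report_node <- html_node(report_page, "table#report")
landings <- as.data.frame(html_table(report_node), stringsAsFactors = FALSE)

str(landings)
```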

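If the best practice is to download the data each time, I would need to select "1. Detailed Report" (p200_interactive_report_saved_reports) and click "Actions" > "Download", which opens a pop-up where I would choose "CSV".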
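For the CSV route, once the CSV text is in hand (with httr that would be something like `content(res, as = "text", encoding = "UTF-8")`), base R can parse it without writing a file. A small sketch, with invented column names and rows standing in for the real download:

```r
# Hypothetical CSV body; the real report columns are not known here.
csv_text <- paste(
  "Year,State,Species,Pounds",
  "2011,ALABAMA,SPECIES A,100",
  "2011,ALABAMA,SPECIES B,250",
  sep = "\n"
)

# read.csv() accepts the text directly through its `text` argument.
landings_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(landings_csv)
```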

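Research

I have been looking around the internet for ideas on how to do this, but I haven't gotten anything to work well yet. I made some progress following this StackOverflow post (Download a file after filling out a form using R), but it still isn't quite right, and I think the site they were querying was much simpler.

Environment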

library(httr)
# Warning message:
# package ‘httr’ was built under R version 3.4.3

library(tidyverse)
# -- Attaching packages ----------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
# v ggplot2 2.2.1       v purrr   0.2.4  
# v tibble  2.0.1       v dplyr   0.8.0.1
# v tidyr   0.7.2       v stringr 1.3.1  
# v readr   1.1.1       v forcats 0.2.0  
# -- Conflicts -------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
# x dplyr::filter() masks stats::filter()
# x dplyr::lag()    masks stats::lag()
# Warning messages:
# 1: package ‘tidyverse’ was built under R version 3.4.4 
# 2: package ‘ggplot2’ was built under R version 3.4.4 
# 3: package ‘tibble’ was built under R version 3.4.4 
# 4: package ‘tidyr’ was built under R version 3.4.3 
# 5: package ‘purrr’ was built under R version 3.4.3 
# 6: package ‘dplyr’ was built under R version 3.4.4 
# 7: package ‘stringr’ was built under R version 3.4.4 

R.version.string
#[1] "R version 3.4.1 (2017-06-30)"

Attempted code

res <- POST(
  url = "https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::",
  encode = "multipart",
  body = list(
    P200_YEAR = "2011",
    P200_GEO_LOV = "1025",   # State
    P200_GEOGRAPHY = "01",   # Alabama
    P200_SPECIES = "ALL_SP", # All Species
    p200_go = "",            # Run Report button
    p200_interactive_report_saved_reports = "301319959260891033", # 1. Detailed Report
    # "All" rows; passed as a number, so it goes out as 1e+05 (see the request log below)
    p200_interactive_report_row_select = 100000
  ),
  verbose()
)

# -> POST /apexfoss/f?p=215:200:9856386363065::NO::: HTTP/1.1
# -> Host: foss.nmfs.noaa.gov
# -> User-Agent: libcurl/7.56.0 r-curl/3.0 httr/1.3.1
# -> Accept-Encoding: gzip, deflate
# -> Accept: application/json, text/xml, application/xml, */*
# -> Content-Length: 833
# -> Content-Type: multipart/form-data; boundary=------------------------eb5b05624e7c8110
# -> 
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_YEAR"
# >> 
# >> 2011
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_GEO_LOV"
# >> 
# >> 1025
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_GEOGRAPHY"
# >> 
# >> 01
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="P200_SPECIES"
# >> 
# >> ALL_SP
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_go"
# >> 
# >> 
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_interactive_report_saved_reports"
# >> 
# >> 301319959260891033
# >> --------------------------eb5b05624e7c8110
# >> Content-Disposition: form-data; name="p200_interactive_report_row_select"
# >> 
# >> 1e+05
# >> --------------------------eb5b05624e7c8110--
# 
# <- HTTP/1.1 200 
# <- Date: Mon, 12 Aug 2019 03:37:12 GMT
# <- Server: Apache
# <- Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
# <- Access-Control-Allow-Origin: floatplan.noaa.gov
# <- Cache-Control: no-store
# <- Pragma: no-cache
# <- Expires: Sun, 27 Jul 1997 13:00:00 GMT
# <- Content-Type: application/json
# <- Transfer-Encoding: chunked
res

# Response [https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::]
# Date: 2019-08-12 03:37
# Status: 200
# Content-Type: application/json
# Size: 39 B
# {
#   "error":"Your session has expired"
# }
out <- content(res)
out

# $error
# [1] "Your session has expired"
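A possible lead on the error, offered as an assumption rather than a confirmed diagnosis: Oracle APEX `f?p=` URLs are colon-separated in the order App:Page:Session:Request:Debug:ClearCache:ItemNames:ItemValues:PrinterFriendly. Under that reading, `9856386363065` in the URL posted above is a session ID, most likely copied from a browser session that the R client does not share. The sketch below only takes the URL apart and rebuilds it without the session field, like the landing URL quoted in the Goal section. The helper names `parse_apex_url` and `drop_apex_session` are hypothetical, and no request is sent.

```r
# Split an APEX "f?p=" URL into its first five colon-separated fields.
parse_apex_url <- function(url) {
  query <- sub("^.*f\\?p=", "", url)
  parts <- strsplit(query, ":", fixed = TRUE)[[1]]
  length(parts) <- max(length(parts), 5)  # pad short URLs with NA
  parts[is.na(parts)] <- ""
  fields <- parts[1:5]
  names(fields) <- c("app", "page", "session", "request", "debug")
  as.list(fields)
}

# Rebuild the URL with only the app and page fields, leaving the session out.
drop_apex_session <- function(url) {
  p <- parse_apex_url(url)
  base <- sub("f\\?p=.*$", "", url)
  paste0(base, "f?p=", p$app, ":", p$page)
}

attempted <- "https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200:9856386363065::NO:::"

parse_apex_url(attempted)$session
# [1] "9856386363065"

drop_apex_session(attempted)
# [1] "https://foss.nmfs.noaa.gov/apexfoss/f?p=215:200"
```

Whether the server then issues a fresh session to the R client, and which hidden fields a later POST would have to carry, would still need to be checked against the live site.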

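I need...

To be able to fill out an html/xml form with R, submit it, and get the resulting data into my environment as a data.frame. Whether this is done by simply scraping the page or requires me to download the data, either is fine. Also, I'm not sure what the "Your session has expired" error message means.

Any help is greatly appreciated! Thank you so much for your time!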

Tags: html, r, xml, shiny, httr
