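python - Recreating a Python mechanize script in R
Problem description
I want to recreate the Python script below, which uses mechanize and http.cookiejar, in R. I thought this would be straightforward with rvest, but I have not been able to do it. Any insight into which packages to use and how to apply them would be very helpful. I realize reticulate is one possibility, but I assume there must be a direct way to do this in R.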
import mechanize
import http.cookiejar
b = mechanize.Browser()
b.set_handle_refresh(True)
b.set_debug_redirects(True)
b.set_handle_redirect(True)
b.set_debug_http(True)
cj = http.cookiejar.CookieJar()
b.set_cookiejar(cj)
b.addheaders = [
('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
('Host', 'www.fangraphs.com'),
('Referer', 'https://www.fangraphs.com/auctiontool.aspx?type=pit&proj=atc&pos=1,1,1,1,5,1,1,0,0,1,5,5,0,18,0&dollars=400&teams=12&mp=5&msp=5&mrp=5&mb=1&split=&points=c|0,1,2,3,4,5|0,1,2,3,4,5&lg=MLB&rep=0&drp=0&pp=C,SS,2B,3B,OF,1B&players=')
]
b.open("https://www.fangraphs.com/auctiontool.aspx?type=pit&proj=atc&pos=1,1,1,1,5,1,1,0,0,1,5,5,0,18,0&dollars=400&teams=12&mp=5&msp=5&mrp=5&mb=1&split=&points=c|0,1,2,3,4,5|0,1,2,3,4,5&lg=MLB&rep=0&drp=0&pp=C,SS,2B,3B,OF,1B&players=")
# predicate used to pick the form whose id is "form1"
def is_form1_form(form):
    return "id" in form.attrs and form.attrs['id'] == "form1"
b.select_form(predicate=is_form1_form)
b.form.find_control(name='__EVENTTARGET').readonly = False
b.form.find_control(name='__EVENTARGUMENT').readonly = False
b.form['__EVENTTARGET'] = 'AuctionBoard1$cmdCSV'
b.form['__EVENTARGUMENT'] = ''
print(b.submit().read())
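The R code I used to try to recreate it with rvest is below. The comments point out the main sources of my confusion. In particular, when I grab the form with rvest, the fields that the Python code selects do not show up, and when I try to insert them manually, I get a Connection Refused error on submit.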
library(rvest)
atc.pitcher.link = "https://www.fangraphs.com/auctiontool.aspx?type=pit&proj=atc&pos=1,1,1,1,5,1,1,0,0,1,5,5,0,18,0&dollars=400&teams=12&mp=5&msp=5&mrp=5&mb=1&split=&points=c|0,1,2,3,4,5|0,1,2,3,4,5&lg=MLB&rep=0&drp=0&pp=C,SS,2B,3B,OF,1B&players="
proj.data = html_session(atc.pitcher.link)
form.unfilled = proj.data %>% html_node("form") %>% html_form()
# note: I am surprised "__EVENTTARGET" and "__EVENTARGUMENT" are not included as fields of the unfilled form. I can select them in the posted Python script.
# If I try to create them with the appropriate values, I get a Connection Refused error.
form.unfilled[[5]]$`__EVENTTARGET` = form.unfilled[[5]]$`__VIEWSTATE`
form.unfilled[[5]]$`__EVENTARGUMENT`= form.unfilled[[5]]$`__VIEWSTATE`
form.unfilled[[5]]$`__EVENTTARGET`$readonly = FALSE
form.unfilled[[5]]$`__EVENTTARGET`$value = "AuctionBoard1$cmdCSV"
form.unfilled[[5]]$`__EVENTARGUMENT`$value = ""
form.unfilled[[5]]$`__EVENTARGUMENT`$readonly = FALSE
form.filled = form.unfilled
session = submit_form(proj.data, form.filled)
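Solution
Here is an approach that uses RSelenium with Chrome set to headless and with remote downloads enabled into your working directory. It starts a headless browser automatically and then lets the code drive it.
I believe that to do the same thing with rvest, you would need to write some native PhantomJS.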
library(RSelenium)
library(wdman)
eCaps <- list(
chromeOptions = list(
args = c('--headless','--disable-gpu', '--window-size=1280,800'),
prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = getwd()
)
)
)
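# start the driver and open a browser session with the headless options above
cDrv <- wdman::chrome()
rD <- RSelenium::rsDriver(extraCapabilities = eCaps)
remDr <- rD$client
# enable downloads in the headless session, saving into the working directory
remDr$queryRD(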
ipAddr = paste0(remDr$serverURL, "/session/", remDr$sessionInfo[["id"]], "/chromium/send_command"),
method = "POST",
qdata = list(
cmd = "Page.setDownloadBehavior",
params = list(
behavior = "allow",
downloadPath = getwd()
)
)
)
atc.pitcher.link= "http://www.fangraphs.com/auctiontool.aspx?type=pit&proj=atc&pos=1,1,1,1,5,1,1,0,0,1,5,5,0,18,0&dollars=400&teams=12&mp=5&msp=5&mrp=5&mb=1&split=&points=c|0,1,2,3,4,5|0,1,2,3,4,5&lg=MLB&rep=0&drp=0&pp=C,SS,2B,3B,OF,1B&players="
remDr$navigate(atc.pitcher.link)
# sleep to be nice and give things time to load
Sys.sleep(8)
# find the button on the page that we want to click
option <- remDr$findElement('id', 'AuctionBoard1_cmdCSV')
# click it
option$clickElement()
list.files(getwd(),pattern = 'sysdata')
remDr$closeall()
cDrv$stop()
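For comparison, the form submission that the mechanize script in the question performs can be sketched without a browser. The sketch below is only an illustration of the idea, not a tested replacement for the RSelenium approach: it collects the hidden inputs from a page, overrides `__EVENTTARGET` and `__EVENTARGUMENT` the way the Python script does, and URL-encodes the result as a POST body. The class name `HiddenInputCollector`, the function `build_postback_body`, and the sample HTML fragment are hypothetical and made up for this example; whether the site accepts a request built this way is not verified here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(html, event_target, event_argument=""):
    """Return a URL-encoded body: hidden fields plus the two event fields."""
    collector = HiddenInputCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    # Same overrides as b.form['__EVENTTARGET'] / b.form['__EVENTARGUMENT']
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical page fragment, for illustration only (not the real page).
sample_html = """
<form id="form1" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTTARGET" value="" />
</form>
"""

body = build_postback_body(sample_html, "AuctionBoard1$cmdCSV")
print(body)
```

The resulting string is what would go into the body of a POST request to the form's URL, together with the cookies and headers from the initial page load. The same steps (read the hidden fields, set the two event fields, POST them back) could be expressed with an HTTP client in R, which is an assumption rather than something confirmed in this thread.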