r - Configuring random proxies for scraping with R

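Question

I scrape a website whose robots rules permit scraping, but I sometimes get blocked.

While I contact the administrator to find out why, I would like to understand how to use different proxies in R so that I can keep scraping without being blocked.

I followed this quick tutorial: https://support.rstudio.com/hc/en-us/articles/200488488-Configuring-R-to-Use-an-HTTP-or-HTTPS-Proxy

So I edited the environment file: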

file.edit('~/.Renviron')

In it I inserted a proxy picked at random from a list:

proxies_list <- c("128.199.109.241:8080","113.53.230.195:3128","125.141.200.53:80","125.141.200.14:80","128.199.200.112:138","149.56.123.99:3128","128.199.200.112:80","125.141.200.39:80","134.213.29.202:4444")
proxy <-paste0('https://', sample(proxies_list, 1))
https_proxy=proxy

But when I use this code:

download.file(url_proxy, destfile ='output.html',quiet = TRUE)
html_output <- read_html('output.html')

I am still blocked.

Am I not setting the proxy correctly?

Thanks! M.

Tags: r, http, web-scraping, proxy

You need to set environment variables, not R variables. See ?download.file for more details.

For example, call

Sys.setenv(http_proxy=proxy)

before anything else happens. Also note the warning in the documentation:

These environment variables must be set before the download code is
first used: they cannot be altered later by calling 'Sys.setenv'.
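Putting the pieces together, a minimal sketch of the random-proxy selection from the question might look like the following. The helper name set_random_proxy is hypothetical, the addresses are the question's placeholders (they may no longer be live), and the "http://" scheme assumes the proxies accept plain HTTP proxy connections. Because of the warning above, this would need to run at the start of a fresh R session, before any download call.

```r
# Proxy addresses taken from the question; placeholders that may no longer work.
proxies_list <- c("128.199.109.241:8080",
                  "113.53.230.195:3128",
                  "125.141.200.53:80")

# Hypothetical helper: pick one proxy at random and export it as an
# environment variable (not as an ordinary R variable).
set_random_proxy <- function(proxies) {
  # Assumption: the proxy URL uses the "http://host:port" form.
  proxy <- paste0("http://", sample(proxies, 1))
  Sys.setenv(http_proxy = proxy, https_proxy = proxy)
  invisible(proxy)
}

# Call this before the first download in the session.
chosen <- set_random_proxy(proxies_list)

# Confirm that the environment (not just an R variable) now holds the proxy.
Sys.getenv("http_proxy")
```

After this, a call such as download.file(url_proxy, destfile = 'output.html', quiet = TRUE) should pick up the proxy from the environment. Whether the block disappears still depends on the proxies themselves being reachable and on how the site decides whom to block.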
