java - curl and scraping a website both give "You don't have permission to access"
Problem description
I want to scrape this website using the jsoup Java library. My code is as follows:
```java
private String crawl() {
    Document doc = null;
    try {
        doc = Jsoup.connect(getUrl()).headers(getRequestHeaders()).get();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return doc.body().text();
}

private String getUrl() {
    return "https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" +
            "amount=1&" +
            "fee=3&" +
            "fromCurr=IDR" +
            "&toCurr=USD" +
            "&submitButton=Calculate+exchange+rate";
}

private Map<String, String> getRequestHeaders() {
    Map<String, String> headers = new HashMap<>();
    headers.put("authority", "usa.visa.com");
    headers.put("cache-control", "max-age=0");
    headers.put("upgrade-insecure-requests", "1");
    headers.put("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36");
    headers.put("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
    headers.put("accept-encoding", "gzip, deflate, br");
    headers.put("accept-language", "en-US,en;q=0.9");
    return headers;
}
```
If I run the scraper locally, it works fine. However, when I deploy the code to an AWS Lambda function, I get an Access Denied page:
```html
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" on this server.<P>
Reference #18.de174b17.1561156615.19dc81c4
</BODY>
</HTML>
```
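To rule out jsoup itself, the same request can be reproduced with the JDK's built-in `HttpClient` (Java 11+). The sketch below (my own code, not part of the original question) only builds the request so the outgoing headers can be inspected; actually sending it would be `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and then checking `statusCode()`:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RequestProbe {

    // Build the same GET request with the java.net.http API so the exact
    // headers that go on the wire can be inspected before sending.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                // HttpClient sets Host and Connection itself, and "authority"
                // is a copy of the browser's :authority pseudo-header, so
                // neither is set here.
                .header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
                .header("Accept-Language", "en-US,en;q=0.9")
                .header("Upgrade-Insecure-Requests", "1")
                // Accept-Encoding is omitted: unlike jsoup, HttpClient does
                // not transparently gunzip responses, so requesting gzip here
                // would mean decompressing the body manually.
                .GET()
                .build();
    }
}
```

If this also returns the same Access Denied body, the block is happening server-side regardless of which HTTP client is used.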
When I run the equivalent curl command below locally, it gives me the same error:
```shell
curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?amount=1&fee=3&fromCurr=IDR&toCurr=USD&submitButton=Calculate+exchange+rate' \
  -H 'authority: usa.visa.com' \
  -H 'cache-control: max-age=0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' \
  -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-US,en;q=0.9' \
  --compressed
```
I also tried sending cookies, following the answer here, but that still didn't solve the problem.
I suspect the site has some mechanism that protects it from scraping. What can I do to get around it?
Solution