curl and crawling a website both result in "You don't have permission to access"

Problem description

I want to scrape this website using the Java library jsoup.

My code is as follows:

  private String crawl() {
    Document doc = null;
    try {
      doc = Jsoup.connect(getUrl()).headers(getRequestHeaders()).get();
    } catch (Exception e) {
      e.printStackTrace();
      return "";  // avoid a NullPointerException on doc.body() below
    }

    return doc.body().text();
  }

  private String getUrl() {
    return "https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" +
        "amount=1&" +
        "fee=3&" +
        "fromCurr=IDR" +
        "&toCurr=USD" +
        "&submitButton=Calculate+exchange+rate";
  }

  private Map<String, String> getRequestHeaders() {
    Map<String, String> headers = new HashMap<>();
    headers.put("authority", "usa.visa.com");
    headers.put("cache-control", "max-age=0");
    headers.put("upgrade-insecure-requests", "1");
    headers.put("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36");
    headers.put("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
    headers.put("accept-encoding", "gzip, deflate, br");
    headers.put("accept-language", "en-US,en;q=0.9");

    return headers;
  }
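A small side note on `getUrl()`: it concatenates query parameters by hand, so any value containing spaces or reserved characters has to be pre-encoded manually (note the hand-written `+` signs in `submitButton`). A sketch using only the standard library builds the same URL safely; the `UrlBuilder` class and `withQuery` helper names are my own, not part of the original code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class UrlBuilder {
    /** Appends params to base as a query string, percent-encoding each key and value. */
    static String withQuery(String base, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("amount", "1");
        params.put("fee", "3");
        params.put("fromCurr", "IDR");
        params.put("toCurr", "USD");
        params.put("submitButton", "Calculate exchange rate"); // space is encoded as '+'
        System.out.println(withQuery(
                "https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html",
                params));
    }
}
```

For these particular values the result is identical to the hand-built string, so this does not explain the Access Denied response; it just makes the request construction less fragile.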

If I run the crawl locally, it works fine. However, when I deploy the code to an AWS Lambda function, I get an Access Denied page:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;usa&#46;visa&#46;com&#47;support&#47;consumer&#47;travel&#45;support&#47;exchange&#45;rate&#45;calculator&#46;html&#63;" on this server.<P>
Reference&#32;&#35;18&#46;de174b17&#46;1561156615&#46;19dc81c4
</BODY>
</HTML>

When I try the following curl command locally, it gives me the same error:

curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?amount=1&fee=3&fromCurr=IDR&toCurr=USD&submitButton=Calculate+exchange+rate' -H 'authority: usa.visa.com' -H 'cache-control: max-age=0' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.9' --compressed

I also tried using cookies, following an answer here, but that still didn't solve the problem.

I suspect the website has some mechanism that protects it from being scraped. What can I do to get around it?

Tags: java, amazon-web-services, curl, jsoup

Solution
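One commonly suggested direction (a sketch, not a guaranteed fix: the "Reference #…" footer in the block page resembles the block pages served by commercial bot-protection layers, which may deny automated clients regardless of headers) is that a cookieless request is often rejected while a request that replays the cookies from an initial page load succeeds. With jsoup that means calling `Jsoup.connect(url).execute()` once, reading `response.cookies()`, and passing that map back via `.cookies(...)` on the real request. The standard-library helper below only demonstrates the cookie-replay formatting; the `CookieReplay` class and the cookie names in `main` are hypothetical placeholders, not values confirmed for this site:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CookieReplay {
    /** Formats a cookie map into a single "Cookie" request-header value. */
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // In real code the map would come from an initial jsoup request, e.g.:
        //   Connection.Response res = Jsoup.connect("https://usa.visa.com/").execute();
        //   Map<String, String> cookies = res.cookies();
        //   Jsoup.connect(getUrl()).cookies(cookies).headers(getRequestHeaders()).get();
        // Placeholder names below are for illustration only.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session", "example-value");
        cookies.put("consent", "accepted");
        System.out.println(toCookieHeader(cookies));
    }
}
```

If the Lambda deployment is still blocked while the same code works locally, the difference may also be the source IP range (cloud provider IPs are commonly blocked), which no header or cookie change will fix.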

