How to get page content past Cloudflare and reCAPTCHA

Problem description

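I want to scrape a page through a proxy. I use cfscrape to load the page and get past Cloudflare (the first "challenge"), but then the page presents a reCAPTCHA to check whether I am human. This is where I am stuck: I think I need to pass the user agent and cookies (or maybe there is a mistake in my code), but I don't know how to do it.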

    import cfscrape
    from bs4 import BeautifulSoup

    link = "https://www.oneblockdown.it/en/footwear-sneakers/adidas/men-unisex/adidas-originals-yeezy-boost-350-v2/9438"

    # Defined but not yet passed to the request (this is what the question is about).
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    }

    def fetch_page(proxy_list, use_proxies=True):
        # get_proxy is my own helper (not shown); it reads the proxies from a file.
        proxies = get_proxy(proxy_list) if use_proxies else None
        scraper = cfscrape.create_scraper()  # returns a CloudflareScraper instance

        if proxies:
            print("[Proxy]: " + proxies['http'])
        try:
            r = scraper.get(link, proxies=proxies)
        except Exception:
            print("Connection to URL <" + link + "> failed.")
            return

        soup = BeautifulSoup(r.text, 'html.parser')
        print(soup.prettify())

The response printed by the last statement looks like this:

'''

<script src="https://www.google.com/recaptcha/api.js?hl=" type="text/javascript">
  </script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.js" type="text/javascript">
  </script>
 </head>
 <body>
  <div class="g-recaptcha" data-callback="getCaptchaResult" data-sitekey="6Le49hgUAAAAAIv3wrILeXIrOSdM3_5oxK4C6m48" data-size="invisible">
  </div>
  <script type="text/javascript">
   window.onload = function () { grecaptcha.execute(); };
function getCaptchaResult(response) {
    $.post("/index.php", {action: "captcha_verify", captcha: response, version: 37}, function(result){
        var timeout = result ? 0 : 2500;
        setTimeout(function() {
            window.location.reload();
        }, timeout);
    });
}
  </script>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","licenseKey":"97b599ea8e","applicationID":"23522071","transactionName":"YFxXbENSCxEFUhVfWlkWdk1CRwoPS1cOWUFAXFRKHEALBwVaBERGGFhRUVVSFg==","queueTime":0,"applicationTime":54,"atts":"TBtUGgtIGB8=","errorBeacon":"bam.nr-data.net","agent":""}
  </script>
 </body>
</html>

'''

I need to prove that I am human. How can I get past this challenge?

Tags: python, web-scraping, beautifulsoup

Solution
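
Judging from the printed response, the returned page is not the product page: it loads an invisible reCAPTCHA, posts the resulting token to /index.php with action "captcha_verify", and then reloads. A client that does not execute JavaScript never produces that token, so a reasonable first step is to detect this page and react to it (for example, retry with another proxy or hand the URL to a real browser) rather than parse it as content. Below is a minimal sketch of that detection using only the standard library. The names `RecaptchaDetector` and `find_recaptcha_sitekey` are hypothetical helpers, not part of cfscrape or BeautifulSoup, and the sketch does not solve the CAPTCHA.

```python
from html.parser import HTMLParser


class RecaptchaDetector(HTMLParser):
    """Records the data-sitekey of the first element with class g-recaptcha."""

    def __init__(self):
        super().__init__()
        self.sitekey = None  # stays None when no widget is found

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and self.sitekey is None:
            # An empty string means the widget exists but carries no sitekey.
            self.sitekey = attr_map.get("data-sitekey") or ""


def find_recaptcha_sitekey(html):
    """Return the widget's sitekey if the page embeds reCAPTCHA, else None."""
    detector = RecaptchaDetector()
    detector.feed(html)
    return detector.sitekey
```

In the code from the question, this could be called right after the request, for example `if find_recaptcha_sitekey(r.text) is not None:` followed by whatever fallback fits (log it, rotate the proxy, or stop), before passing `r.text` to BeautifulSoup.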
