Scraping problem (access denied)

Problem description

I need to retrieve all the information from a website (I removed the link because someone closed my question yesterday).

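I have been working on this for two weeks. Three days ago, when I opened the site directly in Chrome, it asked me to confirm that I was not a robot because unusual activity had been detected from my IP address. (I don't remember the exact wording, but that was the gist.)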

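Today I was extracting some data (a simple list of links), and on the second run of my code I noticed that the list was empty. So I checked the result of requests.get('**site**') and found that the HTML was different from the page I had been getting over the past few days.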

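Finally, I tried to access the site directly from the browser. It opened a page on the site, but the page was blank apart from the message "Access denied".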

I also tried adding a User-Agent header to the request, but access is still denied. Here is the simple script:

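import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; 'site' stands in for the removed URL.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Gecko/20100101 Chrome/72.0.3626.121'}
# A timeout keeps the script from hanging if the server never answers.
r = requests.get('site', headers=headers, timeout=10)
obj = BeautifulSoup(r.text, 'html.parser')
print(obj)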

Here is the output:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="noindex,nofollow" name="robots"/>
<title>AZLyrics - request for access</title>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet"/>
<link href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.min.css" rel="stylesheet"/>
<link href="/bsaz.css" rel="stylesheet"/>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
<script async="" defer="" src="https://www.google.com/recaptcha/api.js"></script>
<script crossorigin="anonymous" integrity="sha256-ZosEbRLbNQzLpnKIkEdrPv7lOy9C27hHQ+Xp8a4MxAQ=" src="https://code.jquery.com/jquery-1.12.4.min.js"></script>
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<script type="text/javascript">
    <!-- 
      if (top.location != self.location) {
      top.location = self.location.href
     }
    //--> 
    function az_recaptcha_success(){
        document.getElementById("az_unblock").submit();
    }
    </script>
</head>
<body>
<nav class="navbar navbar-default navbar-static-top text-center">
<div class="container text-center">
<div class="navbar-header" style="float:none; display:inline-block;">
<a class="navbar-brand" href="https://www.azlyrics.com"><img alt="AZLyrics.com" class="pull-left" src="/az_logo_tr.png" style="max-height:40px; margin-top:-10px;"/></a>
</div>
</div><!-- /.container -->
</nav>
<!-- top ban -->
<!--  <div class="lboard-wrap">
  <div class="container">
    <div class="row">
       <div class="col-xs-12 top-ad text-center">
         <span id="cf_banner_top_nofc"></span>
       </div>
    </div>
  </div>
  </div> -->
<!-- main -->
<div class="container main-page">
<div class="row">
<div class="col-xs-12 col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 text-center">
<div class="alert alert-danger" role="alert">
                Access denied.
            </div>
</div>
</div>
</div>
</body></html>
 <!-- container main-page -->
<!-- bot ban -->
<!--<div class="lboard-wrap">
  <div class="container">
    <div class="row">
       <div class="col-xs-12 top-ad text-center">
          <span id="cf_banner_bottom"></span>
       </div>
    </div>
  </div>
  </div>-->
<!-- footer -->
<!--<nav class="navbar navbar-footer">
          <div class="container text-center">
          <ul class="nav navbar-nav navbar-center">
            <li><a href="//www.azlyrics.com/adv.html">Advertise Here</a></li>
            <li><a href="//www.azlyrics.com/privacy.html">Privacy Policy</a></li>
            <li><a href="//www.azlyrics.com/cookie.html">Cookie Policy</a></li>
            <li><a href="//www.azlyrics.com/copyright.html">DMCA Policy</a></li>
          </ul>
          </div> 
     </nav>-->
<div class="footer-wrap">
<div class="container">
<small>
<script type="text/javascript">
                curdate=new Date();
                document.write("<strong>Copyright &copy; 2000-"+curdate.getFullYear()+" AZLyrics.com<\/strong>");
             </script>
</small>
</div>
</div>
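
The empty link list described earlier is easier to diagnose if the script recognises this block page instead of parsing it as normal content. The following is a minimal sketch, assuming the block page can be identified by the two markers visible in the output above (a title containing "request for access" and an "alert-danger" div containing "Access denied"). The function name looks_blocked is hypothetical, and the sketch uses only the standard library so that it runs without extra packages.

```python
from html.parser import HTMLParser


class _BlockPageParser(HTMLParser):
    """Collects the <title> text and the text inside any 'alert-danger' div."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.alert_text = ""
        self._in_title = False
        self._alert_depth = 0  # > 0 while inside an alert-danger <div>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._in_title = True
        elif tag == "div" and (self._alert_depth or "alert-danger" in classes):
            # Count nested divs so we know when the alert div really ends.
            self._alert_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "div" and self._alert_depth:
            self._alert_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._alert_depth:
            self.alert_text += data


def looks_blocked(html):
    """Return True if the HTML resembles the 'Access denied' page shown above.

    Assumption: the two markers are taken from this one sample response
    and may not cover every variant the site serves.
    """
    parser = _BlockPageParser()
    parser.feed(html)
    parser.close()
    return ("request for access" in parser.title.lower()
            or "access denied" in parser.alert_text.lower())
```

In the script above this could be called as looks_blocked(r.text) before building the link list, so that a blocked response is reported rather than silently producing an empty result.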

Do you think there is anything I can do to keep working with this site?

Tags: python, beautifulsoup, python-requests

Solution
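
No answer text survives here. Since the question reports that the site flagged unusual activity from the asker's IP address, one cautious approach is to space requests out and to stop as soon as the block page appears. The sketch below illustrates that idea only; it is not taken from the original thread and cannot lift a block that is already in place. The function name fetch_politely, the 5-second default delay, and the "Access denied" marker check are all assumptions made for illustration.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, is_blocked=None):
    """Fetch URLs one at a time with a pause between requests.

    urls          -- iterable of URLs to fetch, in order
    fetch         -- callable taking a URL and returning the page HTML
    delay_seconds -- pause between consecutive requests (arbitrary default)
    is_blocked    -- callable taking HTML and returning True for a block page

    Returns a dict of URL -> HTML for the pages fetched before any block
    page was seen. Fetching stops at the first blocked response.
    """
    if is_blocked is None:
        # Assumed marker, taken from the block page quoted in the question.
        is_blocked = lambda html: "Access denied" in html

    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request but the first
        html = fetch(url)
        if is_blocked(html):
            break  # stop instead of hammering a site that is refusing us
        pages[url] = html
    return pages


# Example wiring with requests (not executed here; needs network access):
#
#   import requests
#   headers = {'User-Agent': '...'}
#   pages = fetch_politely(
#       link_list,
#       fetch=lambda u: requests.get(u, headers=headers, timeout=10).text,
#   )
```

Passing fetch as a parameter keeps the pacing logic separate from the HTTP library, so the same loop can be tried with a stub function before it is pointed at the real site.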

