Scraping a given URL with BeautifulSoup (Python3)

Problem description

Hi!

I'm trying to write a web crawler in Python. The goal is (eventually) to iterate over a list of domains, scrape every local URL within those domains, and dump all of the HTML content back to my host machine.

For now, I'm pointing the script at a single domain (http://www.scrapethissite.com). I've defined a queue, plus sets of new/processed/local/foreign/broken URLs. Judging by the output, the script reads the page's HTML and picks up the first local URL ('https://scrapethissite.com/lessons/'), but fails to process that URL by adding it to the queue.

Ideally, the script would write the output for each input URL into its own directory, preserving each TLD's DOM. But I'd be happy just to get the queue working.

Here's my code:

from bs4 import BeautifulSoup
import requests
import requests.exceptions 
from collections import deque
from urllib.parse import urlsplit, urlparse
from urllib.request import Request, urlopen 
import requests
from html.parser import HTMLParser
import lxml


#set targets
url = 'https://scrapethissite.com'
new_urls = deque(([url]))
processed_urls = set()
local_urls = set()
foreign_urls = set()
broken_urls = set()

#process urls until we exhaust queue
while len(new_urls):   

    url = new_urls.popleft()    
    processed_urls.add(url)
    print("processing %s..." % url)
    
    try:
        response = requests.get(url).text
        print(response)
        print(new_urls)

    except(requests.exceptions.MissingSchema, requests.exceptions.ConnectionError, requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):    
        broken_urls.add(url)
        print(("this %s failed") % url)
        continue    
       

##get base URL to differentiate local and foreign addresses
    ###not working
print("differentiating addresses...")
parts = urlsplit(url)
base = "{0.netloc}".format(parts)
strip_base = base.replace("www.","")
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url



#initialize soup
print("initializing soup...")
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')

#get links in HTML 
for link in soup.find_all('a'):
    anchor = link.attrs["href"] if "href" in link.attrs else ''

#scrape page for links
print("scraping links...")
if anchor.startswith('/'):        
    local_link = base_url + anchor        
    local_urls.add(local_link)    
elif strip_base in anchor:        
    local_urls.add(anchor)    
elif not anchor.startswith('http'):        
    local_link = path + anchor        
    local_urls.add(local_link)    
else:        
    foreign_urls.add(anchor)
    print(forieng_urls)
    print(local_urls)
    
#to crawl local urls
#...and add them to sets
for i in local_urls:    
    if not i in new_urls and not i in processed_urls:        
        new_urls.append(i)
        
    
#to crawl all URLs
#for i in local_urls:
    #if not link in new_urls and not link in processed_urls:
       #new_urls.append(link)

print("new urls: %s" % new_urls)
print("processed urls: %s" % processed_urls)
print("broken urls: %s" % broken_urls)

Here's the output:

(base) $ crawler % python3 bsoup_crawl.py
processing https://scrapethissite.com...
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Scrape This Site | A public sandbox for learning web scraping</title>
    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="A public sandbox for learning web scraping">

    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">
    <link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">

    

  </head>

  <body>
    <nav id="site-nav">
            <div class="container">
                <div class="col-md-12">
                    <ul class="nav nav-tabs">
                        <li id="nav-homepage">
                            <a href="/" class="nav-link hidden-sm hidden-xs">
                                <img src="/static/images/scraper-icon.png" id="nav-logo">
                                Scrape This Site
                            </a>
                        </li>
                        <li id="nav-sandbox">
                            <a href="/pages/" class="nav-link">
                                <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                Sandbox
                            </a>
                        </li>
                        <li id="nav-lessons">
                            <a href="/lessons/" class="nav-link">
                                <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                Lessons
                            </a>
                        </li>
                        <li id="nav-faq">
                            <a href="/faq/" class="nav-link">
                                <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                FAQ
                            </a>
                        </li>
                        
                        <li id="nav-login" class="pull-right">
                            <a href="/login/" class="nav-link">
                                Login
                            </a>
                        </li>
                        
                    </ul>
                </div>
            </div>
        </nav>

        <script type="text/javascript">
            var path = document.location.pathname;
            var tab = undefined;
            if (path === "/"){
                tab = document.querySelector("#nav-homepage");
            } else if (path.indexOf("/faq/") === 0){
                tab = document.querySelector("#nav-faq");
            } else if (path.indexOf("/lessons/") === 0){
                tab = document.querySelector("#nav-lessons");
            } else if (path.indexOf("/pages/") === 0) {
                tab = document.querySelector("#nav-sandbox");
            } else if (path.indexOf("/login/") === 0) {
                tab = document.querySelector("#nav-login");
            }
            tab.classList.add("active")
        </script>

    

    <div id="homepage">

        <section id="hero">
            <div class="container">
                <div class="row">
                    <div class="col-md-12 text-center">
                        <img src="/static/images/scraper-icon.png" id="townhall-logo" />
                        <h1>
                            Scrape This Site
                        </h1>
                        <p class="lead">
                            The internet's best resource for learning <strong>web scraping</strong>.
                        </p>
                        <br><br><br>
                        <a href="/pages/" class="btn btn-lg btn-default" />Explore Sandbox</a>
                        <a href="/lessons/" class="btn btn-lg btn-primary" />
                            <i class="glyphicon glyphicon-education"></i>
                            Begin Lessons &rarr;
                        </a>
                    </div><!--.col-->
                </div><!--.row-->
            </div><!--.container-->
        </section>

    </div>


    <section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos &copy; Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
  </body>

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js" integrity="sha256-Sk3nkD6mLTMOF0EOpNtsIry+s1CsaqQC1rVLTAy+0yc= sha512-K1qjQ+NcF2TYO/eI3M6v8EiNYZfA95pQumfvcVrTHtwQVDG+aHRqLi/ETn2uB+1JqwYqVG3LIvdm9lj6imS/pQ==" crossorigin="anonymous"></script>

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.js"></script>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.css" rel="stylesheet" type="text/css">

  <!-- pnotify messages -->
  <script type="text/javascript">
    
    PNotify.prototype.options.styling = "bootstrap3";
    $(function(){
      
    });
    

    $(function () {
      $('[data-toggle="tooltip"]').tooltip()
    })
  </script>

  <!-- golbal video controls -->
  <script type="text/javascript">
    $("video").hover(function() {
        $(this).prop("controls", true);
    }, function() {
        $(this).prop("controls", false);
    });

    $("video").click(function() {
        if( this.paused){
            this.play();
        }
        else {
            this.pause();
        }
    });
    </script>

  <!-- insert google analytics here -->
  <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-41551755-8', 'auto');
    ga('send', 'pageview');
  </script>

  <!-- Facebook Pixel Code -->
  <script>
  !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
  n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
  n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
  t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
  document,'script','https://connect.facebook.net/en_US/fbevents.js');

  fbq('init', '764287443701341');
  fbq('track', "PageView");</script>
  <noscript><img height="1" width="1" style="display:none"
  src="https://www.facebook.com/tr?id=764287443701341&ev=PageView&noscript=1"
  /></noscript>
  <!-- End Facebook Pixel Code -->

  <!-- Google Code for Remarketing Tag -->
  <script type="text/javascript">
    /* <![CDATA[ */
    var google_conversion_id = 950945448;
    var google_custom_params = window.google_tag_params;
    var google_remarketing_only = true;
    /* ]]> */
    </script>
    <script type="text/javascript" src="//www.googleadservices.com/pagead/conversion.js">
    </script>
    <noscript>
    <div style="display:inline;">
    <img height="1" width="1" style="border-style:none;" alt="" src="//googleads.g.doubleclick.net/pagead/viewthroughconversion/950945448/?guid=ON&amp;script=0"/>
    </div>
  </noscript>

  <!-- Global site tag (gtag.js) - Google AdWords: 950945448 -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=AW-950945448"></script>
  <script>
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());

   gtag('config', 'AW-950945448');
  </script>

</html>
deque([])
differentiating addresses...
initializing soup...
scraping links...
new urls: deque(['https://scrapethissite.com/lessons/', <a class="btn btn-lg btn-primary" href="/lessons/"></a>])
processed urls: {'https://scrapethissite.com'}
broken urls: set()

Tags: python-3.x, web-scraping, beautifulsoup, web-crawler

Solution
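The core problem is indentation. Everything after the `try`/`except` block sits at module level, outside the `while` loop, so it runs only once, after the queue has already been drained. Likewise, the classification `if`/`elif` chain sits outside the `for link in soup.find_all('a')` loop, so only the last anchor on the page is ever classified (that's why only one stray tag shows up in the final deque). There's also a typo, `forieng_urls` for `foreign_urls`, on one of the print lines. Below is a reworked sketch that keeps the asker's classification logic but moves the per-page work inside the loop; `max_pages` is an added safety cap, not part of the original:

```python
from collections import deque
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup


def classify_link(anchor, base_url, strip_base, path):
    """Return ('local', url) or ('foreign', url) for one href value,
    using the same prefix checks as the original script."""
    if anchor.startswith('/'):
        return 'local', base_url + anchor
    if strip_base in anchor:
        return 'local', anchor
    if not anchor.startswith('http'):
        return 'local', path + anchor
    return 'foreign', anchor


def crawl(start_url, max_pages=50):
    new_urls = deque([start_url])
    processed_urls = set()
    local_urls = set()
    foreign_urls = set()
    broken_urls = set()

    # Process URLs until we exhaust the queue (or hit the page cap).
    while new_urls and len(processed_urls) < max_pages:
        url = new_urls.popleft()
        processed_urls.add(url)
        print("processing %s..." % url)

        try:
            response = requests.get(url).text
        except (requests.exceptions.MissingSchema,
                requests.exceptions.ConnectionError,
                requests.exceptions.InvalidURL,
                requests.exceptions.InvalidSchema):
            broken_urls.add(url)
            print("this %s failed" % url)
            continue

        # Everything below must stay INSIDE the while loop so that it
        # runs once per dequeued URL, not once per script run.
        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        strip_base = parts.netloc.replace("www.", "")
        path = url[:url.rfind('/') + 1] if '/' in parts.path else url

        soup = BeautifulSoup(response, 'lxml')

        # Classify EVERY anchor on the page, inside the for loop.
        for link in soup.find_all('a'):
            anchor = link.attrs.get("href", '')
            kind, resolved = classify_link(anchor, base_url, strip_base, path)
            if kind == 'local':
                local_urls.add(resolved)
            else:
                foreign_urls.add(resolved)

        # Queue unseen local URLs for crawling.
        for candidate in local_urls:
            if candidate not in new_urls and candidate not in processed_urls:
                new_urls.append(candidate)

    return processed_urls, local_urls, foreign_urls, broken_urls


# Usage (network access required):
# processed, local, foreign, broken = crawl('https://scrapethissite.com')
```

With the per-page work inside the loop, each dequeued URL is fetched, parsed, and mined for links, and newly discovered local URLs refill the queue. Dumping each page to a per-domain directory can then be layered on at the point where `response` is available.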
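Separately, the hand-rolled prefix checks can misclassify hrefs such as `mailto:` or `javascript:` links (anything that doesn't start with `http` is treated as local). A more robust alternative, sketched here with only the standard library's `urllib.parse`, resolves each href against the page it appeared on and then compares hosts; `resolve_and_classify` is a hypothetical helper name, not part of the original script:

```python
from urllib.parse import urljoin, urlsplit


def resolve_and_classify(page_url, href, site_netloc):
    """Resolve href against the page it appeared on, then compare hosts.

    site_netloc is the crawl target's host, e.g. 'scrapethissite.com'.
    Returns ('local'|'foreign'|'skip', absolute_url).
    """
    absolute = urljoin(page_url, href)   # handles '/x', '../x', 'x/' alike
    parts = urlsplit(absolute)
    if parts.scheme not in ('http', 'https'):
        return 'skip', absolute          # mailto:, javascript:, etc.
    host = parts.netloc.replace('www.', '')
    return ('local' if host == site_netloc else 'foreign'), absolute
```

This also removes the need for the separate `base_url`/`strip_base`/`path` bookkeeping, since `urljoin` performs the resolution the string slicing was approximating.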
