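python-3.x - Scraping a given URL with BeautifulSoup (Python 3)
Problem description
Hi!
I am trying to write a web crawler in Python. The goal is (eventually) to iterate over a list of domains, crawl every local URL within each domain, and dump all of the HTML content back to my host.
For now I am giving the script a single domain ( http://www.scrapethissite.com ). I have defined a queue, plus sets for new, processed, local, foreign, and broken URLs. Judging by the output, the script reads the page's HTML and picks up the first local URL ('https://scrapethissite.com/lessons/'), but does not go on to process that URL through the queue.
Ideally the script would write the output for each input URL into its own directory while preserving the DOM of the TLD. For now, though, I would be happy just to get the queue working.
Here is my code: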
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from collections import deque
from urllib.parse import urlsplit, urlparse
from urllib.request import Request, urlopen
from html.parser import HTMLParser
import lxml

#set targets
url = 'https://scrapethissite.com'
new_urls = deque(([url]))
processed_urls = set()
local_urls = set()
foreign_urls = set()
broken_urls = set()

#process urls until we exhaust queue
while len(new_urls):
    url = new_urls.popleft()
    processed_urls.add(url)
    print("processing %s..." % url)
    try:
        response = requests.get(url).text
        print(response)
        print(new_urls)
    except(requests.exceptions.MissingSchema, requests.exceptions.ConnectionError, requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):
        broken_urls.add(url)
        print(("this %s failed") % url)
        continue

##get base URL to differentiate local and foreign addresses
###not working
print("differentiating addresses...")
parts = urlsplit(url)
base = "{0.netloc}".format(parts)
strip_base = base.replace("www.","")
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url

#initialize soup
print("initializing soup...")
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')

#get links in HTML
for link in soup.find_all('a'):
    anchor = link.attrs["href"] if "href" in link.attrs else ''

#scrape page for links
print("scraping links...")
if anchor.startswith('/'):
    local_link = base_url + anchor
    local_urls.add(local_link)
elif strip_base in anchor:
    local_urls.add(anchor)
elif not anchor.startswith('http'):
    local_link = path + anchor
    local_urls.add(local_link)
else:
    foreign_urls.add(anchor)
    print(foreign_urls)
    print(local_urls)

#to crawl local urls
#...and add them to sets
for i in local_urls:
    if not i in new_urls and not i in processed_urls:
        new_urls.append(i)

#to crawl all URLs
#for i in local_urls:
    #if not link in new_urls and not link in processed_urls:
        #new_urls.append(link)

print("new urls: %s" % new_urls)
print("processed urls: %s" % processed_urls)
print("broken urls: %s" % broken_urls)
Here is the output:
(base) $ crawler % python3 bsoup_crawl.py
processing https://scrapethissite.com...
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Scrape This Site | A public sandbox for learning web scraping</title>
<link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="A public sandbox for learning web scraping">
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">
<link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="/static/css/styles.css">
</head>
<body>
<nav id="site-nav">
<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a href="/" class="nav-link hidden-sm hidden-xs">
<img src="/static/images/scraper-icon.png" id="nav-logo">
Scrape This Site
</a>
</li>
<li id="nav-sandbox">
<a href="/pages/" class="nav-link">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
Sandbox
</a>
</li>
<li id="nav-lessons">
<a href="/lessons/" class="nav-link">
<i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
Lessons
</a>
</li>
<li id="nav-faq">
<a href="/faq/" class="nav-link">
<i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
FAQ
</a>
</li>
<li id="nav-login" class="pull-right">
<a href="/login/" class="nav-link">
Login
</a>
</li>
</ul>
</div>
</div>
</nav>
<script type="text/javascript">
var path = document.location.pathname;
var tab = undefined;
if (path === "/"){
tab = document.querySelector("#nav-homepage");
} else if (path.indexOf("/faq/") === 0){
tab = document.querySelector("#nav-faq");
} else if (path.indexOf("/lessons/") === 0){
tab = document.querySelector("#nav-lessons");
} else if (path.indexOf("/pages/") === 0) {
tab = document.querySelector("#nav-sandbox");
} else if (path.indexOf("/login/") === 0) {
tab = document.querySelector("#nav-login");
}
tab.classList.add("active")
</script>
<div id="homepage">
<section id="hero">
<div class="container">
<div class="row">
<div class="col-md-12 text-center">
<img src="/static/images/scraper-icon.png" id="townhall-logo" />
<h1>
Scrape This Site
</h1>
<p class="lead">
The internet's best resource for learning <strong>web scraping</strong>.
</p>
<br><br><br>
<a href="/pages/" class="btn btn-lg btn-default" />Explore Sandbox</a>
<a href="/lessons/" class="btn btn-lg btn-primary" />
<i class="glyphicon glyphicon-education"></i>
Begin Lessons →
</a>
</div><!--.col-->
</div><!--.row-->
</div><!--.container-->
</section>
</div>
<section id="footer">
<div class="container">
<div class="row">
<div class="col-md-12 text-center text-muted">
Lessons and Videos © Hartley Brody 2018
</div><!--.col-->
</div><!--.row-->
</div><!--.container-->
</section>
</body>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js" integrity="sha256-Sk3nkD6mLTMOF0EOpNtsIry+s1CsaqQC1rVLTAy+0yc= sha512-K1qjQ+NcF2TYO/eI3M6v8EiNYZfA95pQumfvcVrTHtwQVDG+aHRqLi/ETn2uB+1JqwYqVG3LIvdm9lj6imS/pQ==" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.js"></script>
<link href="https://cdnjs.cloudflare.com/ajax/libs/pnotify/2.1.0/pnotify.core.min.css" rel="stylesheet" type="text/css">
<!-- pnotify messages -->
<script type="text/javascript">
PNotify.prototype.options.styling = "bootstrap3";
$(function(){
});
$(function () {
$('[data-toggle="tooltip"]').tooltip()
})
</script>
<!-- golbal video controls -->
<script type="text/javascript">
$("video").hover(function() {
$(this).prop("controls", true);
}, function() {
$(this).prop("controls", false);
});
$("video").click(function() {
if( this.paused){
this.play();
}
else {
this.pause();
}
});
</script>
<!-- insert google analytics here -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-41551755-8', 'auto');
ga('send', 'pageview');
</script>
<!-- Facebook Pixel Code -->
<script>
!function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
document,'script','https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '764287443701341');
fbq('track', "PageView");</script>
<noscript><img height="1" width="1" style="display:none"
src="https://www.facebook.com/tr?id=764287443701341&ev=PageView&noscript=1"
/></noscript>
<!-- End Facebook Pixel Code -->
<!-- Google Code for Remarketing Tag -->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 950945448;
var google_custom_params = window.google_tag_params;
var google_remarketing_only = true;
/* ]]> */
</script>
<script type="text/javascript" src="//www.googleadservices.com/pagead/conversion.js">
</script>
<noscript>
<div style="display:inline;">
<img height="1" width="1" style="border-style:none;" alt="" src="//googleads.g.doubleclick.net/pagead/viewthroughconversion/950945448/?guid=ON&script=0"/>
</div>
</noscript>
<!-- Global site tag (gtag.js) - Google AdWords: 950945448 -->
<script async src="https://www.googletagmanager.com/gtag/js?id=AW-950945448"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'AW-950945448');
</script>
</html>
deque([])
differentiating addresses...
initializing soup...
scraping links...
new urls: deque(['https://scrapethissite.com/lessons/', <a class="btn btn-lg btn-primary" href="/lessons/"></a>])
processed urls: {'https://scrapethissite.com'}
broken urls: set()
Solution
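The output above prints "scraping links..." only once and shows `deque([])` before "differentiating addresses...". That is what one would expect if the link classification ran once after the `for link in soup.find_all('a')` loop instead of once per link, and if the queue were refilled only after the `while` loop had already drained it. This reading is an assumption based on the output alone. Under it, a minimal sketch of the intended flow could look like the code below. Every name in it (`HrefCollector`, `host_of`, `classify_links`, `crawl`, `local_path_for`, `max_pages`, the `dump` root) is hypothetical. It uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without third-party packages; `soup.find_all('a', href=True)` could take the collector's place. Relative links are resolved with `urljoin` rather than string concatenation.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urldefrag, urljoin, urlsplit


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def host_of(url):
    """Lower-cased host without a leading 'www.', so both spellings compare equal."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_links(page_url, html):
    """Return (local, foreign) sets of absolute URLs linked from one page."""
    collector = HrefCollector()
    collector.feed(html)
    local, foreign = set(), set()
    for href in collector.hrefs:
        # Classification happens inside the per-link loop, once per link.
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if host_of(absolute) == host_of(page_url):
            local.add(absolute)
        else:
            foreign.add(absolute)
    return local, foreign


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of local links; fetch(url) returns HTML or raises OSError."""
    queue = deque([start_url])
    seen = {start_url}  # everything ever queued, so nothing is queued twice
    pages, broken, foreign = {}, set(), set()
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            broken.add(url)
            continue
        pages[url] = html
        local, external = classify_links(url, html)
        foreign |= external
        # The queue is refilled inside the while loop, before the next pop.
        for link in sorted(local - seen):
            seen.add(link)
            queue.append(link)
    return pages, broken, foreign


def local_path_for(url, root="dump"):
    """Map a URL to a file path under root/<host>/ (query strings are ignored)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"
    return str(PurePosixPath(root, parts.netloc, path))
```

With requests, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`; as far as I know, requests' exceptions derive from `OSError`, so the `except` clause would cover them. Note that URLs are compared as plain strings here, so `https://scrapethissite.com` and `https://scrapethissite.com/` count as two separate pages.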