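python - Scrapy spider finishes early, before checking all links
Problem description
I am using a Scrapy solution to search a list of websites for email addresses (based on https://towardsdatascience.com/web-scraping-to-extract-contact-information-part-1-mailing-lists-854e8a8844d2). The spider runs as expected until it "finishes" after about 30 seconds, but by then it has crawled only about 25 of the nearly 5000 links I pass to it in start_urls. Here is the spider: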
import re

import pandas as pd
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        # Collect every link on the page, plus the page itself.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_item)

    def parse_item(self, response):
        # Skip URLs containing a rejected word (self.reject is set outside this class).
        for word in self.reject:
            if word in str(response.url):
                return
        html_text = str(response.text)
        raw_mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        # Keep only matches that contain a common top-level domain.
        mail_list = []
        for email in raw_mail_list:
            if '.com' in email or '.org' in email or '.net' in email or '.co' in email:
                mail_list.append(email)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        # Append the rows to the CSV at self.path (also set outside this class).
        df.to_csv(self.path, mode='a', header=False)
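The extraction step in parse_item can be exercised without running a crawl. A minimal sketch, assuming a hypothetical helper named extract_emails that applies the same regular expression and the same substring filter to a plain string (the sample text is made up):

```python
import re

# Same pattern as in parse_item, written as a raw string.
EMAIL_PATTERN = re.compile(r'\w+@\w+\.{1}\w+')
# Substrings the spider accepts; note that '.co' also covers '.com'.
ALLOWED = ('.com', '.org', '.net', '.co')


def extract_emails(html_text):
    """Return the regex matches that contain one of the ALLOWED substrings."""
    matches = EMAIL_PATTERN.findall(html_text)
    return [m for m in matches if any(part in m for part in ALLOWED)]


# Made-up sample text, not taken from any crawled page.
sample = 'Write to info@example.com or sales@example.org <img src="logo@2x.png">'
print(extract_emails(sample))
```

The pattern also matches logo@2x.png in the sample, and the substring filter is what drops it.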
Here is what the debug log shows:
2020-05-31 04:15:13 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-31 04:15:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 8,
'downloader/exception_type_count/builtins.ValueError': 1,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 4,
'downloader/request_bytes': 175382,
'downloader/request_count': 475,
'downloader/request_method_count/GET': 475,
'downloader/response_bytes': 20920027,
'downloader/response_count': 467,
'downloader/response_status_count/200': 354,
'downloader/response_status_count/301': 79,
'downloader/response_status_count/302': 17,
'downloader/response_status_count/303': 3,
'downloader/response_status_count/404': 2,
'downloader/response_status_count/406': 4,
'downloader/response_status_count/503': 6,
'downloader/response_status_count/999': 2,
'dupefilter/filtered': 370,
'elapsed_time_seconds': 28.004682,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 31, 8, 15, 13, 458883),
'httperror/response_ignored_count': 10,
'httperror/response_ignored_status_count/404': 2,
'httperror/response_ignored_status_count/406': 4,
'httperror/response_ignored_status_count/503': 2,
'httperror/response_ignored_status_count/999': 2,
'log_count/DEBUG': 473,
'log_count/ERROR': 13,
'log_count/INFO': 20,
'request_depth_max': 1,
'response_received_count': 361,
'retry/count': 9,
'retry/max_reached': 4,
'retry/reason_count/503 Service Unavailable': 4,
'retry/reason_count/twisted.internet.error.DNSLookupError': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 3,
'scheduler/dequeued': 475,
'scheduler/dequeued/memory': 475,
'scheduler/enqueued': 475,
'scheduler/enqueued/memory': 475,
'spider_exceptions/AttributeError': 5,
'start_time': datetime.datetime(2020, 5, 31, 8, 14, 45, 454201)}
2020-05-31 04:15:13 [scrapy.core.engine] INFO: Spider closed (finished)
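A few figures in the stats above can be compared directly with the expected crawl size. A minimal sketch, assuming a hypothetical helper named summarize and an expected count of roughly 5000 start URLs taken from the description; the dictionary holds only values copied from the log:

```python
# Values copied from the stats dump above.
stats = {
    'scheduler/enqueued': 475,
    'dupefilter/filtered': 370,
    'request_depth_max': 1,
    'response_received_count': 361,
    'spider_exceptions/AttributeError': 5,
}


def summarize(stats, expected_start_urls):
    """Compare what the scheduler saw with the number of start URLs supplied."""
    enqueued = stats['scheduler/enqueued']
    return {
        'enqueued': enqueued,
        # Positive means fewer requests were scheduled than start URLs supplied.
        'shortfall': expected_start_urls - enqueued,
        'max_depth': stats['request_depth_max'],
        'callback_errors': stats['spider_exceptions/AttributeError'],
    }


print(summarize(stats, expected_start_urls=5000))
```

The scheduler saw 475 requests in total, counting both start URLs and the links extracted from them, so most of the roughly 5000 start URLs were never scheduled at all. The maximum depth of 1 matches the code: parse_item extracts emails and does not follow further links.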
Any help would be greatly appreciated!
Solution
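One hypothesis that fits the stats in the question, though the log excerpt does not confirm it: Scrapy builds one Request per entry in start_urls, and Request raises ValueError for a URL that has no scheme. To my understanding, in the Scrapy releases current when this was asked, an exception raised while generating start requests ended that generation, so entries after a malformed one would never be scheduled. A minimal sketch for checking the input list before the crawl, using only the standard library; split_urls and the sample list are hypothetical:

```python
from urllib.parse import urlparse


def split_urls(urls):
    """Partition urls into (usable, suspect) by scheme and host."""
    usable, suspect = [], []
    for url in urls:
        # Non-string entries (for example None or NaN from a spreadsheet) are suspect.
        cleaned = url.strip() if isinstance(url, str) else ''
        parsed = urlparse(cleaned)
        if parsed.scheme in ('http', 'https') and parsed.netloc:
            usable.append(cleaned)
        else:
            suspect.append(url)
    return usable, suspect


# Hypothetical sample standing in for the real list of roughly 5000 links.
candidates = ['https://example.com/', 'example.org/contact', 'http://example.net']
usable, suspect = split_urls(candidates)
print(usable)
print(suspect)
```

Passing only usable as start_urls and inspecting suspect would show whether bad entries explain the early finish. If the list turns out to be clean, the 13 entries counted under log_count/ERROR in the full log are the next place to look for the underlying exception.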