python-3.x - How do I get the value of a scrapy argument from pipelines.py?
Problem Description
I am new to scrapy. I found a scraper on GitHub that harvests email addresses from websites.
The spider is invoked with an argument passed on the command line (scrapy forwards each -a name=value pair to the spider's __init__ as a keyword argument):
scrapy crawl spider -a domain="example.com" -o emails-found.csv
The spider stores its results in a csv file. I want to store the results in my MySQL DB instead.
So I made some changes in pipelines.py.
I tried very hard all afternoon to get the value of this "domain" argument. You can see my previous post on this topic here: How to import a variable of my spider class from my pipelines.py file?
But I had no success. The log tells me:
AttributeError: type object 'ThoroughSpider' has no attribute ...
I tried with start_urls, domain and allowed_domains, but I always get the same "... has no attribute ..." log message.
@gangabass kindly suggested a good idea: yielding the domain so that it can be picked up from pipelines.py (a sketch of what this could look like follows below).
But as I said, I am a newbie and I don't know how to do that.
I have already spent the whole afternoon looking for a solution without any success (please don't laugh, it is not that easy for me :-)). I am sure this is quick and easy for an expert.
At this point I don't really care which method is used; I just want to get hold of this domain value in my pipelines.py.
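For reference, here is a minimal sketch of what @gangabass's suggestion might amount to inside the spider's parse() callback (shown in full below), assuming EmailAddressItem were given an extra domain field -- the original item only declares email_address:

# hypothetical variant of the yield in parse() -- requires EmailAddressItem
# to declare a `domain` field in items.py, which the original code does not
item = EmailAddressItem()
item['email_address'] = found_address
item['domain'] = self.allowed_domains[0]  # carry the -a domain argument in the item
yield item

The pipeline would then read the value as item['domain'] instead of reaching into the spider object.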
Here is the spider's code:
# implementation of the thorough spider
import re
from urllib.parse import urljoin, urlparse

import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS

from scraper.items import EmailAddressItem

# scrapy.linkextractors has a good list of binary extensions, only slight tweaks needed
IGNORED_EXTENSIONS.extend(['ico', 'tgz', 'gz', 'bz2'])


def get_extension_ignore_url_params(url):
    path = urlparse(url).path  # conveniently lops off all params leaving just the path
    extension = re.search(r'\.([a-zA-Z0-9]+$)', path)
    if extension is not None:
        return extension.group(1)
    else:
        return "none"  # don't want to return NoneType, it will break comparisons later
class ThoroughSpider(scrapy.Spider):
    name = "spider"

    def __init__(self, domain=None, subdomain_exclusions=[], crawl_js=False):
        self.allowed_domains = [domain]
        start_url = "http://" + domain
        self.start_urls = [
            start_url
        ]
        self.subdomain_exclusions = subdomain_exclusions
        self.crawl_js = crawl_js
        # boolean command line parameters are not converted from strings automatically
        if str(crawl_js).lower() in ['true', 't', 'yes', 'y', '1']:
            self.crawl_js = True
    def parse(self, response):
        # print("Parsing ", response.url)
        all_urls = set()

        # use xpath selectors to find all the links, this proved to be more effective than using
        # the scrapy provided LinkExtractor during testing
        selector = scrapy.Selector(response)

        # grab all hrefs from the page
        # print(selector.xpath('//a/@href').extract())
        all_urls.update(selector.xpath('//a/@href').extract())

        # also grab all sources, this will yield a bunch of binary files which we will filter out
        # below, but it has the useful property that it will also grab all javascript file links
        # as well, we need to scrape these for urls to uncover js code that yields up urls when
        # executed! An alternative here would be to drive the scraper via selenium to execute the
        # js as we go, but this seems slightly simpler
        all_urls.update(selector.xpath('//@src').extract())

        # custom regex that works on javascript files to extract relative urls hidden in quotes.
        # This is a workaround for sites that need js executed in order to follow links -- aka
        # single-page angularJS type designs that have clickable menu items that are not rendered
        # into <a> elements but rather as clickable span elements - e.g. jana.com
        all_urls.update(selector.re(r'"(\/[-\w\d\/\._#?]+?)"'))

        for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'):
            item = EmailAddressItem()
            item['email_address'] = found_address
            yield item

        for url in all_urls:
            # ignore commonly ignored binary extensions - might want to put PDFs back in list and
            # parse with a pdf->txt extraction library to strip emails from whitepapers, resumes,
            # etc.
            extension = get_extension_ignore_url_params(url)
            if extension in IGNORED_EXTENSIONS:
                continue

            # convert all relative paths to absolute paths
            if 'http' not in url:
                url = urljoin(response.url, url)

            if extension.lower() != 'js' or self.crawl_js is True:
                yield scrapy.Request(url, callback=self.parse)
Is there a kind expert who could show me how to do this?
Solution
You can simply access the spider's arguments in the pipeline like this:

spider.domain

You have to make the argument available as an attribute by adding the following to the spider's __init__:

self.domain = domain
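To make this concrete, here is a minimal sketch of a pipeline that writes into MySQL, as the question asks. The domain argument and the email_address field come from the question; the pymysql driver, the class name MysqlEmailPipeline, the connection credentials and the emails_found table are assumptions for illustration only. It relies on self.domain = domain having been added as the first line of ThoroughSpider.__init__:

# pipelines.py
import pymysql  # assumption: any DB-API 2.0 MySQL driver works the same way


class MysqlEmailPipeline:
    def open_spider(self, spider):
        # called once when the spider starts -- open the DB connection here
        # (host/user/password/database are placeholders; adapt to your setup)
        self.connection = pymysql.connect(
            host='localhost',
            user='user',
            password='secret',
            database='emails_db',
        )

    def close_spider(self, spider):
        # called once when the spider finishes
        self.connection.close()

    def process_item(self, item, spider):
        # spider.domain is the value passed with -a domain="example.com";
        # it is available here because __init__ stored it as an attribute
        with self.connection.cursor() as cursor:
            cursor.execute(
                "INSERT INTO emails_found (domain, email_address) VALUES (%s, %s)",
                (spider.domain, item['email_address']),
            )
        self.connection.commit()
        return item

Remember to enable the pipeline in settings.py, e.g. ITEM_PIPELINES = {'scraper.pipelines.MysqlEmailPipeline': 300} (the module path follows the scraper package used in the spider's imports).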