python-3.x - Scrapy: hit the login spider before the current spider crawls
Problem description
I have three spiders, as follows:
import scrapy
from scrapy import FormRequest

class LogInSpider(scrapy.Spider):
    name = 'DomainLogin'
    allowed_domains = ['domain.io']
    start_urls = ['https://www.domain.io/signin']

    def parse(self, response):
        # email and password are assumed to be defined elsewhere
        return FormRequest.from_response(response, formdata={
            'email': email,
            'password': password,
        })

class SelectProduct(scrapy.Spider):
    # Crawl and select products
    ...

class AddProductToCart(scrapy.Spider):
    # Form request to cart
    ...
When I run the "SelectProduct" spider, I want to hit "LogInSpider" first, have the request that follows the login handled by the parse() method of "SelectProduct", and finally hit the "AddProductToCart" spider.
I also tried CrawlerRunner(), but the problem is that the Scrapy request object changes by the time it reaches "SelectProduct" (it is not the one derived from the login). See the code below.
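from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

@defer.inlineCallbacks
def crawl():
    # Each spider runs only after the previous one has finished
    yield runner.crawl(LogInSpider)
    yield runner.crawl(SelectProduct)
    yield runner.crawl(AddProductToCart)
    reactor.stop()

configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
crawl()
reactor.run()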
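Any suggestions for changing the workflow are welcome.
Note: the three spiders above need to stay in separate classes to keep the code focused.
Solution
I implemented this logic by setting the callback of the login spider's request to the parse method of whichever spider has to run after the login spider has signed in to the site. I made sure every follow-up spider uses the same callback method name, "parse".
I also used "argparse" to read from the CLI which spider needs to run after the login.
LoadSpider.py: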
import argparse

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from LogInSpider import LogInSpider

# Initiate the parser
parser = argparse.ArgumentParser()
parser.add_argument("-fc", help="Follow Spider Class")

# Read arguments from the command line
args = parser.parse_args()

configure_logging()
runner = CrawlerRunner(get_project_settings())
# Extra keyword arguments are passed on to the spider's __init__
runner.crawl(LogInSpider, parseClass=args.fc)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
LogInSpider.py:
import scrapy
from scrapy import FormRequest
from SelectProduct import SelectProduct

class LogInSpider(scrapy.Spider):
    name = 'DomainLogin'
    allowed_domains = ['domain.io']
    start_urls = ['https://www.domain.io/signin']

    def __init__(self, parseClass=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Name of the spider class whose parse() handles the post-login response
        self.parseClass = parseClass

    def parse(self, response):
        # email and password are assumed to be defined elsewhere
        return FormRequest.from_response(response, formdata={
            'email': email,
            'password': password,
        }, callback=eval(self.parseClass).parse)

class SelectProduct(scrapy.Spider):
    # Crawl and select products
    ...
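Note that `eval(self.parseClass)` evaluates whatever string arrives from the command line. One possible alternative, sketched here under assumptions, is to look the name up in an explicit mapping and take `parse` from an instance, so that the callback is a bound method. The classes below are plain-Python stand-ins rather than Scrapy spiders, and `FOLLOW_SPIDERS` and `resolve_callback` are hypothetical names that do not appear in the answer above.

```python
class SelectProduct:
    """Stand-in for the real spider; only the parse method matters here."""

    def parse(self, response):
        return f"SelectProduct parsed {response}"


class AddProductToCart:
    """Second stand-in follow-up spider."""

    def parse(self, response):
        return f"AddProductToCart parsed {response}"


# Hypothetical registry: only names listed here can be selected from the CLI.
FOLLOW_SPIDERS = {
    "SelectProduct": SelectProduct,
    "AddProductToCart": AddProductToCart,
}


def resolve_callback(name):
    """Return the bound parse method of the follow-up class called `name`."""
    try:
        follow_cls = FOLLOW_SPIDERS[name]
    except KeyError:
        raise ValueError(f"unknown follow spider: {name!r}") from None
    return follow_cls().parse


callback = resolve_callback("SelectProduct")
print(callback("<response>"))
```

In the login spider, the result of such a lookup could take the place of `eval(self.parseClass).parse`. Whether an instance created this way has everything a real spider's `parse` relies on is not something this sketch establishes.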
Command:
python3 LoadSpider.py -fc "SelectProduct"
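Because the launcher declares the option as `-fc`, argparse only fills `args.fc` when the value is passed with that flag; a bare `fc=...` token is treated as an unexpected positional argument and the parser exits with an error. A minimal sketch of just the parsing step, using the same parser definition as LoadSpider.py (the helper name `parse_follow_class` is hypothetical):

```python
import argparse


def parse_follow_class(argv):
    """Return the value of -fc from an argument list (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-fc", help="Follow Spider Class")
    return parser.parse_args(argv).fc


# Flag and value passed as two tokens, as a shell would deliver them.
print(parse_follow_class(["-fc", "SelectProduct"]))

# Without the flag, the value stays None.
print(parse_follow_class([]))
```

Passing an explicit list to `parse_args` keeps the sketch independent of `sys.argv`, which makes the behaviour easy to check in isolation.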