Scrapy: Hit a Login Spider Before the Current Spider's Crawl

Problem Description

I have three spiders, as follows:

import scrapy
from scrapy.http import FormRequest

class LogInSpider(scrapy.Spider):
    name = 'DomainLogin'
    allowed_domains = ['domain.io']
    start_urls = ['https://www.domain.io/signin']

    def parse(self, response):
        # Submit the sign-in form; email and password are defined elsewhere
        return FormRequest.from_response(response, formdata={
            'email': email,
            'password': password,
        })

class SelectProduct(scrapy.Spider):
    # Crawl and select products

class AddProductToCart(scrapy.Spider):
    # Form request to add the product to the cart

Here, when I run the "SelectProduct" spider, I want it to hit "LogInSpider" first, pass the follow-up request from the login to the parse() method of "SelectProduct", and finally hit the "AddProductToCart" spider.

I also tried CrawlerRunner(), but the issue I face with it is that the Scrapy request object changes by the time "SelectProduct" runs: it is no longer the one derived from the login, so the session does not carry over. Check the code below.

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

@defer.inlineCallbacks
def crawl():
    # Run the three spiders sequentially, then stop the reactor
    yield runner.crawl(LogInSpider)
    yield runner.crawl(SelectProduct)
    yield runner.crawl(AddProductToCart)
    reactor.stop()

configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
crawl()
reactor.run()
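
As far as I can tell, each runner.crawl() call builds a fresh Crawler with its own in-memory cookie jar, so the session cookies that "LogInSpider" earns never reach "SelectProduct". One workaround I considered is capturing the cookies and handing them to the next spider explicitly; a rough sketch (the session_cookies argument and the products URL are made up for illustration):

import scrapy

class SelectProduct(scrapy.Spider):
    name = 'SelectProduct'

    def __init__(self, session_cookies=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # session_cookies: hypothetical dict captured during the login crawl
        self.session_cookies = session_cookies or {}

    def start_requests(self):
        # Re-attach the authenticated session to this crawl's first request
        yield scrapy.Request(
            'https://www.domain.io/products',  # illustrative URL
            cookies=self.session_cookies,
        )

    def parse(self, response):
        pass  # crawl and select products

But that spreads session handling across every spider, which is why I am asking about the workflow instead.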

Any suggestions on changing the workflow are welcome.

Note: the three spiders above need to remain in separate classes to keep the code organized.

Tags: python-3.x, scrapy, web-crawler, operator-overloading

Solution


I implemented the logic above by setting the login spider's parse callback to whichever spider has to be invoked once the login spider has signed in to the site. I made sure every chained spider names its request callback method "parse" (the same callback method name throughout).

I also used "argparse" to take a CLI argument naming the spider that needs to run next after login.

LoadSpider.py:

import argparse
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from LogInSpider import LogInSpider

# Initiate the parser
parser = argparse.ArgumentParser()
parser.add_argument("-fc", help="Follow Spider Class")
# Read arguments from the command line
args = parser.parse_args()

configure_logging()
runner = CrawlerRunner(get_project_settings())
# Hand the follow-up spider's class name to LogInSpider as a spider argument
runner.crawl(LogInSpider, parseClass=args.fc)
# join() returns a Deferred that fires once all crawls are done
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

LogInSpider.py:

import scrapy
from scrapy.http import FormRequest
from SelectProduct import SelectProduct

class LogInSpider(scrapy.Spider):
    name = 'DomainLogin'
    allowed_domains = ['domain.io']
    start_urls = ['https://www.domain.io/signin']

    def __init__(self, parseClass=None, *args, **kwargs):
        super(LogInSpider, self).__init__(*args, **kwargs)
        # Name of the spider class whose parse() handles the post-login response
        self.parseClass = parseClass

    def parse(self, response):
        # Log in, then hand the follow-up response to the chained spider's parse()
        return FormRequest.from_response(response, formdata={
            'email': email,
            'password': password,
        }, callback=eval(self.parseClass).parse)

SelectProduct.py:

class SelectProduct(scrapy.Spider):
    # Crawl and select products
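
Side note: eval() will execute whatever string arrives via -fc, so a plain lookup table gives the same dispatch with less risk. A minimal sketch (the FOLLOW_SPIDERS name is my own, not part of the code above):

from SelectProduct import SelectProduct

# Explicit registry instead of eval(): an unknown -fc value fails fast
# with a KeyError instead of evaluating arbitrary input.
FOLLOW_SPIDERS = {
    'SelectProduct': SelectProduct,
}

# Inside LogInSpider.parse, the callback lookup then becomes:
#     callback=FOLLOW_SPIDERS[self.parseClass].parse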

Command:

python3 LoadSpider.py -fc SelectProduct
