scrapy-redis (distributed crawler)

eilinge 2018-09-03 16:17

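Distributed crawling: Scrapy itself was not designed as a framework for distributed crawling, but the third-party library scrapy-redis extends it with that capability, and the two together form a distributed Scrapy crawling framework. A distributed crawler needs some communication mechanism to coordinate the individual crawlers, so that each one knows:

  1. What its current crawl task is, i.e. which pages to download and extract data from (task assignment)
  2. Whether the current task has already been done by another crawler (deduplication)
  3. How to store the scraped data (data storage)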
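The three responsibilities above can be illustrated without Redis at all. The following is a minimal in-memory sketch, not how scrapy-redis is implemented: `SharedState` is a hypothetical stand-in for the Redis server, and `worker_step`, the crawler names and the link graph are made up for illustration.

```python
from collections import deque


class SharedState:
    """Hypothetical stand-in for the Redis server shared by all crawlers."""

    def __init__(self):
        self.queue = deque()  # pending URLs: task assignment
        self.seen = set()     # URLs already scheduled: deduplication
        self.items = []       # scraped results: data storage

    def schedule(self, url):
        """Queue a URL unless some crawler has already scheduled it."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_task(self):
        """Hand out the oldest pending URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def worker_step(name, state, links):
    """One crawler takes one task, stores a result and schedules new links.

    links maps a URL to the URLs found on that page (made-up data).
    """
    url = state.next_task()
    if url is None:
        return None  # nothing to do: a real crawler would wait here
    state.items.append({"crawler": name, "url": url})
    for found in links.get(url, []):
        state.schedule(found)
    return url


# Made-up link graph: both /a and /b link to /c.
LINKS = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"]}

state = SharedState()
state.schedule("/")
workers = ["crawler-1", "crawler-2"]
turn = 0
while state.queue:
    # The two crawlers alternate, each pulling from the same shared queue.
    worker_step(workers[turn % len(workers)], state, LINKS)
    turn += 1

crawled = [item["url"] for item in state.items]
print(crawled)
```

Although /c is linked from two pages that are handled by different crawlers, it is crawled only once, because both crawlers consult the same `seen` set.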

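Prerequisites: installing Redis and the basics of using it (http://www.runoob.com/redis/redis-keys.html)

As usual, here is a screenshot of the crawl results first. Go and try it yourselves! QAQ

Start crawling:

1. First, look at the overall file layout of the distributed crawler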

Books
  Books
    spiders
      __init__.py
      books.py
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
  scrapy_redis  (third-party library, download from https://github.com/rmax/scrapy-redis)
    __init__.py
    connection.py
    defaults.py
    dupefilter.py
    picklecompat.py
    pipelines.py
    queue.py
    scheduler.py
    spiders.py
    utils.py
  scrapy.cfg

2. It looks complicated, but little changes compared with an ordinary spider: the scripts under scrapy_redis do not need to be modified, only imported
books.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from Books.items import BooksItem
from scrapy_redis.spiders import RedisSpider

# class BooksSpider(scrapy.Spider):
class BooksSpider(RedisSpider):  # the key change for distributed crawling: inherit from RedisSpider
    name = 'books'
    # allowed_domains = ['books.toscrape.com']
    # start_urls is commented out on purpose: the start URL is pushed
    # through redis-cli once the spiders are running
    # start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for sel in response.css('article.product_pod'):
            # Create a fresh item per book so yielded items do not share state
            book = BooksItem()
            book["name"] = sel.css('h3 a::attr(title)').extract_first()
            book["price"] = sel.css('div.product_price p::text').extract_first()
            yield book

        # Follow the "next" link; the last page has none, so guard against an empty list
        links = LinkExtractor(restrict_css='ul.pager li.next').extract_links(response)
        if links:
            yield scrapy.Request(links[0].url, callback=self.parse)

3. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.item import Item
import pymongo
import redis


class BooksPipeline(object):
    def process_item(self, item, spider):
        return item


class PriceConverterPipeline(object):
    # Convert the scraped price with a fixed exchange rate
    exchange_rate = 8.5309

    def process_item(self, item, spider):
        # Skip the leading currency symbol before converting to float
        price = float(item['price'][1:]) * self.exchange_rate
        item['price'] = '$%.2f' % price
        return item


class DuplicatesPipeline(object):
    # Drop items whose name has already been seen.
    # The set lives in one process, so it only deduplicates within a single crawler.
    def __init__(self):
        self.set = set()

    def process_item(self, item, spider):
        name = item["name"]
        if name in self.set:
            raise DropItem("Duplicate book found:%s" % item)

        self.set.add(name)
        return item


class MongoDBPipeline(object):
    # Store items in MongoDB
    @classmethod
    def from_crawler(cls, crawler):
        cls.DB_URL = crawler.settings.get("MONGO_DB_URL", 'mongodb://localhost:27017/')
        cls.DB_NAME = crawler.settings.get("MONGO_DB_NAME", 'scrapy_data')
        return cls()

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URL)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider, named after the spider
        collection = self.db[spider.name]
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)

        return item


class RedisPipeline:
    # Store items in a Redis database
    def open_spider(self, spider):
        db_host = spider.settings.get("REDIS_HOST", '10.240.176.134')
        #db_host = spider.settings.get("REDIS_HOST",'localhost')
        db_port = spider.settings.get("REDIS_PORT", 6379)
        db_index = spider.settings.get("REDIS_DB_INDEX", 0)
        #db_passwd = spider.settings.get('REDIS_PASSWD','redisredis')

        #self.db_conn = redis.StrictRedis(host=db_host,port=db_port,db=db_index,password=db_passwd)
        self.db_conn = redis.StrictRedis(host=db_host, port=db_port, db=db_index)
        self.item_i = 0

    def close_spider(self, spider):
        self.db_conn.connection_pool.disconnect()

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)

        # Each item becomes one Redis hash under the key books12:<counter>
        self.item_i += 1
        # hmset is deprecated in redis-py 3.5+; hset(name, mapping=item) is its replacement
        self.db_conn.hmset('books12:%s' % self.item_i, item)

4.1 settings.py

(1) Add a proxy, in middlewares.py

class BooksSpiderMiddleware(object):
    # Despite its name, this class is registered under DOWNLOADER_MIDDLEWARES
    # and only implements process_request, so it acts as a downloader middleware.

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # Route every request through the proxy
        request.meta['proxy'] = 'http://10.240.252.16:911'

(2) settings.py

DOWNLOADER_MIDDLEWARES = {
    #'Books.middlewares.BooksDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    'Books.middlewares.BooksSpiderMiddleware': 125,
}
ITEM_PIPELINES = {
    #'Books.pipelines.BooksPipeline': 300,
    'Books.pipelines.PriceConverterPipeline': 300,
    'Books.pipelines.DuplicatesPipeline': 350,
    #'Books.pipelines.MongoDBPipeline': 400,
    'Books.pipelines.RedisPipeline': 404,
}
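
The numbers in ITEM_PIPELINES decide the order: lower values run first, so each book is price-converted, then deduplicated, then written to Redis. A rough sketch of that behaviour in plain Python with no Scrapy dependency follows. `DropItem` here is a local stand-in for scrapy.exceptions.DropItem, `run_pipelines` is a hypothetical helper rather than a Scrapy API, and the sample books are made up.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PriceConverter:
    exchange_rate = 8.5309

    def process_item(self, item):
        # Strip the currency symbol, convert, and format with two decimals
        price = float(item["price"][1:]) * self.exchange_rate
        item["price"] = "$%.2f" % price
        return item


class Duplicates:
    def __init__(self):
        self.names = set()

    def process_item(self, item):
        if item["name"] in self.names:
            raise DropItem("Duplicate book found: %s" % item["name"])
        self.names.add(item["name"])
        return item


def run_pipelines(pipelines, items):
    """Pass each item through the pipelines in ascending priority order."""
    ordered = sorted(pipelines, key=pipelines.get)
    kept = []
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item)
        except DropItem:
            continue  # a dropped item never reaches the later pipelines
        kept.append(item)
    return kept


# Declared out of order on purpose: the priority value, not the position, decides.
pipelines = {Duplicates(): 350, PriceConverter(): 300}
books = [
    {"name": "Book A", "price": "£10.00"},
    {"name": "Book A", "price": "£10.00"},
    {"name": "Book B", "price": "£2.00"},
]
result = run_pipelines(pipelines, books)
print(result)
```

The second "Book A" is still price-converted before it is dropped, because the converter has the lower number. Giving the duplicate filter a number below 300 would skip that wasted work.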

4.2 Basic settings

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False

4.3 Storing to MongoDB

MONGO_DB_URL = 'mongodb://localhost:27017/'
MONGO_DB_NAME = 'eilinge'

FEED_EXPORT_FIELDS = ['name','price']  # field order in exported files

4.4 Redis connection and storage

REDIS_HOST = '10.240.176.134'
#REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB_INDEX = 0
#REDIS_PASSWD = 'redisredis'
REDIS_URL = 'redis://10.240.176.134:6379'    # the Redis database the crawlers use

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'     # replace Scrapy's own scheduler with the scrapy_redis one (this raised an error on FreeBSD: it needs to bind core, and the core path differs on FreeBSD)

DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'     # use scrapy_redis's RFPDupeFilter for request deduplication

SCHEDULER_PERSIST = True    # keep the request queue and dedup set in Redis after the spider stops (False clears them)
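
With the scheduler and duplicate filter pointed at Redis, every crawler checks one shared set before scheduling a request. A simplified sketch of the idea follows. The fingerprint here is a SHA-1 over method and URL, which is an assumption made for illustration and simpler than what Scrapy actually computes; `SharedDupeFilter` is a hypothetical class, and a Python set stands in for the Redis set.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint (illustrative, not Scrapy's algorithm)."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(b" ")
    h.update(url.encode("utf-8"))
    return h.hexdigest()


class SharedDupeFilter:
    """Hypothetical filter; `seen` stands in for a Redis set shared by all hosts."""

    def __init__(self, seen):
        self.seen = seen

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.seen:
            return True
        # With Redis, the check and the insert collapse into one SADD call,
        # which reports how many members were newly added.
        self.seen.add(fp)
        return False


# Two filter objects sharing one set play the role of two crawler hosts.
shared = set()
host_a = SharedDupeFilter(shared)
host_b = SharedDupeFilter(shared)
first = host_a.request_seen("GET", "http://books.toscrape.com/")
second = host_b.request_seen("get", "http://books.toscrape.com/")
print(first, second)
```

The second host sees the request as already handled even though it never scheduled it itself, which is what keeps the three crawlers from fetching the same page.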

Things to note:

    1. Suppose you have 3 servers that can crawl at the same time: copy the Books directory to each of them with scp
    2. Start the spider with the same command on all 3 hosts: scrapy crawl books
    3. The log stops at this point:
    2018-09-03 12:30:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-09-03 12:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    After startup, both the start-URL list and the request queue in Redis are empty, so all 3 spiders pause and wait. Set the start URL from any host with the Redis client:
    redis-cli -h 10.240.176.134
    10.240.176.134:6379>lpush books:start_urls "http://books.toscrape.com"
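
The same seeding step can be done from Python instead of redis-cli. This is a minimal sketch, assuming the redis-py package is installed and the spider keeps the default key pattern of spider name followed by `:start_urls`. `seed_start_urls`, `start_urls_key` and `connect` are hypothetical helper names; `seed_start_urls` works with any object that has an `lpush` method.

```python
def start_urls_key(spider_name):
    """Key the spider reads its start URLs from, e.g. 'books:start_urls'."""
    return "%s:start_urls" % spider_name


def seed_start_urls(conn, spider_name, urls):
    """Push each URL onto the spider's start-URL list and return the key used."""
    key = start_urls_key(spider_name)
    for url in urls:
        conn.lpush(key, url)
    return key


def connect(host="10.240.176.134", port=6379, db=0):
    """Open a connection to the shared Redis server (address taken from settings.py)."""
    import redis  # imported lazily so the helpers above work without redis-py
    return redis.StrictRedis(host=host, port=port, db=db)
```

Against a live server the call would be `seed_start_urls(connect(), "books", ["http://books.toscrape.com"])`. Whichever waiting spider picks up the URL first starts the crawl, and the others join in as follow-up requests reach the shared queue.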

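Additional notes:

The Redis configuration file, redis.conf

  #bind 127.0.0.1

  # accept requests from any IP
  bind 0.0.0.0

  # require password authentication for remote connections
  #requirepass redisredis

 Restarting the redis service on different systems

1. Ubuntu: sudo service redis-server restart

2. Linux (Fedora): service redis restart

3. FreeBSD: service redis onerestart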
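
After changing `bind` and restarting the service, it is worth confirming from each crawler host that the Redis port is reachable. Here is a small sketch using only the standard library. `port_open` is a hypothetical helper, and it only checks that a TCP connection can be opened, not that the server is Redis or that a password would be accepted.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False
```

On a crawler host this would be called as `port_open("10.240.176.134", 6379)`. A False result with Redis running usually points at the `bind` line or a firewall in between.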

 
