
Hedger-Lee, 2020-06-10 17:11

Site-wide data crawling with CrawlSpider

CrawlSpider is a subclass of Scrapy's Spider class that follows links according to a set of rules.

Workflow

1. Create a CrawlSpider-based spider file with the command:

scrapy genspider -t crawl spiderName www.xxxx.com

2. Construct the link extractor and the rule parser (Rule)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

What the link extractor does: it extracts the links on a page that match the specified rule.

Extraction rule: allow='regular expression'

LinkExtractor(allow=r'type=4&page=\d+')
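
To see which URLs this pattern lets through, the regular expression can be tried on its own with Python's re module. This is only a sketch: the candidate URLs are made up for illustration, and is_allowed is a hypothetical helper, not part of Scrapy. It assumes that allow patterns are searched for anywhere in the absolute URL rather than matched against the whole string.

```python
import re

# Same pattern as the LinkExtractor above.
PAGE_LINK = re.compile(r'type=4&page=\d+')

# Hypothetical URLs for illustration only.
candidates = [
    'http://wz.sun0769.com/index.php/question/questionType?type=4&page=30',
    'http://wz.sun0769.com/index.php/question/questionType?type=4&page=',
    'http://wz.sun0769.com/index.php/question/questionType?type=5&page=30',
]


def is_allowed(url):
    """Return True if the pattern occurs anywhere in the URL (search, not full match)."""
    return PAGE_LINK.search(url) is not None


matched = [url for url in candidates if is_allowed(url)]
print(matched)
```

Only the first URL passes. The second one fails because \d+ needs at least one digit after page=, and the third has a different type value.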

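What the rule parser (Rule) does: it takes the links extracted by the link extractor, sends a request to each one, and parses the returned page source according to the specified rule (the callback).

follow=True: the link extractor is applied again to every page reached through the page-number links it has already extracted. This way the URLs of all pages are discovered, which makes site-wide crawling possible.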

rules = (
    Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
)
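
The effect of follow can be illustrated with a toy model in plain Python. It is not how Scrapy is implemented: SITE is a hypothetical in-memory site in which each list page links only to its neighbours, crawl is a hypothetical helper, and the seen set merely stands in for Scrapy's duplicate-request filtering.

```python
import re
from collections import deque

PAGE_LINK = re.compile(r'type=4&page=\d+')

# Hypothetical site: page URL -> links found on that page.
SITE = {
    'list?type=4&page=': ['list?type=4&page=1', 'list?type=4&page=2', 'about.html'],
    'list?type=4&page=1': ['list?type=4&page=2'],
    'list?type=4&page=2': ['list?type=4&page=1', 'list?type=4&page=3'],
    'list?type=4&page=3': ['list?type=4&page=2'],
}


def crawl(start_url, follow):
    """Return the pages that would be handed to the callback, in visit order."""
    seen = set()
    parsed = []
    queue = deque([(start_url, True)])  # (url, is_start_page)
    while queue:
        url, is_start = queue.popleft()
        if not is_start:
            parsed.append(url)  # stands in for calling parse_item
        # Links are always extracted from the start page; from later pages
        # they are extracted only when follow is True.
        if is_start or follow:
            for link in SITE.get(url, []):
                if PAGE_LINK.search(link) and link not in seen:
                    seen.add(link)
                    queue.append((link, False))
    return parsed


print(crawl('list?type=4&page=', follow=False))
print(crawl('list?type=4&page=', follow=True))
```

With follow=False only the two pages linked from the start page are parsed. With follow=True page 3 is found as well, because the extractor is applied again to page 2.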

Note: link extractors and rule parsers are paired one-to-one, so each Rule wraps exactly one LinkExtractor.
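
The pairing can be pictured as a small routing table in which every pattern has exactly one callback. The sketch below uses only the re module: RULES and route are hypothetical names, and the second pattern is taken from the deep-crawl example later in this article. It assumes that a link matched by more than one rule goes to the first rule in the list.

```python
import re

# One (pattern, callback name, follow) entry per Rule.
# The second entry leaves follow False, which is the default when a callback is given.
RULES = [
    (re.compile(r'type=4&page=\d+'), 'parse_item', True),
    (re.compile(r'question/\d+/\d+\.shtml'), 'parse_detail', False),
]


def route(url):
    """Return the callback name of the first rule whose pattern is found in url, else None."""
    for pattern, callback, _follow in RULES:
        if pattern.search(url):
            return callback
    return None


print(route('http://wz.sun0769.com/index.php/question/questionType?type=4&page=30'))
print(route('http://wz.sun0769.com/html/question/201908/426393.shtml'))
print(route('http://wz.sun0769.com/index.php'))
```

A URL that matches neither pattern is simply not requested, which is why the last call returns None.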

Example code

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem_second,SunproItem

With a single link extractor and a single rule parser

# No deep crawling: only the data on each page-number (list) page is scraped
class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
    # Link extractor
    link = LinkExtractor(allow=r'type=4&page=\d+')
    rules = (
        # Instantiate a Rule (rule parser) object
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()

            print(title,status)

With several of each, to implement deep crawling

# Implements deep crawling
class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
    # Link extractor for the page-number links
    link = LinkExtractor(allow=r'type=4&page=\d+')
    # Example detail-page URL: http://wz.sun0769.com/html/question/201908/426393.shtml
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')
    rules = (
        # Instantiate a Rule (rule parser) object
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            num = tr.xpath('./td[1]/text()').extract_first()
            item = SunproItem_second()
            item['title'] = title
            item['status'] = status
            item['num'] = num
            yield item

    def parse_detail(self,response):
        content = response.xpath('/html/body/div[9]/table[2]/tbody/tr[1]//text()').extract()
        content = ''.join(content)
        num = response.xpath('/html/body/div[9]/table[1]/tbody/tr/td[2]/span[2]/text()').extract_first()
        if num:
            num = num.split(':')[-1]
            item = SunproItem()
            item['content'] = content
            item['num'] = num
            yield item
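
Because the two callbacks run on different responses, neither item holds a complete record. Both carry num, which suggests it is meant as the key for joining list rows with detail pages. The article does not show the pipeline, so the following is only a sketch of one way to do that join: plain dicts stand in for SunproItem_second and SunproItem, all values are made up, merge_by_num is a hypothetical helper, and it assumes both callbacks yield exactly the same string for num.

```python
from collections import defaultdict

# Hypothetical scraped output; plain dicts stand in for the two item classes.
scraped = [
    {'num': '426393', 'title': 'Example title', 'status': 'Example status'},  # from parse_item
    {'num': '426393', 'content': 'Example detail text'},                      # from parse_detail
    {'num': '426394', 'title': 'Another title', 'status': 'Example status'},  # detail not seen yet
]


def merge_by_num(items):
    """Combine items that share the same num into one record per num."""
    merged = defaultdict(dict)
    for item in items:
        merged[item['num']].update(item)
    return dict(merged)


records = merge_by_num(scraped)
for num, record in records.items():
    print(num, sorted(record))
```

In a real project the same idea would live in an item pipeline, and num would probably need normalising (for example, stripping whitespace) before it is used as a key.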
