Simple Scrapy crawler not using the item pipeline

Problem Description

Background

I am trying to learn Scrapy by working through examples.

At this point, I have built a CrawlSpider that can navigate to a page, follow all of its links, extract data with CSS selectors, and populate an item using an item loader.

I am now trying to add an arbitrary pipeline, just so I can get one working.

My problem is that item pipelines for a CrawlSpider seem to require more definition than those used with a plain scrapy.Spider - I cannot find a working example of a pipeline for a CrawlSpider.

What my code actually does

It starts at the Lexus Wikipedia page and follows every link from that page to other Wikipedia pages. It then extracts the title of each page and the headings from its first table. These are stored in an item and then printed to a .txt file.

lexuswikicrawl.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from wikicars.items import WikiItem

#Global variables
pages = {}
failed_pages = 1
filename = 'wiki.txt'



class GenSpiderCrawl(CrawlSpider):
    name = 'lexuswikicrawl'

    #Start at the lexus wikipedia page, and only follow the wikipedia links
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lexus']

    #There are no specific rules specified below, therefore all links will be followed              
    rules = (Rule(LinkExtractor(), callback = 'parse_page'),)

    #Define what selectors to crawl
    def parse_page(self, response): 
        global pages
        global failed_pages

        #Try and capture the page title using CSS selector
        #If not, keep count of the amount of failed selectors
        try:
            pagename = (response.css('div#content.mw-body > h1#firstHeading::text').extract())[0]
        except:
            pagename = ('Failed pagename: '+ str(failed_pages))
            failed_pages += 1


        # Capture table categories that fall under the CSS selector for regular text items
        tabcat = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr > th::text').extract() 
        # Capture tale categories that fall under CSS selector for text with hyperlinks
        for i in range(20):
            tabcatlink1 = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr:nth-child('+str(i)+') > th > a::text').extract()
            tabcatlink2 = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr:nth-child('+str(i)+') > th > div > a::text').extract()
            if len(tabcatlink1) > 0:
                tabcat.append(tabcatlink1)
            else:
                pass
            if len(tabcatlink2) > 0:
                tabcat.append(tabcatlink2)
            else: 
                continue

        #Load 'pagename' and 'categories' into a new item

        page_loader = ItemLoader(item=WikiItem(), selector = tabcat)

        page_loader.add_value('title', pagename)
        page_loader.add_value('categories', tabcat)

        #Store the items in an overarching dictionary structure
        pages[pagename] = page_loader.load_item()

        #Try and print the results to a text document
        try:
            with open(filename, 'a+') as f:
                f.write('Page Name:'   + str(pages[pagename]['title'])+ '\n')
        except: 
            with open(filename, 'a+') as f:
                f.write('Page name error'+ '\n')
        try:
            with open(filename, 'a+') as f:
                f.write('Categories:'  + str(pages[pagename]['categories'])+ '\n')
        except:
            with open(filename, 'a+') as f:
                f.write('Table Category data not available'  + '\n')

items.py

import scrapy
from scrapy.loader.processors import TakeFirst, MapCompose

def convert_categories(categories):
    categories = (str(categories).upper().strip('[]'))
    return categories

def convert_title(title):
    title = title.upper()
    return title

class WikiItem(scrapy.Item):

    categories = scrapy.Field(
                              input_processor = MapCompose(convert_categories)
                              )

    title      = scrapy.Field(
                              input_processor = MapCompose(convert_title)
                              )
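
As an aside, TakeFirst is imported above but never used. By default load_item() returns each field as a list of the collected values; declaring an output_processor such as TakeFirst() alongside the input_processor collapses that list to a single value. A sketch of the same item with that added to title (categories is left as a list), reusing the convert_title and convert_categories functions defined above:

class WikiItem(scrapy.Item):

    categories = scrapy.Field(
                              input_processor = MapCompose(convert_categories)
                              )

    title      = scrapy.Field(
                              input_processor = MapCompose(convert_title),
                              output_processor = TakeFirst()
                              )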
pipelines.py

This is where I suspect the trouble is. My current thinking is that I need more than just process_item() to get my pipeline running. I have adapted the following example as best I can: https://docs.scrapy.org/en/latest/topics/item-pipeline.html.

from scrapy.exceptions import DropItem

class PipelineCheck(object):

    def process_item(self, item, spider):
        print('I am a pipeline this is an item:' + str(item) + '\n')
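
For reference, the item pipeline documentation linked above says that process_item() must either return the item (or a Deferred) or raise DropItem; the version above prints the item but returns None, so later pipelines would not receive it. A minimal sketch of the same check written against that contract:

from scrapy.exceptions import DropItem

class PipelineCheck(object):

    def process_item(self, item, spider):
        #Log the item for debugging purposes
        print('I am a pipeline, this is an item: ' + str(item) + '\n')
        #Returning the item hands it on to any later pipelines;
        #raising DropItem here would discard it instead
        return item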

settings.py

I have declared my pipeline and its priority. I have also declared a generic user agent. Are there any other variables I need to set?

BOT_NAME = 'wikicars'

SPIDER_MODULES = ['wikicars.spiders']
NEWSPIDER_MODULE = 'wikicars.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'wikicars.pipelines.PipelineCheck': 100,
}

Tags: python, web-scraping, scrapy, web-crawler, pipeline

Solution
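
The pipeline declared in settings.py is picked up correctly; the reason process_item() never runs is that parse_page() never returns or yields anything, so no items are ever sent to the item pipeline. Scrapy only routes items through ITEM_PIPELINES when the spider callback yields (or returns) them; writing them to wiki.txt inside the callback bypasses the pipeline entirely. A minimal sketch of the end of parse_page(), assuming the rest of the method stays as written above:

        #Load 'pagename' and 'categories' into a new item
        page_loader = ItemLoader(item=WikiItem(), selector=tabcat)
        page_loader.add_value('title', pagename)
        page_loader.add_value('categories', tabcat)

        #Yield the loaded item so it is passed through ITEM_PIPELINES
        yield page_loader.load_item()

With that change (and with process_item() returning the item, as noted under pipelines.py), PipelineCheck prints each item as it is scraped.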

