python - Simple Scrapy crawler does not use the item pipeline
Question
Background
I am trying to learn Scrapy by working through examples.
At this point, I have made a CrawlSpider that can navigate to a page, follow all links, extract data with CSS selectors, and populate items using an item loader.
I am now trying to add a simple pipeline, just so I can get one working.
My problem is that item pipelines for a CrawlSpider seem to require more definition than those for a scrapy.Spider, and I cannot find a working example of a pipeline used with a CrawlSpider.
What my code actually does
It starts at the Lexus Wikipedia page and follows all links from that page to other Wikipedia pages. It then extracts the page title and the headings of the first table on each page. These are stored in items and then printed to a .txt document.
lexuswikicrawl.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from wikicars.items import WikiItem

# Global variables
pages = {}
failed_pages = 1
filename = 'wiki.txt'

class GenSpiderCrawl(CrawlSpider):
    name = 'lexuswikicrawl'

    # Start at the Lexus Wikipedia page, and only follow Wikipedia links
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lexus']

    # No specific rules are given below, therefore all links will be followed
    rules = (Rule(LinkExtractor(), callback='parse_page'),)

    # Define what selectors to crawl
    def parse_page(self, response):
        global pages
        global failed_pages

        # Try to capture the page title using a CSS selector;
        # if that fails, keep count of the number of failed selectors
        try:
            pagename = (response.css('div#content.mw-body > h1#firstHeading::text').extract())[0]
        except IndexError:
            pagename = 'Failed pagename: ' + str(failed_pages)
            failed_pages += 1

        # Capture table categories that fall under the CSS selector for regular text items
        tabcat = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr > th::text').extract()

        # Capture table categories that fall under the CSS selector for text with hyperlinks
        for i in range(20):
            tabcatlink1 = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr:nth-child(' + str(i) + ') > th > a::text').extract()
            tabcatlink2 = response.css('#mw-content-text > div > table.infobox.vcard > tbody > tr:nth-child(' + str(i) + ') > th > div > a::text').extract()
            if len(tabcatlink1) > 0:
                tabcat.append(tabcatlink1)
            if len(tabcatlink2) > 0:
                tabcat.append(tabcatlink2)

        # Load 'pagename' and 'categories' into a new item
        page_loader = ItemLoader(item=WikiItem(), selector=tabcat)
        page_loader.add_value('title', pagename)
        page_loader.add_value('categories', tabcat)

        # Store the items in an overarching dictionary structure
        pages[pagename] = page_loader.load_item()

        # Try to print the results to a text document
        try:
            with open(filename, 'a+') as f:
                f.write('Page Name:' + str(pages[pagename]['title']) + '\n')
        except KeyError:
            with open(filename, 'a+') as f:
                f.write('Page name error' + '\n')
        try:
            with open(filename, 'a+') as f:
                f.write('Categories:' + str(pages[pagename]['categories']) + '\n')
        except KeyError:
            with open(filename, 'a+') as f:
                f.write('Table Category data not available' + '\n')
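One detail worth checking here (a general Scrapy rule, not something specific to CrawlSpider): pipelines only ever receive items that a callback yields or returns. The parse_page() above stores the loaded item in the pages dict and writes to a file, but never yields it, so under that reading the pipeline would never fire. The behavior can be sketched in plain Python with no Scrapy dependency; the names run_callback and DummyPipeline are illustrative stand-ins for the engine, not Scrapy APIs:

```python
# Crude stand-in for Scrapy's engine: it iterates over whatever the
# callback yields and hands each item to the pipeline.

class DummyPipeline:
    def process_item(self, item, spider):
        return item

def run_callback(callback, pipeline):
    # Feed every yielded item through the pipeline, like the engine does
    results = []
    for item in callback() or []:
        results.append(pipeline.process_item(item, spider=None))
    return results

def parse_page_storing():
    pages = {}
    pages['Lexus'] = {'title': 'LEXUS'}   # stored only: the engine sees nothing

def parse_page_yielding():
    yield {'title': 'LEXUS'}              # yielded: the engine passes it on

print(run_callback(parse_page_storing, DummyPipeline()))   # []
print(run_callback(parse_page_yielding, DummyPipeline()))  # [{'title': 'LEXUS'}]
```

In the real spider the equivalent change would be to end parse_page() with a yield of the loaded item instead of only storing it in the dict.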
items.py
import scrapy
from scrapy.loader.processors import TakeFirst, MapCompose

def convert_categories(categories):
    categories = str(categories).upper().strip('[]')
    return categories

def convert_title(title):
    title = title.upper()
    return title

class WikiItem(scrapy.Item):
    categories = scrapy.Field(
        input_processor=MapCompose(convert_categories)
    )
    title = scrapy.Field(
        input_processor=MapCompose(convert_title)
    )
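For reference, the two processors are plain functions, so their behavior can be sanity-checked outside of Scrapy. One thing to keep in mind (per the Scrapy docs) is that MapCompose applies the function to each collected value individually, not to the list as a whole, which is simulated here with an ordinary loop; the sample values are made up for illustration:

```python
# Standalone copy of the processor from items.py, checked without Scrapy.

def convert_categories(categories):
    # str() on a list gives e.g. "['Production', 'Founded']"; strip('[]')
    # removes only the outer brackets, leaving the inner quotes and commas
    return str(categories).upper().strip('[]')

# MapCompose applies the function per value, so simulate that with a loop.
# Note that tabcat in the spider mixes bare strings and appended sub-lists,
# so both kinds of value can reach the processor:
values = ['Manufacturer', ['Production', 'Founded']]
converted = [convert_categories(v) for v in values]
print(converted)   # ['MANUFACTURER', "'PRODUCTION', 'FOUNDED'"]
```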
pipelines.py
This is where I suspect the trouble lies. My current thinking is that I need more than just process_item() to get my pipeline running. I have rearranged the following example as best I can: https://docs.scrapy.org/en/latest/topics/item-pipeline.html.
from scrapy.exceptions import DropItem

class PipelineCheck(object):
    def process_item(self, item, spider):
        print('I am a pipeline this is an item:' + str(item) + '\n')
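As a side note on the pipeline itself: process_item() alone is all a pipeline needs, but by convention it should return the item (or raise DropItem), because whatever it returns is what later pipelines in ITEM_PIPELINES receive; the version above implicitly returns None. Since process_item() is a plain method, this can be exercised without Scrapy at all:

```python
# Sketch of the pipeline with the conventional return added.

class PipelineCheck:
    def process_item(self, item, spider):
        print('I am a pipeline, this is an item: ' + str(item))
        return item  # later pipelines receive this value instead of None

pipeline = PipelineCheck()
result = pipeline.process_item({'title': 'LEXUS'}, spider=None)
print(result)   # {'title': 'LEXUS'}
```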
settings.py
I have declared my pipeline and its priority. I have also declared a generic user agent. Are there other variables I need to set?
BOT_NAME = 'wikicars'
SPIDER_MODULES = ['wikicars.spiders']
NEWSPIDER_MODULE = 'wikicars.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'wikicars.pipelines.PipelineCheck': 100,
}
Solution