python - Scrapy crawler not following links
Problem description
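I'm having trouble figuring out why my secondary function does not follow the new links and then output data. The parse function works fine. When it calls back to parse_puppy, nothing happens. When I check the JSON output, I see that everything from puppy was scraped successfully, but not from parse_puppy.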
On line 28 (the response.follow_all call), if I change the method to follow, I get results, but it is the same result about a dozen times.
Code:
import scrapy
from scrapy.cmdline import execute


class Spider(scrapy.Spider):
    name = "puppyDetails"

    def start_requests(self):
        urls = ['https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # GRAB ALL TOPICAL PUPPY DETAILS
        for animal in response.css("div.list-animal-info-block"):
            yield {
                'puppy_name': animal.css('div.list-animal-name a::text').get(),
                'puppy_id': animal.css('div.list-animal-id::text').get(),
                'puppy_sex': animal.css('div.list-animal-sexSN::text').get(),
                'puppy_breed': animal.css('div.list-animal-breed::text').get(),
                'puppy_age': animal.css('div.list-animal-age::text').get(),
                'puppy_link': animal.css('div.list-animal-name a::attr(href)').get()
            }

        # DIVE INTO DETAILS PAGE
        detail_page = response.css('div.list-animal-name a::attr(href)').get()
        self.logger.info('get puppy details')

        # GO TO THE PUPPY DETAILS
        yield response.follow_all(detail_page, callback=self.parse_puppy)

    def parse_puppy(self, response):
        # GRAB PUPPY DETAILS
        for puppyDetails in response.xpath('//*[@class="detail-table"]//tr'):
            yield {
                'puppy_id': puppyDetails.xpath('//*[@id="lblID"]/text()').extract(),
                'puppy_status': puppyDetails.xpath('//*[@id="lblStage"]/text()').extract(),
                'puppy_intake_date': puppyDetails.xpath('//*[@id="lblIntakeDate"]/text()').extract()
            }


execute(['scrapy','crawl','puppyDetails'])
Error:
ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque>
Solution
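follow_all returns a generator of requests rather than a single request, which is what the error is complaining about, so the line should be: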
yield from response.follow_all(detail_page, callback=self.parse_puppy)
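To see why this one keyword matters, here is a minimal sketch in plain Python that runs without Scrapy. The function follow_all_stub is a hypothetical stand-in for response.follow_all, and the links are made up; the only assumption carried over from the question is that follow_all hands back a generator of requests, as the error message indicates.

```python
from typing import Iterable, Iterator


def follow_all_stub(urls: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for response.follow_all: lazily builds one 'request' per URL."""
    return (f"Request({url})" for url in urls)


def parse_with_yield(urls: Iterable[str]):
    # Yields the generator object itself: a single item that is not a request.
    yield follow_all_stub(urls)


def parse_with_yield_from(urls: Iterable[str]):
    # Delegates to the generator, so each request is yielded individually.
    yield from follow_all_stub(urls)


links = ["/puppy/1", "/puppy/2"]  # made-up relative links, for illustration only

broken = list(parse_with_yield(links))
fixed = list(parse_with_yield_from(links))

print(type(broken[0]).__name__)  # generator
print(fixed)                     # ['Request(/puppy/1)', 'Request(/puppy/2)']

# A single string is also an iterable, but of characters, not of URLs.
single = list(parse_with_yield_from("/p1"))
print(single)                    # ['Request(/)', 'Request(p)', 'Request(1)']
```

The last three lines point at something worth checking in the spider: detail_page comes from .get(), which returns one string, and anything that iterates over it will see individual characters. If the goal is to follow every detail link, passing the result of .getall() is probably closer to the intent, but that is an assumption about what the spider should do and not something the thread confirms.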