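python - Scrapy spider not yielding results in a loop?
Problem description
I have been trying to recursively scrape search results from a company register. It mostly works, but I noticed that many search results are missing from my export. When I tried scraping just one page, I saw that the spider does find the search results page, but then somehow tries to re-enter the page it is already on? It only does this for a few of the results. The first result is fine and gets yielded. I have checked my CSS paths and they are fine. Can you see why? Thanks a lot in advance.
Here is my error log: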
> 2019-05-13 08:25:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.companiesintheuk.co.uk/ltd/aw> (referer: https://www.companiesintheuk.co.uk/Company/Find?q=a)
> 2019-05-13 08:25:38 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.companiesintheuk.co.uk/ltd/aw> (referer: https://www.companiesintheuk.co.uk/Company/Find?q=a)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
>     yield next(it)
>   File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
>     for x in result:
>   File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
>     return (_set_referer(r) for r in result or ())
>   File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "/root/Desktop/zakaria/gov2/gov2/spiders/CYRecursive.py", line 41, in parse_details
>     'postal_code': re.sub('\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > a:nth-child(5) > span:nth-child(1)::text").extract_first())),
> TypeError: can only join an iterable
Here is my code:
import scrapy
import re
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = 'CYRecursive'
    start_urls = [
        'https://www.companiesintheuk.co.uk/Company/Find?q=a']

    def parse(self, response):
        for company_url in response.xpath('//div[@class="search_result_title"]/a/@href').extract():
            yield scrapy.Request(
                url=response.urljoin(company_url),
                callback=self.parse_details,
            )
        # next_page_url = response.xpath(
        #     '//li/a[@class="pageNavNextLabel"]/@href').extract_first()
        # if next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         callback=self.parse,
        #     )

    def parse_details(self, response):
        # Looping through the searchResult block and yielding it
        for i in response.css('div.col-md-6'):
            yield {
                'company_name': re.sub('\s+', ' ', ''.join(i.css('#content2 > strong:nth-child(2) > strong:nth-child(1) > div:nth-child(1)::text').get())),
                'address': re.sub('\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(1)::text").extract_first())),
                'location': re.sub('\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(3)::text").extract_first())),
                'postal_code': re.sub('\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > a:nth-child(5) > span:nth-child(1)::text").extract_first())),
            }
Solution
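No answer was recorded for this question, so what follows is only a reading of the traceback, not a confirmed fix. The TypeError is raised by ''.join(...): in Scrapy, extract_first() and get() return None when a selector matches nothing, and str.join cannot take None. The loop visits every div.col-md-6 on the page, and presumably only some of them contain the #content2 block, which would explain why the first result is yielded and later iterations fail. Since ''.join() applied to a plain string just returns that string, the join can be dropped. The helper below is a minimal sketch of a None-safe cleanup; the name clean_text is hypothetical, and the added strip() of leading and trailing whitespace goes slightly beyond what the original code does.

```python
import re

# Compiled once; a raw string avoids the invalid-escape warning for '\s'.
_WHITESPACE = re.compile(r'\s+')


def clean_text(value):
    """Collapse whitespace runs; return '' if the selector matched nothing.

    `value` is what extract_first() / get() returned: a string, or None
    when no element matched. (Hypothetical helper, not part of Scrapy.)
    """
    if value is None:
        return ''
    return _WHITESPACE.sub(' ', value).strip()


if __name__ == '__main__':
    # None is what the failing postal_code selector produced in the log.
    print(repr(clean_text(None)))
    print(repr(clean_text('  AW  LTD\n')))
```

In parse_details, each field would then become something like clean_text(i.css("...::text").get()). Alternatively, both get() and extract_first() accept a default argument, so .get(default='') also avoids the None. Either way, blocks without #content2 would yield items with empty fields, so it may be worth skipping those or selecting #content2 directly instead of looping over every div.col-md-6.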