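python - How do I scrape the results for every search item and return them?
Problem description
I have been trying to scrape some information from a company register. That works, but I want to repeat it for every result returned for a search term. I tried using a link extractor, but I could not get it to work.
The search results page is: https://www.companiesintheuk.co.uk/Company/Find?q=a
Scraping a single result from the search works (if I click through to one result item), but how do I repeat this for every result item?
Here is my code: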
import scrapy
import re
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = 'CYRecursive'
    start_urls = [
        'https://www.companiesintheuk.co.uk/ltd/a-2']

    def parse(self, response):
        # Looping through the searchResult block and yielding it
        for i in response.css('div.col-md-9'):
            for i in response.css('div.col-md-6'):
                yield {
                    'company_name': re.sub(r'\s+', ' ', ''.join(i.css('#content2 > strong:nth-child(2) > strong:nth-child(1) > div:nth-child(1)::text').get())),
                    'address': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(1)::text").extract_first())),
                    'location': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(3)::text").extract_first())),
                    'postal_code': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > a:nth-child(5) > span:nth-child(1)::text").extract_first())),
                }
Solution
import scrapy
import re
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = 'CYRecursive'
    start_urls = [
        'https://www.companiesintheuk.co.uk/Company/Find?q=a']

    def parse(self, response):
        # Follow the link of every company listed on the search results page
        for company_url in response.xpath('//div[@class="search_result_title"]/a/@href').extract():
            yield scrapy.Request(
                url=response.urljoin(company_url),
                callback=self.parse_details,
            )

        # Follow the "next page" link, if there is one, and parse it the same way
        next_page_url = response.xpath('//li/a[@class="pageNavNextLabel"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(
                url=response.urljoin(next_page_url),
                callback=self.parse,
            )

    def parse_details(self, response):
        # Looping through the searchResult block and yielding it
        for i in response.css('div.col-md-9'):
            for i in response.css('div.col-md-6'):
                yield {
                    'company_name': re.sub(r'\s+', ' ', ''.join(i.css('#content2 > strong:nth-child(2) > strong:nth-child(1) > div:nth-child(1)::text').get())),
                    'address': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(1)::text").extract_first())),
                    'location': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > span:nth-child(3)::text").extract_first())),
                    'postal_code': re.sub(r'\s+', ' ', ''.join(i.css("#content2 > strong:nth-child(2) > address:nth-child(2) > div:nth-child(1) > a:nth-child(5) > span:nth-child(1)::text").extract_first())),
                }
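The spider above passes every extracted href through response.urljoin before requesting it. As a rough illustration of why, here is a sketch using the standard library's urljoin, which resolves a relative link against the URL of the page it was found on. The helper name absolute_url is made up for this example, and the relative path is only assumed to resemble what the search page returns.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL (hypothetical helper)."""
    return urljoin(page_url, href)


search_page = 'https://www.companiesintheuk.co.uk/Company/Find?q=a'

# A relative href is resolved against the host of the search page.
print(absolute_url(search_page, '/ltd/a-2'))

# An href that is already absolute is returned unchanged.
print(absolute_url(search_page, 'https://www.companiesintheuk.co.uk/ltd/a-2'))
```

Both calls print the same absolute URL, so the request works whether the site emits relative or absolute links.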
Of course, you can use start_requests to automatically yield all searches from a to z.
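A minimal sketch of that idea, kept free of Scrapy so it can be run on its own: it only builds the 26 search URLs. The function name search_urls is hypothetical, and the sketch assumes that the q parameter accepts a single letter, as in the URL from the question.

```python
import string
from urllib.parse import urlencode

# Base URL taken from the question; only the q parameter changes.
SEARCH_URL = 'https://www.companiesintheuk.co.uk/Company/Find'


def search_urls():
    """Yield one search URL per letter, from a to z (hypothetical helper)."""
    for letter in string.ascii_lowercase:
        yield SEARCH_URL + '?' + urlencode({'q': letter})


for url in search_urls():
    print(url)
```

Inside the spider, start_requests would then replace start_urls and loop over these URLs, yielding scrapy.Request(url, callback=self.parse) for each one.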
Your CSS expressions are wrong; use XPath expressions like these in parse_details instead:
yield {
    'company_name': response.xpath('//div[@itemprop="name"]/text()').extract_first(),
    'address': response.xpath('//span[@itemprop="streetAddress"]/text()').extract_first(),
    'location': response.xpath('//span[@itemprop="addressLocality"]/text()').extract_first(),
    'postal_code': response.xpath('//span[@itemprop="postalCode"]/text()').extract_first(),
}
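Note that extract_first() returns None when a selector matches nothing, and ''.join(None) raises a TypeError, so the whitespace cleanup from the question fails as soon as one field is missing. If you still want to collapse whitespace in the extracted values, a small helper along these lines might do; the name clean is made up for this sketch.

```python
import re


def clean(value):
    """Collapse runs of whitespace; pass None through unchanged (hypothetical helper)."""
    if value is None:
        # Nothing was extracted, so there is nothing to clean.
        return None
    return re.sub(r'\s+', ' ', value).strip()


print(clean('  A  B\n'))
print(clean(None))
```

Each field in the yield could then be wrapped as clean(response.xpath(...).extract_first()).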