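python - Website is crawled but not scraped with Scrapy
Problem description
I have been crawling this website and trying to store the properties. Some properties do get scraped, but others are only crawled and never scraped: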
import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):
        for prop in response.css('div.col-sm-6.col-md-12.grid-sizer.grid-item'):
            link = prop.css('div.property-image a::attr(href)').get()
            bedrooms = prop.css('div.property-details li.bedrooms::text').getall()
            bathrooms = prop.css('div.property-details li.bathrooms::text').getall()
            gar = prop.css('div.property-details li.garages::text').getall()
            if len(bedrooms) == 0:
                bedrooms.append(None)
            else:
                bedrooms = bedrooms[1].split()
            if len(bathrooms) == 0:
                bathrooms.append(None)
            else:
                bathrooms = bathrooms[1].split()
            if len(gar) == 0:
                gar.append(None)
            else:
                gar = gar[1].split()
            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'title': ' '.join(prop.css('div.property-details p.intro::text').get().split()),
                    'price': ''.join(prop.css('div.property-details p.price::text').get().split()),
                    'bedrooms': str(bedrooms),
                    'bathroom': str(bathrooms),
                    'garages': str(gar)
                }},
                callback=self.get_loc,
            )
        next_page = response.css('p.form-control-static.pagination-link a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Any suggestions on how to make this work? Thanks very much in advance.
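Solution
The way you have defined your selectors is error-prone. A few of them are also faulty and do not match anything at all. The next-page link does not work either: the spider only visits page 1 and then quits. Finally, I am not aware of any way to select the next sibling text node with CSS selectors, so I had to dig the value out of the sibling text in a somewhat awkward manner.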
import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):
        for prop in response.css('.grid-item'):
            link = prop.css('.property-image a::attr(href)').get()
            # Collect the stripped text nodes and take the second-to-last one.
            # The length check is >= 2 so that [-2] cannot raise IndexError.
            bedrooms = [elem.strip() for elem in prop.css(".bedrooms::text").getall()]
            bedrooms = bedrooms[-2] if len(bedrooms) >= 2 else None
            bathrooms = [elem.strip() for elem in prop.css(".bathrooms::text").getall()]
            bathrooms = bathrooms[-2] if len(bathrooms) >= 2 else None
            gar = [elem.strip() for elem in prop.css(".garages::text").getall()]
            gar = gar[-2] if len(gar) >= 2 else None
            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'bedrooms': bedrooms,
                    'bathroom': bathrooms,
                    'garages': gar
                }},
                callback=self.get_loc,
            )
        # Follow the "next" pagination link, if there is one.
        next_page = response.css('.pagination-link a.next::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def get_loc(self, response):
        # The partially filled item travels with the request in meta.
        items = response.meta['item']
        print(items)
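The `[-2]` indexing depends on exactly how many text nodes `getall()` returns for each element. A minimal sketch of a more defensive alternative follows, assuming the number is the only non-blank text node directly under the element. The helper name `first_non_blank` is hypothetical and not part of Scrapy.

```python
def first_non_blank(texts):
    """Return the first non-blank string in texts, stripped, or None."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return None


# Inside parse(), it could be used like this (untested against the site):
#     bedrooms = first_non_blank(prop.css(".bedrooms::text").getall())
sample = ["\n  ", " 3 ", "\n"]  # made-up text nodes for illustration
print(first_non_blank(sample))
print(first_non_blank([]))
```

This returns `None` both when the element is missing and when it holds only whitespace, so no index can go out of range.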
If you want a cleaner way to get those three fields, I think XPath is what you should stick with:
for prop in response.css('.grid-item'):
    link = prop.css('.property-image a::attr(href)').get()
    bedrooms = prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()
    bathrooms = prop.xpath("normalize-space(.//*[contains(@class,'bathrooms')]/label/following::text())").get()
    gar = prop.xpath("normalize-space(.//*[contains(@class,'garages')]/label/following::text())").get()
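The XPath expressions read the text that follows a `<label>` element. To illustrate that idea without a Scrapy response, here is a small standard-library sketch. The markup in `SNIPPET` is a hypothetical stand-in for the listing HTML, not copied from the site, and `value_after_label` is an illustrative helper. In ElementTree, the text after an element's closing tag is exposed as that element's `tail`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the number is a bare text node after the <label>.
SNIPPET = """
<ul>
  <li class="bedrooms"><label>Bedrooms</label> 3 </li>
  <li class="bathrooms"><label>Bathrooms</label> 2 </li>
</ul>
"""


def value_after_label(root, css_class):
    """Return the stripped text that follows the <label>, or None."""
    # ElementTree only supports exact attribute matches, unlike contains().
    label = root.find(f".//li[@class='{css_class}']/label")
    if label is None or label.tail is None:
        return None
    return label.tail.strip() or None


root = ET.fromstring(SNIPPET.strip())
bedrooms = value_after_label(root, "bedrooms")
garages = value_after_label(root, "garages")  # not present in SNIPPET
print(bedrooms, garages)
```

A missing element yields `None` rather than an exception, which is the same outcome the `normalize-space(...)` approach approximates by returning an empty string.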
For brevity, I have left two or three fields out of the code above; I suppose you can manage those yourself.
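The omitted `title` and `price` fields from the question could be restored along these lines. This is a sketch that has not been run against the site: `clean_text` is a hypothetical helper, and the selectors in the comments are copied from the question, so they may need adjusting. Unlike calling `.get().split()` directly, it tolerates `get()` returning `None` when a listing lacks the element.

```python
def clean_text(value, sep=" "):
    """Collapse runs of whitespace in value; pass None through unchanged."""
    if value is None:
        return None
    return sep.join(value.split())


# Inside the loop in parse(), assuming the question's selectors still match:
#     title = clean_text(prop.css("div.property-details p.intro::text").get())
#     price = clean_text(prop.css("div.property-details p.price::text").get(), sep="")
print(clean_text("  Sea view\n   apartment "))  # made-up sample values
print(clean_text("R 4 500 000", sep=""))
```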