python - Scraping gets only the first record with Scrapy in Python
Problem description
I'm new to Scrapy and to Python in general. This is my first attempt at scraping a website.
import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    allowed_domains = ['https://www.hamburg.de']
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']
    custom_settings = {
        'FEED_EXPORT_FORMAT': 'utf-8'
    }

    def parse(self, response):
        items = response.xpath("//div[starts-with(@class, 'item')]")
        for item in items:
            business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
            address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
            address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
            phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()
            yield {
                'Business Name': business_name,
                'Address1': address1,
                'Address2': address2,
                'Phone Number': phone
            }
        next_page_url = 'https://www.hamburg.de' + response.xpath("//li[@class='next']/a/@href").get()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
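The code runs, but the page I'm scraping has 20 records. The spider yields 20 items, but every one of them is the first record. It doesn't return the 20 distinct records.
As for pagination, I also tried putting this inside the for block, but it didn't work either: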
next_page_url = response.xpath("//li[@class='next']/@href").get()
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
    yield scrapy.Request(url=next_page_url, callback=self.parse)
These are the debug results:
{'Business Name': ' A & Z Kfz Meisterbetrieb GmbH ', 'Address1': ' Anckelmannstraße 13', 'Address2': ' 20537 Hamburg (Borgfelde) ', 'Phone Number': '040 / 236 882 10 '}
2020-11-10 19:55:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/>
{'Business Name': ' A+B Automobile ', 'Address1': ' Kuehnstraße 19', 'Address2': ' 22045 Hamburg (Tonndorf) ', 'Phone Number': '040 / 696 488-0 '}
2020-11-10 19:55:10 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hamburg.de': <GET https://www.hamburg.de/branchenbuch/hamburg/10239785/n20/>
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-10 19:55:10 [scrapy.extensions.feedexport] INFO: Stored json feed (20 items) in: output.json
2020-11-10 19:55:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 247,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 50773,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.222001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 11, 10, 17, 55, 10, 908399),
'item_scraped_count': 20,
'log_count/DEBUG': 22,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 11, 10, 17, 55, 8, 686398)}
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Spider closed (finished)
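Solution
The problem is this line:
item.xpath("//h3[@class='h3rb']/text()").get()
When you query a nested selector in Scrapy, the XPath has to start with ".//" rather than "//". A path that begins with "//" is evaluated from the document root, so every item returns the first match on the whole page; the leading dot makes the path relative to the current item. Try changing your code as follows: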
business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()
Hope it works as you want.
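The log in the question also contains "Filtered offsite request to 'www.hamburg.de'", which suggests the request for the second page was dropped because allowed_domains holds a full URL rather than a bare host name. Separately, prepending 'https://www.hamburg.de' to the result of .get() raises a TypeError on the last page, where .get() returns None. The sketch below shows both URL details using only the standard library. The helper names are hypothetical, and the root-relative form of the href is an assumption inferred from the question's code.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def domain_for_allowed_domains(url: str) -> str:
    """Return the bare host name, the form allowed_domains expects."""
    return urlparse(url).netloc


def next_page_url(page_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve the 'next' link against the current page, or return None when there is none."""
    if not href:
        return None
    return urljoin(page_url, href)


page = "https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/"

print(domain_for_allowed_domains("https://www.hamburg.de"))
# www.hamburg.de
print(next_page_url(page, "/branchenbuch/hamburg/10239785/n20/"))
# https://www.hamburg.de/branchenbuch/hamburg/10239785/n20/
print(next_page_url(page, None))
# None
```

In the spider itself this would correspond to allowed_domains = ['www.hamburg.de'] and to checking the href for None before passing it to response.urljoin, which already resolves relative links against the page URL, so the manual prefix is not needed.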