How can I get my crawler to fetch these links?

Problem description

I am trying to get the links from the Scorecard column on this page...

https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground

I am using a CrawlSpider, and I am trying to access the links with this xpath expression....

"//tbody//tr[@class='data1']//td[last()]//a[@class='data-link']" 

This expression works in the scrapy shell and fetches all 48 links. When I run the spider, though, it scrapes nothing.
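
For reference, this is roughly the shell session I mean (a sketch; the trailing /@href is only there to print the urls, the rule itself does not use it):

scrapy shell "https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground"
>>> links = response.xpath("//tbody//tr[@class='data1']//td[last()]//a[@class='data-link']/@href").getall()
>>> len(links)
48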

I have tried 20 different xpath expressions, all to no avail. I have also tried using 'allow' and css selectors. I believe I should not include @href, since the CrawlSpider's LinkExtractor handles that itself.

I am confused, because I have a very similar crawler that works fine.

Here is the full code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class IntlistmakerSpider(CrawlSpider):
    name = 'intlistmaker'
    allowed_domains = ['www.espncricinfo.com']
    start_urls = 'https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground'
    
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody//tr[@class='data1']//td[last()]//a[@class='data-link']"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        raw_url = response.url

        yield {
            'url': raw_url,
        }

Output

2021-05-25 18:03:07 [scrapy.core.engine] INFO: Spider opened
2021-05-25 18:03:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-25 18:03:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-05-25 18:03:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stats.espncricinfo.com/robots.txt> (referer: None)
2021-05-25 18:03:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground> (referer: None)
2021-05-25 18:03:08 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stats.espncricinfo.com': <GET https://stats.espncricinfo.com/uae/engine/match/439500.html>
2021-05-25 18:03:08 [scrapy.core.engine] INFO: Closing spider (finished)
2021-05-25 18:03:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 524,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 18522,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.528078,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 5, 25, 17, 3, 8, 872122),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'offsite/domains': 1,
 'offsite/filtered': 48,
 'request_depth_max': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 5, 25, 17, 3, 7, 344044)}
2021-05-25 18:03:08 [scrapy.core.engine] INFO: Spider closed (finished)

Here is the working spider:

class ListmakerSpider(CrawlSpider):
    name = 'listmaker'
    allowed_domains = ['www.espncricinfo.com']
    start_urls = [psl21]
    

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@data-hover='Scorecard']"), callback='parse_item', follow=True),
    )

That spider successfully extracts the scorecard links from this page...

https://www.espncricinfo.com/series/ipl-2021-1249214/match-results

Could anyone please suggest how to change the xpath expression in the first example so that I can isolate and retrieve the scorecard urls?

Thanks in advance.

Tags: python, xpath, scrapy

Solution


The key line in the log is this one:

2021-05-25 18:03:08 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stats.espncricinfo.com': <GET https://stats.espncricinfo.com/uae/engine/match/439500.html>

You have set allowed_domains to "www.espncricinfo.com", which does not match "stats.espncricinfo.com". Changing allowed_domains to "espncricinfo.com" solves the problem.
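
In code that is a one-line change (a sketch; Scrapy's offsite middleware treats each entry in allowed_domains as matching that domain and all of its subdomains):

# The bare registered domain also matches subdomains
# such as stats.espncricinfo.com
allowed_domains = ['espncricinfo.com']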

In the version of scrapy I am using, start_urls must be a list, so you should fix that as well.

Your xpath should work now. Try to keep such expressions as simple as possible in the future. In this case, a working css selector could be ".data1 > td:last-of-type > a".
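
Putting the three fixes together, a corrected spider could look like this (an untested sketch based on the code in the question; the restrict_css rule is interchangeable with the original restrict_xpaths one):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class IntlistmakerSpider(CrawlSpider):
    name = 'intlistmaker'
    # Bare domain, so stats.espncricinfo.com is no longer filtered as offsite
    allowed_domains = ['espncricinfo.com']
    # start_urls must be a list, not a bare string
    start_urls = [
        'https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground',
    ]

    rules = (
        # Selects the same scorecard links as the original xpath,
        # via a simpler css selector
        Rule(LinkExtractor(restrict_css='.data1 > td:last-of-type > a'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        yield {'url': response.url}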
