Stuck in a loop / wrong XPath when scraping a website

Problem description

I am trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM

I wrote the following script for the initial data:

import scrapy


class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = rows.xpath("(//tr/td[@class='time']/span)[1]/text()").get()
            time = rows.xpath("//tr/td[@class='time']/span/time/text()").get()

            yield{
                'day': day,
                'time': time,
            }

But the data I get is repeated, as if the for loop were not moving from one row to the next:

PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia> scrapy crawl waiascrap
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: waia)
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19042-SP0
2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-20 15:25:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['waia.spiders']}
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-08-20 15:25:11 [scrapy.core.engine] INFO: Spider opened
2021-08-20 15:25:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None)
2021-08-20 15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}

Edit:

It works now. The problems were the mistake @Prophet pointed out and an error in my XPath.

My working code is below:

import scrapy


class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = row.xpath(".//td[@class='time']/span/text()").get()
            time = row.xpath(".//td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
                }

Tags: python-3.x, web-scraping, xpath, scrapy

Solution


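To select elements inside another element, you must put a dot (.) in front of the XPath expression, which means "start from here".
Otherwise, (//tr/td[@class='time']/span)[1]/text() returns the first match on the whole page every time, which is what you are seeing.
Also, since you are iterating over each row, it should be row.xpath(...), not rows.xpath(...): rows is the list of all matched elements, while row is the single element for the current iteration.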
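Finally, keep calling the xpath method here. Scrapy selectors expose xpath (and css), whereas find_element_by_xpath is a Selenium WebDriver method and does not exist on a Scrapy selector. With the relative paths and row.xpath in place:

def parse(self, response):
    rows = response.xpath("//tr")
    for row in rows:
        # The leading "." makes each path relative to the current row.
        day = row.xpath(".//td[@class='time']/span/text()").get()
        time = row.xpath(".//td[@class='time']/span/time/text()").get()

        yield {
            'day': day,
            'time': time,
        }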
