Why can't I scrape the data inside this <ul>?

Problem description

I am trying to scrape the reviews inside the <ul> with class="review_list". There are 10 reviews, each in an <li> with class="review_list_new_item_block".

(The original post included two screenshots here, the second showing the data inside the first <li> tag.)

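However, I noticed that I cannot scrape most of the data inside these <ul><li> tags, even though I always use the same XPath logic. For example, I tried the following XPaths to scrape the title, text, language, review date, and stay date: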

title = response.xpath('//h3[@class="c-review-block__title"]/text()').extract()
#title = response.xpath('//div[@class="c-review-block__row"]//h3/text()')

text = response.xpath('//span[@class="c-review__prefix c-review__prefix--color-green"]/span[2]/text()').extract()

lang = response.xpath('//span[@class="c-review__prefix c-review__prefix--color-green"]/span[2]/@lang').extract()

reviewdate = response.xpath('//span[@class="c-review-block__date"]/text()').extract()

staydate = response.xpath('//div[@class="c-review-block__room-info__name"]/div/span/text()').extract()

Only the XPaths for these two items work:

author = response.xpath('//span[@class="bui-avatar-block__title"]/text()').extract()
authorcountry = response.xpath('//span[@class="bui-avatar-block__subtitle"]/text()').extract()

Do you have any suggestions? Is this a problem with the way I use XPath, or does booking.com restrict access to this part of the HTML? Thanks in advance!

My script:

import scrapy

class BookingSpider(scrapy.Spider):
    name = 'booking-spider'
    allowed_domains = ['booking.com']
    # start with the page of all countries
    start_urls = [
        'https://www.booking.com/country.de.html?aid=356980;label=gog235jc-1DCAIoLDgcSAdYA2gsiAEBmAEHuAEHyAEP2AED6AEB-AECiAIBqAIDuAK7q7DyBcACAQ;sid=8de61678ac61d10a89c13a3941fd3dcd'
    ]

    # get country page
    def parse(self, response):

        for countryurl in response.xpath('normalize-space(//a[contains(text(),"Schweiz")]/@href)'):
            url = response.urljoin(countryurl.extract())
            yield scrapy.Request(url, callback=self.parse_country)

    # get page of all hotels in a country
    def parse_country(self, response):

        for hotelsurl in response.xpath('normalize-space(//a[@class="bui-button bui-button--secondary"]/@href)'):
            url = response.urljoin(hotelsurl.extract())
            yield scrapy.Request(url, callback=self.parse_allhotels)

    # get page of one hotel
    def parse_allhotels(self, response):

        for hotelurl in response.xpath('normalize-space(//a[@class="hotel_name_link url"]/@href)'):
            url = response.urljoin(hotelurl.extract())
            yield scrapy.Request(url, callback=self.parse_hotelpage)

        next_page = response.xpath('//a[contains(@class,"paging-next") and contains(@title,"Nächste Seite")]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_allhotels)

    # get review page of this hotel
    def parse_hotelpage(self, response):

        reviewsurl = response.xpath('//a[@class="hp_nav_reviews_link toggle_review track_review_link_zh"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        new_url = url.replace('blockdisplay4', 'tab-reviews')
        yield scrapy.Request(new_url, callback=self.parse_reviews, dont_filter=True)

    # parse its reviews
    def parse_reviews(self, response):

        author = response.xpath('//span[@class="bui-avatar-block__title"]/text()').extract()
        authorcountry = response.xpath('//span[@class="bui-avatar-block__subtitle"]/text()').extract()

        title = response.xpath('//div[@class="c-review-block"]//div[@class="c-review-block__row"]//h3/text()').extract()
        print(title)

Tags: html, xpath

Solution


You can try the following XPaths.

Title:

//div[@class='c-review-block']//div[@class="c-review-block__row"]//h3/text()

Text (including both the positive and the negative text):

//div[@class='c-review-block']//div[@class='c-review-block__row'][3]//text()

Review date:

//div[@class='c-review-block']//div[@class='c-review-block__row']//span[@class="c-review-block__date"]/text()

Stay date:

//div[@class='c-review-block']//div[@class='c-review-block__room-info']//span[@class="c-review-block__date"]/text()

Subtitle (the author's country):

//div[@class='c-review-block']//span[@class="bui-avatar-block__subtitle"]/text()
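A note on the text query above: because it ends in //text(), it returns every text node below the matched row, including whitespace-only fragments between tags, so the extracted list usually needs to be joined and stripped. A minimal sketch of that cleanup, assuming the fragments arrive as a plain list of strings (as .extract() returns them); the helper name clean_text and the sample fragments are hypothetical:

```python
def clean_text(fragments):
    """Join text-node fragments into one string, dropping whitespace-only pieces."""
    parts = (fragment.strip() for fragment in fragments)
    return " ".join(part for part in parts if part)


# Hypothetical fragments, shaped like the output of .extract() on a //text() query.
raw = ["\n", "Gefallen hat mir", "\n  ", " Die Lage war super. ", "\n"]
cleaned = clean_text(raw)
print(cleaned)
```

In the spider this could be applied as clean_text(response.xpath(...).extract()), with the text XPath in place of the dots.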

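You have to select the review nodes using //div[@class='c-review-block'] and then iterate over them to get the details. If you are iterating over each review, simply replace //div[@class='c-review-block'] with . in each XPath so that it is evaluated in the context of that review.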
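The per-review iteration can be sketched as follows. To keep the example runnable without Scrapy, it uses Python's standard-library xml.etree.ElementTree as a stand-in; ElementTree supports only a subset of XPath (for instance, no text() step), so the element's .text attribute is read instead. The sample markup is a simplified, hypothetical imitation of the page, and the helper names first_text and parse_reviews are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real booking.com page is more complex.
SAMPLE = """
<ul class="review_list">
  <li class="review_list_new_item_block">
    <div class="c-review-block">
      <span class="bui-avatar-block__title">Anna</span>
      <span class="bui-avatar-block__subtitle">Schweiz</span>
      <div class="c-review-block__row">
        <span class="c-review-block__date">1. Januar 2020</span>
        <h3 class="c-review-block__title">Sehr gut</h3>
      </div>
    </div>
  </li>
  <li class="review_list_new_item_block">
    <div class="c-review-block">
      <span class="bui-avatar-block__title">Ben</span>
      <span class="bui-avatar-block__subtitle">Deutschland</span>
      <div class="c-review-block__row">
        <span class="c-review-block__date">2. Januar 2020</span>
      </div>
    </div>
  </li>
</ul>
"""


def first_text(node, path):
    """Return the stripped text of the first match of `path` under `node`, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_reviews(root):
    """Collect one dict per review, querying each field relative to its review node."""
    results = []
    # Select every review node once; the paths below start with "." so they
    # are evaluated inside the current review rather than the whole document.
    for review in root.findall(".//div[@class='c-review-block']"):
        row = ".//div[@class='c-review-block__row']"
        results.append({
            "author": first_text(review, ".//span[@class='bui-avatar-block__title']"),
            "country": first_text(review, ".//span[@class='bui-avatar-block__subtitle']"),
            "title": first_text(review, row + "//h3"),
            "review_date": first_text(review, row + "//span[@class='c-review-block__date']"),
        })
    return results


reviews = parse_reviews(ET.fromstring(SAMPLE))
for item in reviews:
    print(item)
```

The second sample review has no title, and its "title" entry comes back as None while the other fields stay attached to the right review. Running document-wide queries and extracting separate lists would instead produce lists of different lengths that no longer line up. In the Scrapy spider the same pattern would be a loop over response.xpath("//div[@class='c-review-block']") with calls such as review.xpath(".//h3/text()").extract_first() inside it.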
