Xpath is correct but no result after scraping

Problem description

I am trying to scrape the names of all the cities listed on the following page: https://www.zomato.com/directory.

I have tried the following XPath expressions.

python
# 1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

# 2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city

Actual result: crawled 0 pages and scraped 0 items
Expected: Adelaide, Ballarat, etc.

Tags: xpath, web-scraping, scrapy

Solution


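First, note:
Your xpath is a bit too specific. CSS classes in HTML do not always come in a reliable order. "class1 class2" could end up as "class2 class1", or the attribute could even contain sloppy syntax such as a trailing space: "class1 class2 ".

When you match an xpath directly against [@class="class1 class2"], the attribute has to equal that string exactly, so there is a good chance it will fail. Instead, you should try using the contains function.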

Second:
There is a small mistake in your cities_name xpath. In the HTML body the nesting is a>h2>text, while in your code it is reversed: h2>a>text.

That said, I managed to get it working with these css and xpath selectors:

$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736

parselcli - https://github.com/Granitosaurus/parsel-cli

