Scrapy, XPath: extracting the content after an h3?

Question

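I need to extract everything that comes after the AIRFRAME h3 heading but before the ENGINES h3 heading.

What I need to extract:

"Entry Into Service: December 2010", "Total Time Since New: 3,580 Hours", and so on.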

Screenshot of the HTML (I don't know how to embed it directly instead of linking to it).

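Below is what I tried, but it returns nothing. I am new to Scrapy and to programming in general, so I would appreciate some help. I have searched other posts and Google without any luck.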

input = response.xpath("//div[@class='large-6 cell selectorgadget_rejected']/h3/text()").extract()

Output: []

Tags: html, python-3.x, xpath, web-scraping, scrapy

Answer


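To complement @renatodvc's answer, you can add a normalize-space() predicate so that whitespace-only text nodes are skipped:

//div[@class='large-6 cell selectorgadget_selected']/text()[normalize-space()]

Or apply the function directly to the element:

normalize-space(//div[@class='large-6 cell selectorgadget_selected'])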

Output:

AIRFRAME " Entry Into Service: December 2010" " Total Time Since New: 3,580 Hours" " Total Landings Since New: 1,173" " (as of September 2019)" " Program Coverage: Enrolled on Smart Parts Plus" " Maintenance Tracking: CAMP "
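To make the difference between the two XPath expressions more concrete, here is a rough pure-Python model of what normalize-space() does. The helper name normalize_space and the sample text nodes are assumptions made up for illustration (the real page markup is not shown in this thread), and no Scrapy call is involved.

```python
import re

# XPath 1.0 treats only space, tab, CR and LF as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return _XPATH_WS.sub(" ", value).strip(" ")


# Hypothetical text nodes, roughly what a <div> with line breaks
# between its entries might yield for text().
text_nodes = [
    "\n      ",
    "\n      Entry Into Service: December 2010",
    "\n      ",
    "\n      Total Landings Since New: 1,173\n    ",
]

# text()[normalize-space()] keeps only the nodes whose normalized value
# is non-empty; the kept nodes themselves are not trimmed.
kept = [node for node in text_nodes if normalize_space(node)]

# normalize-space(<element>) works on the element's whole string value
# and returns one cleaned-up string.
joined = normalize_space("".join(text_nodes))

print(kept)
print(joined)
```

The predicate form filters nodes and leaves you with a list, while the function form returns a single string. When normalize-space() is given a node-set, XPath 1.0 uses only the first node in document order, so the second expression describes one div at most.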

Then, to extract the values, you can use a regex:

import re
text = 'AIRFRAME " Entry Into Service: December 2010" " Total Time Since New: 3,580 Hours" " Total Landings Since New: 1,173" " (as of September 2019)" " Program Coverage: Enrolled on Smart Parts Plus" " Maintenance Tracking: CAMP "'
# Capture whatever sits between a colon and the next double quote, then trim it.
data = [el.strip() for el in re.findall(r':(.+?)"', text)]
print(data)

Output:

['December 2010', '3,580 Hours', '1,173', 'Enrolled on Smart Parts Plus', 'CAMP']
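If you also want to know which label each value belongs to, one possible variation is to capture the label together with the value and build a dict. This is only a sketch: the sample string is a shortened copy of the output above, and the pattern assumes every field looks like "label: value" wrapped in double quotes, which may not hold for other listings.

```python
import re

# Shortened sample of the normalize-space() output shown earlier.
text = (
    'AIRFRAME " Entry Into Service: December 2010" '
    '" Total Landings Since New: 1,173" '
    '" (as of September 2019)" '
    '" Maintenance Tracking: CAMP "'
)

# Group 1 is the label before the colon, group 2 the value after it.
# Quoted chunks without a colon, such as "(as of September 2019)", are skipped.
pairs = re.findall(r'"\s*([^":]+?)\s*:\s*([^"]*?)\s*"', text)
record = dict(pairs)

print(record)
```

A dict keeps each value tied to its label, so a missing field does not shift the positions of the others the way it would in a flat list.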
