Can't get an empty string into the list from XPath

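Problem description

I'm a complete beginner practicing site parsing on tasks taken from Upwork. The problem is this: the list of item names comes back complete, but the price list does not, because a new aircraft listing has no price text. I then need to combine the lists into a table, and if the lists don't have the same number of elements, everything goes wrong.

Please help me understand how to handle such cases so that an empty string ends up in the final list when the price is missing. Thanks in advance for your answers.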

import requests
import lxml.html


def parse_data(url):
    try:
        response = requests.get(url)
    except:
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
    price_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]/text()')
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))


def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)


if __name__ == "__main__":
    main()

Tags: python, web-scraping

Solution

One option is to split the parsing into two steps.

Step 1: extract the elements. Step 2: extract the text from each element.

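When an element contains no text, its .text attribute is None, so the list comprehension puts None at that position and the price list keeps the same length as the list of names.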

import requests
import lxml.html


def parse_data(url):
    try:
        response = requests.get(url)
    except requests.RequestException:  # network or HTTP-layer failure
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
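    # Step 1: select the price elements themselves, not their text() nodes,
    # so that an empty element still produces one entry.
    price_aicraft_elements = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]')
    # Step 2: read each element's text; .text is None when the element is empty.
    price_aicraft = [element.text for element in price_aicraft_elements]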
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))


def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)


if __name__ == "__main__":
    main()

Output:

['Dassault Falcon 50EX ', 'Cessna Citation M2 ', 'Embraer Phenom 300 ', 'Bombardier Learjet 40XR ', 'Embraer Legacy 600 ', 'Cessna Citation Sovereign ', 'Cessna Citation Ultra ', 'Cessna Citation Ultra ', 'Airbus ACJ318 ', 'Gulfstream G550 ', 'Boeing 737 -500', 'Boeing BBJ ', 'Hawker 800XP ', 'Boeing 737 ', 'Bombardier Learjet 55 ', 'Bombardier Challenger 300 ', 'Airbus ACJ TwoTwenty ', 'Gulfstream G200 ', 'Bombardier Learjet 60XR ', 'Cessna Citation Mustang ']
20
['Deal pending', 'Please call ', 'Please call ', 'Please call ', 'Please call ', 'Price: USD $6,500,000', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Price: USD $3,500,000', 'Please email', 'Make offer', None, 'Please call ', 'Deal pending', 'Price: USD $1,200,000']
20
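
The question asked for an empty string rather than None in the missing slot. A minimal sketch of one way to get that is shown here; it is a variation on the two-step idea, not the answer's own code. It loops over each card and reads the name and price relative to that card, so the two values can never drift out of step. The sample markup, the "price" class, and the helper names first_text and parse_cards are all assumptions made for illustration; the real page needs the XPath expressions from the code above.

```python
import lxml.html

# Hypothetical markup standing in for the real listing page (assumption).
SAMPLE_HTML = """
<div>
  <div id="item_card_1"><h2><a>Gulfstream G200 </a></h2><div class="price">Please call </div></div>
  <div id="item_card_2"><h2><a>Airbus ACJ TwoTwenty </a></h2><div class="price"></div></div>
  <div id="item_card_3"><h2><a>Cessna Citation Mustang </a></h2><div class="price">Price: USD $1,200,000</div></div>
</div>
"""


def first_text(card, path):
    """Return the stripped text of the first match under card, or '' if there is none."""
    matches = card.xpath(path)
    if not matches:
        return ""
    # .text is None for an empty element, so fall back to an empty string.
    return (matches[0].text or "").strip()


def parse_cards(html):
    """Return one (name, price) tuple per card; price is '' when it is missing."""
    tree = lxml.html.document_fromstring(html)
    rows = []
    for card in tree.xpath('//*[contains(@id, "item_card")]'):
        name = first_text(card, "./h2/a")
        price = first_text(card, './div[@class="price"]')
        rows.append((name, price))
    return rows


rows = parse_cards(SAMPLE_HTML)
for row in rows:
    print(row)
```

If the two separate lists from the answer are kept instead, the same fallback works inside the comprehension: (element.text or '') in place of element.text.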
