How to extract text from an unordered list using Selenium

Question

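I am learning Selenium. I am trying to extract the manufacturer information from a product page on the Amazon website.

On that page, the Manufacturer information sits in an unordered list. How can I extract it with Selenium?

I tried this code, but it doesn't seem to work: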

try:
    manufacturer_element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.XPATH, "//ul//span[text()='Manufacturer']/ancestor::li")))

    manufacturer_text = manufacturer_element.text.split(':')[1].strip()
    return manufacturer_text

except TimeoutException:
    return None

This is how the list is structured:

<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
    <li><span class="a-list-item">
            <span class="detail-bullet-label a-text-bold">Is Discontinued By Manufacturer
            :
            </span>
            <span>No</span>
        </span></li>
    
    <li><span class="a-list-item">
            <span class="detail-bullet-label a-text-bold">Package Dimensions
            :
            </span>
            <span>10 x 4 x 4 inches</span>
        </span></li>
    
    <li><span class="a-list-item">
            <span class="detail-bullet-label a-text-bold">Item model number
            :
            </span>
            <span>BHBUSWA2918</span>
        </span></li>

    <li><span class="a-list-item">
        <span class="detail-bullet-label a-text-bold">UPC
        :
        </span>
        <span>874989001644</span>
    </span></li>

    <li><span class="a-list-item">
        <span class="detail-bullet-label a-text-bold">Manufacturer
        :
        </span>
        <span>Wonder Bread</span>
    </span></li>

    <li><span class="a-list-item">
        <span class="detail-bullet-label a-text-bold">ASIN
        :
        </span>
        <span>B0038EUT9W</span>
    </span></li>
</ul>

From the list above, I want to extract "Wonder Bread".

Thanks in advance.

Tags: python, selenium, web-scraping

Answer

Try locating the element with By.CSS_SELECTOR:

try:
    # Wait up to 5 seconds for the fifth <li> of the detail bullets list
    manufacturer_element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div#detailBullets_feature_div > ul > li:nth-child(5)")))

    # Keep only the part of the item's text that follows the colon
    manufacturer_text = manufacturer_element.text.split(':')[1].strip()
    return manufacturer_text

except TimeoutException:
    return None

In the code above, li:nth-child(5) refers to the Manufacturer item, which is the fifth <li> in the list shown in the question.
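
Because li:nth-child(5) depends on the item's position, it may pick a different field on a page whose list is ordered differently. Below is a minimal sketch of one position-independent alternative: read every list item and build a label-to-value dictionary. The function names parse_detail_bullets and get_detail_bullets are hypothetical, and the selector "ul.detail-bullet-list > li" is an assumption based on the class shown in the question's HTML.

```python
def parse_detail_bullets(item_texts):
    """Turn strings such as 'Manufacturer : Wonder Bread' into a dict."""
    details = {}
    for raw in item_texts:
        # Split on the first colon only, so values containing ':' stay whole
        label, sep, value = raw.partition(":")
        if not sep:
            continue  # no colon, so this is not a label/value pair
        # Collapse the line breaks and padding around the label and the value
        details[" ".join(label.split())] = " ".join(value.split())
    return details


def get_detail_bullets(driver, timeout=5):
    """Hypothetical helper: collect all detail bullets from the current page."""
    # Selenium is imported here so the parser above stays usable without it
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumed selector, taken from the class on the <ul> in the question
    items = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "ul.detail-bullet-list > li")
        )
    )
    return parse_detail_bullets(item.text for item in items)
```

With a driver already on the product page, get_detail_bullets(driver).get("Manufacturer") should give the manufacturer for markup like the one in the question, or None if the label is absent. A TimeoutException is still raised if no list items appear within the timeout.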

Or use this XPath:

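try:
    # normalize-space() collapses the line breaks inside the label, so it matches 'Manufacturer :';
    # following-sibling::span then selects the value span next to the label
    manufacturer_text = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.XPATH, "//span[normalize-space() = 'Manufacturer :']//following-sibling::span"))).text
    return manufacturer_text

except TimeoutException:
    return None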
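
The XPath in the question most likely fails because text()='Manufacturer' requires an exact match, while the label's text in the posted HTML also contains line breaks, indentation, and the colon. The sketch below approximates what XPath's normalize-space() does to that text, using plain Python. The helper normalize_space is a hypothetical stand-in for illustration, not part of Selenium.

```python
import re

XPATH_WHITESPACE = " \t\r\n"


def normalize_space(text):
    """Approximate XPath 1.0 normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", text.strip(XPATH_WHITESPACE))


# Text content of the label <span>, copied from the question's HTML
label = "Manufacturer\n        :\n        "

print(label == "Manufacturer")                     # False
print(normalize_space(label) == "Manufacturer :")  # True
```

This is why the answer compares against 'Manufacturer :' (with a single space before the colon) rather than 'Manufacturer'.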
