Why does locating elements by CSS selector in Python work only in some cases?

Problem description

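I am trying to scrape a used-car website (https://www.webmotors.com.br/) and extract some information about its cars. I am using the Selenium library and the find_element_by_css_selector method to locate this information. My input is a specific car name, for example "tracker". For some cars my code runs without any problem, but for some other models it just returns an error.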

The class names always stay the same, so I don't understand why this error occurs.

Here is the code, which is the same for both cases:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time
import os.path
from os import path
import pandas as pd

plan="WebScrapPrecos.csv"
FileExists = 1
if not path.exists(plan):
    FileExists = 0
f = open(plan, "a", encoding="utf-8")
if FileExists ==0:
    f.write("Nome;Modelo;Valor;Ano;Quilometragem;")

navegador = webdriver.Chrome()
navegador.get('https://www.webmotors.com.br/')
navegador.find_element_by_xpath('//*[@id="searchBar"]').send_keys('hb20')
time.sleep(3)
navegador.find_element_by_xpath('//*[@id="searchBar"]').send_keys(Keys.TAB + Keys.TAB + Keys.RETURN)
time.sleep(5)

SCROLL_PAUSE_TIME = 1

# Get scroll height
last_height = navegador.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    navegador.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = navegador.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height


wait = WebDriverWait(navegador, 20)
actions = ActionChains(navegador)

#wait for the first element visibility
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sc-hMFtBS.cVTeoI")))
#extra wait to make all the other elements loaded before getting the elements list
time.sleep(0.5)

cars = navegador.find_elements_by_css_selector('.sc-hMFtBS.cVTeoI')

for car in cars:
    # scroll element into the view
    actions.move_to_element(car).perform()
    time.sleep(0.2)
    names = car.find_element_by_css_selector('.sc-uJMKN.hNiOat').text
    models = car.find_element_by_css_selector('.sc-bbmXgH.fEaLmM').text
    prices = car.find_element_by_css_selector('.sc-kvZOFW.knsOia').text
    quilometers = car.find_element_by_css_selector('.sc-hmzhuo.ezNMNH').text
    f.write("\n")
    f.write(names)
    f.write(";")
    f.write(models)
    f.write(";")
    f.write(prices)
    f.write(";")
    f.write(quilometers)

And here are screenshots of the HTML for both cases (the first works, the second does not):

Working:

[image: working]

Not working:

[image: not working]

Tags: pandas, selenium

Solution


I'm not sure, but I'd guess the following two things may help:

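  1. Add a wait/delay before getting the list of car elements.
  2. Before reading each car's details, scroll that element into view. So the code would look like this: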
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time

wait = WebDriverWait(driver, 20)
actions = ActionChains(driver)

#wait for the first element visibility
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sc-hMFtBS.cVTeoI")))
#extra wait to make all the other elements loaded before getting the elements list
time.sleep(0.5)

# `driver` is the WebDriver instance (`navegador` in the question)
cars = driver.find_elements_by_css_selector('.sc-hMFtBS.cVTeoI')

for car in cars:
    #scroll element into the view
    actions.move_to_element(car).perform()
    time.sleep(0.2)
    names = car.find_element_by_css_selector('.sc-uJMKN.hNiOat').text
    models = car.find_element_by_css_selector('.sc-bbmXgH.fEaLmM').text
    prices = car.find_element_by_css_selector('.sc-kvZOFW.knsOia').text
    quilometers = car.find_element_by_css_selector('.sc-hmzhuo.ezNMNH').text

    f.write("\n")
    f.write(names)
    f.write(";")
    f.write(models)
    f.write(";")
    f.write(prices)
    f.write(";")
    f.write(quilometers)
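
If the error turns out to be a NoSuchElementException raised because some car cards simply lack one of the fields (an assumption, since the question does not show the actual error), a more tolerant lookup may also help. The sketch below uses find_elements (plural), which returns an empty list when nothing matches instead of raising. The helper names safe_text and car_to_row and the FIELD_SELECTORS mapping are hypothetical; the selectors are copied from the code above, and generated class names like these may change between site builds.

```python
# "css selector" is the string value of selenium's By.CSS_SELECTOR,
# so this sketch runs without importing selenium itself.
CSS = "css selector"

# Selectors copied from the code above (assumed to still match the site).
FIELD_SELECTORS = {
    "name": ".sc-uJMKN.hNiOat",
    "model": ".sc-bbmXgH.fEaLmM",
    "price": ".sc-kvZOFW.knsOia",
    "kilometers": ".sc-hmzhuo.ezNMNH",
}


def safe_text(parent, selector, default=""):
    """Return the text of the first match under `parent`, or `default` if none."""
    # find_elements returns [] when nothing matches, whereas
    # find_element raises NoSuchElementException.
    matches = parent.find_elements(CSS, selector)
    return matches[0].text if matches else default


def car_to_row(car):
    """Build one semicolon-separated line for a car card (hypothetical helper)."""
    return ";".join(safe_text(car, sel) for sel in FIELD_SELECTORS.values())
```

Inside the loop, the four lookups and the f.write calls could then be replaced with f.write("\n" + car_to_row(car)), so a card with a missing field produces an empty column rather than stopping the whole run.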
