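pandas - Why does locating elements by CSS selector in Python work only in some cases?
Question
I am trying to scrape a used-car website (https://www.webmotors.com.br/) and extract some information about its cars. I am using the Selenium library and the find_element_by_css_selector method to locate this information. My input is a specific car name, for example "tracker". For some cars my code runs without any problems, but for other models it just returns an error.
- The class names always stay the same, so I don't understand why this error occurs.
The code is the same for both cases: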
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time
import os.path
from os import path
import pandas as pd

plan = "WebScrapPrecos.csv"
FileExists = 1
if not path.exists(plan):
    FileExists = 0
f = open(plan, "a", encoding="utf-8")
if FileExists == 0:
    f.write("Nome;Modelo;Valor;Ano;Quilometragem;")

navegador = webdriver.Chrome()
navegador.get('https://www.webmotors.com.br/')
navegador.find_element_by_xpath('//*[@id="searchBar"]').send_keys('hb20')
time.sleep(3)
navegador.find_element_by_xpath('//*[@id="searchBar"]').send_keys(Keys.TAB + Keys.TAB + Keys.RETURN)
time.sleep(5)

SCROLL_PAUSE_TIME = 1
# Get scroll height
last_height = navegador.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    navegador.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = navegador.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

wait = WebDriverWait(navegador, 20)
actions = ActionChains(navegador)
# wait for the first element visibility
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sc-hMFtBS.cVTeoI")))
# extra wait to make all the other elements loaded before getting the elements list
time.sleep(0.5)
cars = navegador.find_elements_by_css_selector('.sc-hMFtBS.cVTeoI')
for car in cars:
    # scroll element into the view
    actions.move_to_element(car).perform()
    time.sleep(0.2)
    names = car.find_element_by_css_selector('.sc-uJMKN.hNiOat').text
    models = car.find_element_by_css_selector('.sc-bbmXgH.fEaLmM').text
    prices = car.find_element_by_css_selector('.sc-kvZOFW.knsOia').text
    quilometers = car.find_element_by_css_selector('.sc-hmzhuo.ezNMNH').text
    f.write("\n")
    f.write(names)
    f.write(";")
    f.write(models)
    f.write(";")
    f.write(prices)
    f.write(";")
    f.write(quilometers)
Below are screenshots of the HTML for both cases (the first works, the second does not):
Working:
Not working:
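Answer
I'm not sure, but I'd guess the following two things might help:
- Add a wait/delay before getting the list of car elements.
- For each car's details, scroll that element into view first.
So the code would look like this: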
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time

# driver is the WebDriver instance (navegador in the question);
# f is the output file already opened by the question's code.
wait = WebDriverWait(driver, 20)
actions = ActionChains(driver)
# wait for the first car element to become visible
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sc-hMFtBS.cVTeoI")))
# extra wait so the remaining elements can load before the list is fetched
time.sleep(0.5)
cars = driver.find_elements_by_css_selector('.sc-hMFtBS.cVTeoI')
for car in cars:
    # scroll the element into view before reading its details
    actions.move_to_element(car).perform()
    time.sleep(0.2)
    names = car.find_element_by_css_selector('.sc-uJMKN.hNiOat').text
    models = car.find_element_by_css_selector('.sc-bbmXgH.fEaLmM').text
    prices = car.find_element_by_css_selector('.sc-kvZOFW.knsOia').text
    quilometers = car.find_element_by_css_selector('.sc-hmzhuo.ezNMNH').text
    f.write("\n")
    f.write(names)
    f.write(";")
    f.write(models)
    f.write(";")
    f.write(prices)
    f.write(";")
    f.write(quilometers)
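If the error persists for some models even with the waits, one possible cause (a guess, not confirmed by the question) is that some car cards simply lack one of the four fields, so find_element_by_css_selector raises an exception for that card. The sketch below shows a tolerant lookup that relies on find_elements returning an empty list when nothing matches. The helper names text_or_default and car_row are hypothetical, and the class names are copied from the code above.

```python
# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR, so these
# helpers can be defined without importing Selenium.
CSS_SELECTOR = "css selector"

# Class names copied from the code above; they may differ on the live site.
FIELD_SELECTORS = [
    ".sc-uJMKN.hNiOat",   # name
    ".sc-bbmXgH.fEaLmM",  # model
    ".sc-kvZOFW.knsOia",  # price
    ".sc-hmzhuo.ezNMNH",  # mileage (the `quilometers` variable above)
]


def text_or_default(parent, selector, default=""):
    """Return the text of the first element under `parent` matching `selector`.

    find_elements returns an empty list when nothing matches, so a missing
    field yields `default` instead of raising an exception.
    """
    matches = parent.find_elements(CSS_SELECTOR, selector)
    return matches[0].text if matches else default


def car_row(car):
    """Build one output row (name, model, price, mileage) for a car card."""
    return [text_or_default(car, selector) for selector in FIELD_SELECTORS]
```

Inside the loop, `row = car_row(car)` would replace the four find_element_by_css_selector calls, and `f.write("\n" + ";".join(row))` would replace the series of f.write calls. An empty column in the CSV then points to the card and field that did not match.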