Trouble extracting a dynamic div list while scrolling down with WebDriver (Selenium & Python)

Question

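I am having a hard time figuring out how to get the refreshed dynamic list while scrolling down a page with WebDriver, using Selenium and Python 3. The site I am trying to scrape is https://www.ubereats.com/stores/. If the site redirects you to the home page, enter any city and click; it will then show the list of restaurants inside a div.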

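The interesting thing is that, if you inspect the element, <div class="base_ ue-ff ...>..</div> changes its list as I scroll down the page. Yet even when I scroll the page down with WebDriver from Selenium in Python, it still retrieves the old data that was extracted the first time. My sample code is below. I also added a sleep to let the data load, but it made no difference to the extracted data.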

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from urllib.request import urlopen
from importlib import reload
import re
import sys
import time
import numpy as np

driver = webdriver.Chrome(path_chrome_driver)
driver.get('https://www.ubereats.com')

wait_time_for_search_complete = float(np.random.uniform(1,2,1))
time.sleep(wait_time_for_search_complete)

input_city_name = driver.find_element_by_xpath("//input[@placeholder='Enter your delivery address']")

time_to_wait_to_enter_city_name = float(np.random.uniform(1, 2, 1))
time.sleep(time_to_wait_to_enter_city_name)

input_city_name.send_keys('Sydney')

time_to_wait_to_write_city = float(np.random.uniform(2, 3, 1))
time.sleep(time_to_wait_to_write_city)

select_first_in_dropdown = driver.find_element_by_xpath('//*[@id="app-content"]/div/div[1]/div/div[1]/div[1]/div[2]/div/div/div[3]/div[1]/div/div/div[2]/div/div/button[1]')
select_first_in_dropdown.click()

time_to_wait_to_load_restaurants = float(np.random.uniform(2, 3, 1))
time.sleep(time_to_wait_to_load_restaurants)

current_page = driver.page_source
soup = BeautifulSoup(current_page,'html.parser')

height = 0
restaurant_site = []
while True:
  restaurant_information = ''
  restaurant_information = soup.find_all('a',['base_','ue-kl','ue-km','ue-kn','ue-ko'])
  time.sleep(5)
  for restaurant in restaurant_information:
    print(restaurant['href'])

  height += 1000
  driver.execute_script("window.scrollTo(0,"+ str(height) +")")
  driver.implicitly_wait(3)

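Because the div is dynamic, I am really struggling to work out how to retrieve the list of restaurants as I scroll down the page. I believe this has something to do with AJAX calls, but if you have any alternative solution, please let me know. I would really like to solve this as soon as possible.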

Thanks!!

Tags: python, selenium, selenium-webdriver, web-scraping, beautifulsoup

Answer


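You simply forgot to refresh the HTML as you scroll down. driver.page_source returns a snapshot of the page at the moment it is called, so a soup parsed once before the loop never sees the restaurants loaded afterwards. The fix is simple: move the code below into the loop.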

current_page = driver.page_source
soup = BeautifulSoup(current_page,'html.parser')

See the example below.

...
time_to_wait_to_load_restaurants = float(np.random.uniform(2, 3, 1))
time.sleep(time_to_wait_to_load_restaurants)

height = 0
restaurant_site = []
while True:
    # Re-read the DOM on every pass; page_source is a snapshot taken at call time.
    current_page = driver.page_source
    soup = BeautifulSoup(current_page,'html.parser')
    restaurant_information = soup.find_all('a',['base_','ue-kl','ue-km','ue-kn','ue-ko'])
    for restaurant in restaurant_information:
        print(restaurant['href'])

    height += 1000
    driver.execute_script("window.scrollTo(0,"+ str(height) +")")
    # implicitly_wait() only sets the element-lookup timeout and does not pause,
    # so sleep here to give the newly requested restaurants time to load.
    time.sleep(5)
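
The loop above runs forever and prints the same links again on every pass. As a rough sketch only, and not part of the original answer, the same "re-read the HTML after each scroll" idea can be wrapped in a function that remembers which links it has already seen and stops once several scrolls in a row bring nothing new. Several things here are assumptions made for illustration: the helper names extract_hrefs and collect_while_scrolling, the max_idle_rounds stopping rule, and matching only the base_ class. To keep the sketch free of third-party dependencies it parses with the standard library's html.parser instead of BeautifulSoup, and it takes the page-reading and scrolling steps as callables so it can be tried without a browser.

```python
from html.parser import HTMLParser


class _LinkParser(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.css_class in classes and attr.get("href"):
            self.hrefs.append(attr["href"])


def extract_hrefs(html, css_class="base_"):
    """Return hrefs of matching links in one HTML snapshot (hypothetical helper)."""
    parser = _LinkParser(css_class)
    parser.feed(html)
    return parser.hrefs


def collect_while_scrolling(get_html, scroll_to, step=1000, max_idle_rounds=3):
    """Re-read the page after every scroll and accumulate unique links.

    get_html:  callable returning the current page HTML
    scroll_to: callable that scrolls to an offset and waits for content to load
    Stops after max_idle_rounds consecutive passes that find no new link
    (an assumed stopping rule, not something the site guarantees).
    """
    seen, seen_set = [], set()  # list keeps first-seen order, set gives fast lookup
    height = 0
    idle = 0
    while idle < max_idle_rounds:
        found_new = False
        for href in extract_hrefs(get_html()):  # fresh snapshot on every pass
            if href not in seen_set:
                seen_set.add(href)
                seen.append(href)
                found_new = True
        idle = 0 if found_new else idle + 1
        height += step
        scroll_to(height)
    return seen


# Possible wiring to a live Selenium session (not executed here):
#   def scroll_to(offset):
#       driver.execute_script("window.scrollTo(0, arguments[0]);", offset)
#       time.sleep(3)
#   links = collect_while_scrolling(lambda: driver.page_source, scroll_to)
```

Passing the two callables in keeps the scraping logic separate from the browser, so the deduplication and stopping behaviour can be checked against canned HTML before pointing it at the real site.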
