How to retrieve all links from a dynamic website using Selenium and Python

Problem description

I have the following code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException


chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
wait = WebDriverWait(driver,20)
links = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-brand + [href]")))
results = [link.get_attribute("href") for link in links]
#print(links)
print(results)
driver.quit()

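However, I only get the results/links for the featured products, not for all products. Occasionally (rarely, if I run it 20 times or so) I do get all of the products, but I want to get all of them every time. I also tried a different approach: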

from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")

links = [elem.get_attribute("href") for elem in driver.find_elements(By.TAG_NAME, 'a')]

print(links)

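Same problem. My question is: why am I unable to get all of the links? This has been driving me crazy for 2 weeks. I also tried adding a delay, thinking the page might not have finished loading, but it still did not work. Thanks.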

Tags: javascript, python, json, selenium-webdriver, web-scraping

Solution


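You can try using a control total: extract the total number of results from the page and add the number of featured products to it. Both numbers are already available to you, so you can loop until the number of matching hrefs reaches that total. You may want to add a timeout to the loop.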

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
wait = WebDriverWait(driver,20)
nonFeaturedTotal = int(wait.until(EC.presence_of_element_located((By.CSS_SELECTOR , '.ess-view-item-count-text'))).text.split(' ')[-1])
featuredTotal = len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-container-featured"))))
expectedTotal = featuredTotal + nonFeaturedTotal

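# Wait (up to the 20 s configured above) until every expected link is present.
try:
    wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]")) == expectedTotal)
except TimeoutException:
    print("Timed out before all links loaded; results may be incomplete")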

links = driver.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]")
results = [link.get_attribute("href") for link in links]

print(len(results))
print(results)

driver.quit()
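
The "loop until the count matches, with a timeout" idea can also be written as a small standalone helper. The following is only a sketch, not part of the original answer: the name `wait_for_count` and its parameters are hypothetical, and it polls any zero-argument callable so it can be tried without a browser.

```python
import time


def wait_for_count(get_count, expected, timeout=20.0, interval=0.5):
    """Poll get_count() until it returns `expected` or `timeout` seconds pass.

    Returns the last observed count, so the caller can check whether the
    control total was actually reached.
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count != expected and time.monotonic() < deadline:
        time.sleep(interval)  # give the page time to render more items
        count = get_count()
    return count


if __name__ == "__main__":
    # Hypothetical stand-in for a page that renders more links on each poll.
    observed = iter([4, 9, 16, 16])
    final = wait_for_count(lambda: next(observed), expected=16, timeout=5, interval=0.01)
    print("reached control total:", final == 16)
```

With a real driver, `get_count` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]"))`, with `expected` set to the control total computed from the page. Returning the last count instead of raising lets the caller decide whether an incomplete result is acceptable.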
