python - Unable to locate pagination links when scraping a website with Selenium and Python
Question
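I am learning web scraping with Selenium and have run into two problems with the site I am working on:
- The site has several pages of results, and I cannot find a way to get the page links and walk through them. For example, the following code returns link_page as NoneType.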
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league')
time.sleep(0.5)
results_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[2]/ul/li[3]/span')
results_button.click()
time.sleep(3)
season_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/ul/li[2]/span/strong/a')
season_button.click()
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/div/a[3]/span').get_attribute('href')
print(link_page.text)
driver.get(link_page)
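- For some reason I have to click results_button before I can get the href of a match. For example, the code below tries to go straight to the page (to work around problem 1 above), but looking up link_page raises NoSuchElementException.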
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league/results/#/page/2')
time.sleep(3)
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/table/tbody/tr[11]/td[2]/a').get_attribute('href')
print(link_page.text)
driver.get(link_page)
Answer
To locate the pagination links with Selenium so that you can iterate over them, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use the following locator strategy:
Using XPATH:
driver.get('https://www.oddsportal.com/soccer/england/premier-league/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='RESULTS']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='2018/2019']"))).click()
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='active-page']//following::a[@x-page]/span[not(contains(., '|')) and not(contains(., '»'))]/..")))])
Console output:
['https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/3/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/4/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/5/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/6/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/7/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8/']
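The hrefs in that output all follow the same `#/page/N/` pattern, so once one of them is known the rest can in principle be derived without further lookups. The following is a minimal sketch of that idea, not part of the original answer: the helper names `page_number` and `build_page_urls` are hypothetical, and it assumes the site keeps the URL layout shown in the output above. It uses only the standard library.

```python
from urllib.parse import urlsplit


def page_number(href):
    """Return N from a '#/page/N/' fragment, or None if the href has no such fragment."""
    fragment = urlsplit(href).fragment  # e.g. '/page/2/'
    parts = [p for p in fragment.split('/') if p]
    if len(parts) == 2 and parts[0] == 'page' and parts[1].isdigit():
        return int(parts[1])
    return None


def build_page_urls(results_url, last_page, first_page=2):
    """Build the page URLs first_page..last_page (assumed layout: <results url>#/page/N/)."""
    base = results_url.split('#', 1)[0]  # drop any existing fragment
    return ['{}#/page/{}/'.format(base, n) for n in range(first_page, last_page + 1)]


# Two of the hrefs from the console output above (the first and the last).
hrefs = [
    'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2/',
    'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8/',
]

numbers = [n for n in (page_number(h) for h in hrefs) if n is not None]
urls = build_page_urls(hrefs[0], max(numbers))
print(urls)
```

Note that everything after `#` is a URL fragment, which the browser does not send to the server. The page content for such a URL is therefore presumably filled in by client-side script after the initial load, so each `driver.get()` on one of these URLs would still need a WebDriverWait before any element lookup.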
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC