Is there an efficient way to mine HTML elements for Selenium using Python (while avoiding dev tools)? If this can be done with BeautifulSoup, how?

Problem description

I am writing a Selenium WebDriver script to automate the process of updating an event-registration portal.

A screenshot of the user interface is linked below.

I can log in to the portal successfully using XPath locators copied from Chrome Dev Tools. I can also automate switching between the folders on the left side of the screen (2018, 2017, ..., Canada, USA, ..., Vancouver, Kelowna, ...). Keep in mind, however, that I achieved this by manually handling an individual XPath locator for every clickable folder link, as the script below shows.

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


# Create a new instance of the Chrome driver
driver = webdriver.Chrome("C:/Users/Computer/Downloads/chromedriver")

# Go to the regOnline homepage
driver.get("######LoginPage#####")
driver.get("#####LinkedUIPage#####")
# the page is ajaxy so the title is originally this:
print("Title: " + driver.title)


# find the username and password element
userElement = driver.find_element(By.ID, "ctl00_cphMaster_txtLogin")
passElement = driver.find_element(By.ID, "ctl00_cphMaster_txtPassword")

# find the login button element
logInElement = driver.find_element(By.LINK_TEXT,"Sign In")


# type in account information
userElement.send_keys("#####")
passElement.send_keys("#####")

# log into the webpage (hit the enter button)
logInElement.send_keys(Keys.ENTER)

# xpath elements
calgaryXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[1]/div/span[2]'
edmontonXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[2]/div/span[2]'
fortNelsonXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[3]/div/span[2]'
fortStJohnXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[4]/div/span[2]'
halifaxXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[5]/div/span[2]'
kamloopsXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[6]/div/span[2]'
kelownaXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[7]/div/span[2]'
ottawaXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[10]/div/span[2]'
princeGeorgeXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[11]/div/span[2]'
saskatoonXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[12]/div/span[2]'
thunderBayXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[13]/div/span[2]'
torontoXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[14]/div/span[2]'
vancouverXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[15]/div/span[2]'
whitehorseXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[16]/div/span[2]'
williamsLakeXPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/ul/li[17]/div/span[2]'

page2018XPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/div/img'
canada2018XPATH = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]/ul/li[1]/ul/li[11]/ul/li[1]/div/img'

# find city element
calgary = driver.find_element(By.XPATH, calgaryXPATH)
edmonton = driver.find_element(By.XPATH, edmontonXPATH)
fortNelson = driver.find_element(By.XPATH, fortNelsonXPATH)
fortStJohn = driver.find_element(By.XPATH, fortStJohnXPATH)
halifax = driver.find_element(By.XPATH, halifaxXPATH)
kamloops = driver.find_element(By.XPATH, kamloopsXPATH)
kelowna = driver.find_element(By.XPATH, kelownaXPATH)
ottawa = driver.find_element(By.XPATH, ottawaXPATH)
princeGeorge = driver.find_element(By.XPATH, princeGeorgeXPATH)
saskatoon = driver.find_element(By.XPATH, saskatoonXPATH)
thunderBay = driver.find_element(By.XPATH, thunderBayXPATH)
toronto = driver.find_element(By.XPATH, torontoXPATH)
whitehorse = driver.find_element(By.XPATH, whitehorseXPATH)
vancouver = driver.find_element(By.XPATH, vancouverXPATH)
williamsLake = driver.find_element(By.XPATH, williamsLakeXPATH)

page2018 = driver.find_element(By.XPATH, page2018XPATH)
canada2018 = driver.find_element(By.XPATH, canada2018XPATH)

from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)

def goToPage(element, sec):
    # move to and click the given element, then wait for the page to settle
    actions.move_to_element(element)
    actions.click(element)
    actions.perform()
    time.sleep(sec)

# testing individual page access
goToPage(page2018,3)
goToPage(canada2018,3)



# save page html
#html = driver.page_source

#soup = BeautifulSoup(html)
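The commented-out lines above can be finished roughly like this. A minimal sketch, assuming the folder tree lives under the container id that the copied XPaths all share, and that each folder label is the second span in its div (per the `.../span[2]` tail of those XPaths); the sample HTML here is only a stand-in for `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the real markup comes from the portal.
html = """
<div id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes">
  <ul>
    <li><div><span>icon</span><span>Calgary</span></div></li>
    <li><div><span>icon</span><span>Edmonton</span></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The copied XPaths all start at this container, so scope the search to it.
tree = soup.find(id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes")

# Each folder's visible label is the second <span> inside its <div>.
labels = [div.find_all("span")[1].get_text(strip=True)
          for div in tree.find_all("div")]
print(labels)
```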

Referring to that image, I need to access all of the individual event links in the right-hand section of the UI, i.e. the event log. Copying an XPath for every single link would be laborious and unnecessary. Moreover, these links are updated constantly, so I need a way to access individual elements without manually copying and pasting from the browser's dev tools.
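One way to avoid hand-copying an absolute XPath per node is to keep only the stable container id and match on the part that actually identifies each node, such as its visible text. A sketch of that idea as plain string construction; the city names are the ones visible in the folder tree, and `label_xpath` is a hypothetical helper, not part of the original script:

```python
# The one stable anchor that all the hand-copied XPaths share.
TREE = '//*[@id="ctl00_ctl00_cphDialog_cpMgrMain_trUserNodes"]'

def label_xpath(label):
    """Locate a folder or link by its visible text instead of a brittle
    chain of positional li[...] indices."""
    return TREE + '//span[text()="{}"]'.format(label)

cities = ["Calgary", "Edmonton", "Vancouver"]
city_xpaths = {name: label_xpath(name) for name in cities}
```

Each entry can then be fed to `driver.find_element(By.XPATH, ...)`, and nodes added later need no new copied paths as long as their text is known (or is discovered first by parsing `driver.page_source`).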

UI screenshot

Question: - Is there an efficient way to mine HTML elements for Selenium using Python (while avoiding dev tools)?

-- Is it possible to do this with BeautifulSoup by parsing the HTML DOM? -- Even better if the proposed approach extends to any element in the UI.

Note - I don't know how to do this in BeautifulSoup, if it is even possible.
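On the BeautifulSoup note: BeautifulSoup cannot click anything, since it only parses static HTML, but it can mine `driver.page_source` for the link texts and hrefs that Selenium then uses as locators. A sketch, with an invented event-log snippet standing in for the real page (the actual structure will differ):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the event-log HTML fetched via driver.page_source.
html = """
<table class="event-log">
  <tr><td><a href="/event/101">Spring Gala</a></td></tr>
  <tr><td><a href="/event/102">Fall Workshop</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (text, href) for every event link in one pass.
events = [(a.get_text(strip=True), a["href"])
          for a in soup.find_all("a", href=True)]

# Selenium could then click each one by its discovered link text, e.g.:
#   driver.find_element(By.LINK_TEXT, events[0][0]).click()
```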

Regards, J

Tags: python, selenium, xpath, beautifulsoup, browser-automation

Solution

