Python Selenium web scraping of Tableau Public: how do I assign favorites to workbooks?

Problem description

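I wrote my first Selenium script to practice web scraping in Python. The idea is to scrape all workbooks, view counts, and favorites from a Tableau Public profile. I managed to extract these three values, but I don't know how to assign the favorites to their respective workbooks, because not every workbook has at least one favorite.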

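For example, the workbook "百老汇的斯凯勒" (roughly "Skyler on Broadway") has no favorites, but if I matched workbooks and favorites by position in a dictionary, it would get the next available value, which is 4.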

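Filtering with f.text != "" only removes the empty values at the end of the list; it does not account for workbooks in the middle that have no favorites.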
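To make the misalignment concrete, here is a minimal sketch using made-up data (the workbook names and counts are hypothetical, not taken from the profile). It assumes the page simply yields no favorites value for a workbook that has none, so the favorites list ends up shorter than the titles list:

```python
# Hypothetical data: "Workbook B" has no favorites value on the page.
titles = ["Workbook A", "Workbook B", "Workbook C"]
favorites_found = ["7", "4"]  # only A and C actually have a favorites count

# Pairing by position shifts C's value onto B and drops C entirely.
misaligned = dict(zip(titles, favorites_found))
print(misaligned)
```

Because zip stops at the shorter list and pairs purely by position, no amount of filtering after the fact can recover which workbook each count belonged to.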

What is the best way to solve this?

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(executable_path=r',mypath')

driver.get("https://public.tableau.com/profile/skybjohnson#!/")

#load entire website:

while True:

   try:
       # Wait until the "load more" button is clickable; the timeout raised once it is gone ends the loop
       WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "load-more-vizzes")))
       driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
       WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, "load-more-vizzes")))

   except Exception as e:
       print(e)
       break

#get workbook titles
titles = driver.find_elements(By.CLASS_NAME, "workbook-title")

workbook_titles = [i.text for i in titles if i.text != ""]
print(workbook_titles)

#get number of views per workbook
views = driver.find_elements(By.CLASS_NAME, 'workbook-view-count')

workbook_views = [int(v.text.split()[0]) for v in views if v.text != ""]
print(workbook_views)

#get number of favourites per workbook
favs = driver.find_elements(By.XPATH, '//SPAN[@ng-bind="controller.workbook.numberOfFavorites"]')

workbook_favs = [f.text for f in favs if f.text != ""]
print(workbook_favs)

Tags: python, selenium, web-scraping

Solution


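First get all the vizzes, then read the title, view count, and favorites from within each one, so the three values always belong to the same workbook. You also have to check whether the view count and favorites elements exist at all. The code below uses improved scrolling and a safe way to get the view count (0 if there are no views) and the favorites (0 if there are none):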

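from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Assumes chromedriver can be located automatically; otherwise configure its path as in the question
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
with driver:
    driver.get("https://public.tableau.com/profile/skybjohnson#!/")

    # Keep scrolling the "load more" element into view until it is no longer displayed
    wait.until(EC.presence_of_element_located((By.ID, "load-more-vizzes")))
    while driver.find_element(By.ID, "load-more-vizzes").is_displayed():
        driver.execute_script("arguments[0].scrollIntoView()", driver.find_element(By.ID, "load-more-vizzes"))

    # Each li.media-viz is one workbook; searching inside it keeps title, views, and favorites together
    vizzes = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".viz-container li.media-viz")))
    for viz in vizzes:
        if not viz.is_displayed():
            continue

        title = viz.find_element(By.CSS_SELECTOR, '[ng-bind="controller.workbook.title"]').text

        # find_elements returns an empty list instead of raising, so a missing count falls back to "0"
        views_count_list = viz.find_elements(By.CSS_SELECTOR, '[ng-bind="controller.workbook.viewCount"]')
        views_count = views_count_list[0].text if views_count_list else "0"

        number_of_favorites_list = viz.find_elements(By.CSS_SELECTOR, '[ng-bind="controller.workbook.numberOfFavorites"]')
        number_of_favorites = number_of_favorites_list[0].text if number_of_favorites_list else "0"

        print(title, views_count, number_of_favorites)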
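The "first match or default" pattern used above can be pulled into a small helper that also converts the text to an integer. The following is only a sketch: first_int is a hypothetical helper name, and SimpleNamespace objects stand in for the elements that find_elements would return inside one viz container, so the snippet runs without a browser. It assumes the count is the first whitespace-separated token of the element text, as in the question's own parsing.

```python
from types import SimpleNamespace


def first_int(elements, default=0):
    """Return the leading integer of the first element's text, or default if absent or empty."""
    if not elements or not elements[0].text.strip():
        return default
    return int(elements[0].text.split()[0])


# Stand-ins for the lists returned by find_elements inside one viz container.
with_count = [SimpleNamespace(text="4")]
with_label = [SimpleNamespace(text="120 views")]
missing = []  # the workbook has no such element at all

print(first_int(with_count), first_int(with_label), first_int(missing))
```

Inside the loop, a call such as first_int(number_of_favorites_list) would then give every workbook an integer favorites value, with 0 for workbooks that have none.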
