Selenium (Python) crashes when scraping many pages (50+)

Problem description

I have the script below to scrape bond data from TRACE (link in the code). It is a modified version of https://github.com/treatmesubj/FINRABondScrape.

It works fine for 50-100 pages, but fails when I try to scrape all ~7,600 pages. The script fails at bond = [tablerow.text] and throws the following error:

"StaleElementReferenceException: Message: The element reference is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed"

I added an explicit wait on tablerows, thinking that some tables take longer to load, but it does not seem to help since the problem persists. I have tried several other things, but I am out of ideas.

Any ideas on how to fix this would be appreciated. Also, any tips for speeding up the code are welcome. Thanks!
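
One common defensive pattern for this exception (not what ultimately fixed the script here; see the update below) is to re-locate the rows and read their text in a single pass, retrying from scratch if the grid re-renders mid-read. A minimal sketch, using the same Selenium 3 style locator API as the script; the helper name rows_text_with_retry is made up for illustration:

import time
from selenium.common.exceptions import StaleElementReferenceException

def rows_text_with_retry(driver, selector, attempts=3, delay=0.5):
    # Hypothetical helper: locate the rows and read their text in one pass,
    # starting over if the grid re-renders while we are reading it.
    for attempt in range(attempts):
        try:
            rows = driver.find_elements_by_css_selector(selector)
            return [[row.text] for row in rows]
        except StaleElementReferenceException:
            if attempt == attempts - 1:
                raise  # still stale after several tries; give up
            time.sleep(delay)  # give the grid time to settle

With such a helper, the inner for loop in the script would become bonds.extend(rows_text_with_retry(driver, 'div.rtq-grid-bd > div.rtq-grid-row')).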

Update: KunduK's suggestion below, plus increasing time.sleep(0.8) to time.sleep(1.5) in the for loop, seems to have solved the problem. However, I will hold off a bit on accepting KunduK's answer in case someone else comes up with a better one.
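
For reference, the combined fix amounts to two changes in the scraping loop of the script below (shown here in isolation; the script is otherwise as originally posted):

WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, (f"a.qs-pageutil-btn[value='{str(page)}']"))))  # 'on' class removed
time.sleep(1.5)  # increased from 0.8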

# TRACE Bond Scraper
import os
import time
import numpy as np
import pandas as pd
from datetime import date
from datetime import datetime as dt
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)
driver.get('http://finra-markets.morningstar.com/BondCenter/Results.jsp')

# Click agree, edit search and submit 
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".button_blue.agree"))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'a.qs-ui-btn.blue'))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'a.ms-display-switcher.hide'))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'input.button_blue[type=submit]'))).click()
WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.rtq-grid-row.rtq-grid-rzrow .rtq-grid-cell-ctn')))
headers = [title.text for title in driver.find_elements_by_css_selector(
    '.rtq-grid-row.rtq-grid-rzrow .rtq-grid-cell-ctn')[1:]]

# Find out the total number of pages to scrape
pg_no = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.qs-pageutil-total > span:nth-child(1)'))).text
pg_no = int(pg_no)

# Scrape tables
bonds = []
for page in range(1, pg_no):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn.on[value='{str(page)}']"))))
    time.sleep(0.8)
    tablerows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, 'div.rtq-grid-bd > div.rtq-grid-row')))
    for tablerow in tablerows:
        bond = [tablerow.text]
        bonds.append(bond)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, ('a.qs-pageutil-next')))).click()
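
Note that range(1, pg_no) stops at pg_no - 1, so the loop above scrapes every page except the last one. A sketch of a loop skeleton covering all pg_no pages (same waits and scraping body as above):

for page in range(1, pg_no + 1):  # include the final page
    # ... wait for the page button and scrape the rows, as above ...
    if page < pg_no:  # there is no "next" button after the last page
        WebDriverWait(driver, 10).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'a.qs-pageutil-next'))).click()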

Tags: python, selenium, web-scraping

Solution


Change this line from:

WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn.on[value='{str(page)}']"))))

to the following, which removes the on class:

WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn[value='{str(page)}']"))))
