I keep running into a delay when trying to scrape articles from a website in real time

Problem description

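I am trying to use selenium to get notified when articles are published on a specific website. However, one problem I have noticed is that the code is always at least 1 minute 30 seconds late in reading an article. For example, an article is published at 7:00 and the code reads it at 7:01-7:02. I have also noticed that the articles carry a timestamp, but they appear on the website later than the timestamp indicates. Why is this happening? I am using selenium because the site relies on JavaScript. I also tried scraping the RSS feed with bs4 to see whether that would remove the delay, but it is still there :(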

Here is my code for reference:

import time
from datetime import datetime

import schedule
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.globenewswire.com/NewsRoom'
pastData = []  # headlines that have already been reported


def publicScrape():
    """Open the newsroom page in headless Chrome and return the newest headline."""
    options = Options()
    options.add_argument('--headless')
    # Selenium 3 style call; Selenium 4 takes the driver path via a Service object instead.
    driver = webdriver.Chrome('/Users/sajjad/Dropbox/My Mac (MacBook Pro)/Downloads/chromedriver-4', options=options)
    try:
        driver.get(url)
        # Dismiss the cookie banner before reading the page.
        cookies_button = WebDriverWait(driver, 10).until(
            ec.element_to_be_clickable((By.XPATH, '//*[@id="cookies-consent"]/div/div/div/div/div/div[2]/button')))
        driver.execute_script("arguments[0].click();", cookies_button)

        links = driver.find_elements(By.XPATH, '//a')

        # The 16th link on the page is taken to be the newest headline.
        return links[15].text
    finally:
        driver.quit()  # close the browser even if the scrape fails


def comparison(new_data):
    """Return True if this headline has not been reported yet."""
    return new_data not in pastData


def run():
    headline = publicScrape()  # scrape once per run; every call launches a new browser
    now = datetime.now()  # the time the headline was read, not the time it was published
    if comparison(headline):
        pastData.append(headline)
        print(f'New article: {headline} published at {now}')
    else:
        print("Already Read")


schedule.every(1).seconds.do(run)

while True:
    schedule.run_pending()
    time.sleep(1)  # avoid a busy loop between scheduled runs

Tags: python, selenium, selenium-webdriver

Solution
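
One way to narrow down where the delay comes from is to log, for each new headline, the gap between its on-page timestamp and the moment the scraper first sees it. The sketch below is only an illustration under stated assumptions: `LagTracker` and `observe` are hypothetical names, the headline and times are made-up values, and the article timestamp is assumed to be already parsed into a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


class LagTracker:
    """Records when each headline is first seen and how far that trails its timestamp."""

    def __init__(self):
        self.first_seen = {}  # headline -> datetime at which it was first observed

    def observe(self, headline, published_at, seen_at=None):
        """Return the lag for a new headline, or None if it was already seen."""
        if headline in self.first_seen:
            return None
        if seen_at is None:
            seen_at = datetime.now(timezone.utc)
        self.first_seen[headline] = seen_at
        return seen_at - published_at


# Made-up values mirroring the question: stamped 7:00:00, first seen 7:01:30.
tracker = LagTracker()
published = datetime(2021, 1, 1, 7, 0, 0, tzinfo=timezone.utc)
seen = datetime(2021, 1, 1, 7, 1, 30, tzinfo=timezone.utc)

lag = tracker.observe("Example headline", published, seen)
print(lag.total_seconds())  # 90.0

# A second sighting of the same headline is ignored.
repeat = tracker.observe("Example headline", published, seen)
```

If the logged lag stays close to 90 seconds no matter how quickly the page is polled, that would suggest the article becomes visible on the site later than its timestamp, rather than the scraper being slow. If the lag shrinks when each poll gets faster, the time spent launching a browser on every run is the more likely cause.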

