Python Selenium web scraping with concurrent.futures

Problem description

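I am writing a script that scrapes data from multiple articles on the same website. I want to use threads to speed up the whole process. I have been using the code below: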

from selenium import webdriver
import pickle
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scraping(url):
    option = webdriver.ChromeOptions()
    option.add_argument("--incognito")
    driver = webdriver.Chrome(options=option)
    driver.get(url)
    # elements will be added to dictionary 'dic'
    dic = {}
    ##########################################
    # Selecting elements to scrape goes here #
    ##########################################
    driver.quit()
    return dic

# loading the URLS and adding them to a list called 'lines'
with open('URLS', 'rb') as fp:
    lines = pickle.load(fp)


with ThreadPoolExecutor(max_workers=2) as executor:

    start = time.time()
    futures = { executor.submit(scraping, url): url for url in lines }
    data = []
    for result in as_completed(futures):
        try:
            data.append(result.result())
        except Exception as e:
            print(e)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end-start))

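However, every call to the 'scraping' function creates a new browser instance, so each iteration opens 2 new windows. My goal is to keep 2 windows open and reuse them to scrape the remaining links. Does anyone know how to do this?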

Tags: python, selenium, python-multithreading

Solution
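
One possible approach, offered as a sketch rather than a confirmed answer: bind one driver to each worker thread with threading.local, so that with max_workers=2 at most two browsers are created and each is reused for every URL its thread handles. The names DriverPool, make_chrome_driver, and scrape_all are hypothetical helpers introduced here for illustration. Selenium is imported lazily inside the factory so that the pooling logic can also be tried with a stand-in driver.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_chrome_driver():
    """Create one incognito Chrome instance (hypothetical factory)."""
    # Imported lazily so the pooling logic below does not require Selenium.
    from selenium import webdriver

    option = webdriver.ChromeOptions()
    option.add_argument("--incognito")
    return webdriver.Chrome(options=option)


class DriverPool:
    """Hypothetical helper: one driver per worker thread, reused across tasks."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()   # per-thread storage for the driver
        self._drivers = []                # every driver created, for cleanup
        self._lock = threading.Lock()

    def get(self):
        # Create a driver the first time this thread asks, then reuse it.
        driver = getattr(self._local, "driver", None)
        if driver is None:
            driver = self._factory()
            self._local.driver = driver
            with self._lock:
                self._drivers.append(driver)
        return driver

    def quit_all(self):
        # Close every browser once all scraping tasks have finished.
        with self._lock:
            for driver in self._drivers:
                driver.quit()
            self._drivers.clear()


def scrape_all(urls, factory=make_chrome_driver, max_workers=2):
    pool = DriverPool(factory)

    def scraping(url):
        driver = pool.get()               # reuses this thread's browser
        driver.get(url)
        dic = {}
        # Selecting elements to scrape goes here, as in the question.
        return dic

    data = []
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(scraping, url): url for url in urls}
            for future in as_completed(futures):
                try:
                    data.append(future.result())
                except Exception as e:
                    print("{}: {}".format(futures[future], e))
    finally:
        pool.quit_all()
    return data
```

With the pickled list from the question, this would be called as data = scrape_all(lines). Thread-local storage is used here instead of a shared queue of drivers because ThreadPoolExecutor keeps its worker threads alive for the lifetime of the pool, so a driver bound to a thread is reused without extra bookkeeping.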

