How to "repopulate" a work queue in Python after an exception?

Problem description

I am trying to build a multithreaded Selenium scraper. Say I want to fetch 100,000 websites using 20 ChromeDriver instances and print their page sources. So far I have the following code:

from queue import Queue
from threading import Thread
from selenium import webdriver


selenium_data_queue = Queue()
worker_queue = Queue()

# Start 20 ChromeDriver instances
worker_ids = list(range(20))
selenium_workers = {i: webdriver.Chrome() for i in worker_ids}
for worker_id in worker_ids:
    worker_queue.put(worker_id)


def selenium_task(worker, data):

    # Open website
    worker.get(data)
    
    # Print website page source
    print(worker.page_source)

def selenium_queue_listener(data_queue, worker_queue):

    while True:
        url = data_queue.get()
        worker_id = worker_queue.get()

        worker = selenium_workers[worker_id]
        
        # Assign current worker and url to your selenium function
        selenium_task(worker, url)
        
        # Put the worker back into the worker queue as it has completed its task
        worker_queue.put(worker_id)
        data_queue.task_done()


if __name__ == '__main__':
    selenium_processes = [Thread(target=selenium_queue_listener,
                                 args=(selenium_data_queue, worker_queue)) for _ in worker_ids]

    for p in selenium_processes:
        p.daemon = True
        p.start()
    
    # Add urls to the data queue
    
    # Generate dummy urls just for testing
    for i in range(100000):
        d = f'http://www.website.com/{i}'
        selenium_data_queue.put(d)
    
    # Block until every url in the data queue has been processed
    selenium_data_queue.join()
    
    # Tearing down web workers
    for b in selenium_workers.values():
        b.quit()

My question is: if any ChromeDriver shuts down abruptly (i.e. with an unrecoverable exception such as InvalidSessionIdException), can I swap a new instance into the worker queue in its place, so that the queue is "repopulated"? If so, is there a good practice for doing this?

Tags: python, multithreading, selenium

Solution
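
One common pattern is to catch the fatal exception inside the listener itself: discard the dead driver, build a replacement, and requeue the url so the work is retried rather than lost. Below is a minimal, runnable sketch of that pattern, not the accepted answer. `DummyWorker` and `WorkerCrashed` are hypothetical stand-ins for `webdriver.Chrome` and `InvalidSessionIdException` so the example runs without a browser, the queue holds worker objects directly instead of ids, and a `None` sentinel is assumed as the shutdown signal.

```python
from queue import Queue


class WorkerCrashed(Exception):
    """Stand-in for selenium's InvalidSessionIdException."""


class DummyWorker:
    """Stand-in for webdriver.Chrome; raises once marked crashed."""

    def __init__(self):
        self.crashed = False

    def get(self, url):
        if self.crashed:
            raise WorkerCrashed(url)


def selenium_queue_listener(data_queue, worker_queue, make_worker):
    """Consume urls; replace any worker that dies mid-task."""
    while True:
        url = data_queue.get()
        if url is None:  # sentinel: stop this listener
            data_queue.task_done()
            break
        worker = worker_queue.get()
        try:
            worker.get(url)
        except WorkerCrashed:
            # The driver is gone for good: build a replacement and
            # requeue the url so the work is retried, not lost.
            worker = make_worker()
            data_queue.put(url)
        finally:
            # Return a live worker to the pool either way.
            worker_queue.put(worker)
            data_queue.task_done()
```

With real ChromeDriver instances you would also attempt a best-effort `worker.quit()` on the dead driver before replacing it, and catch `WebDriverException` (the base class of `InvalidSessionIdException`) if you want to treat every driver failure as fatal. Note that requeueing plus `task_done()` keeps `Queue.join()` accounting balanced: the extra `put` is matched by the `task_done` of the retried item.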

