Taking screenshots of multiple different domains at the same time

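Problem description

I am currently writing code that lets the user take screenshots of several different web pages at the same time using multithreading.

Code: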

import selenium
import threading
import time, datetime
from datetime import date, timedelta
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

domain_file = r'C:\Users\a\testfiles\testdomains.txt'
driver = webdriver.PhantomJS()
def file_len(file):
    with open(file, 'r') as f:
        for i, l in enumerate(f):
            pass
        return i + 1

current_date = date.today().strftime('%Y-%m-%d_')

def threadedloop(d):
    with open(domain_file, 'r') as f:
        for line in f:

            stripped_line = line.rstrip()
            url1 = 'http://' + stripped_line
            url2 = 'https://' + stripped_line
            imgname = current_date + 'http_' + stripped_line + '.png'
            imgSname = current_date + 'https_' + stripped_line + '.png'

            ### Screenshot function ###

            def scrshot():

                print('Taking screenshot of {}.'.format(stripped_line))

                try:
                    driver.get(url1)
                except TimeoutException:
                    print('{} timed out'.format(url1))
                    pass
                except Exception:
                    print('Unknown error at {}'.format(stripped_line))

                driver.maximize_window()
                driver.save_screenshot(imgname)
                try:
                    driver.get(url2)
                except TimeoutException:
                    print('{} timed out'.format(url2))
                    pass
                except Exception:
                    print('Unknown error at {}'.format(stripped_line))

                driver.maximize_window()
                driver.save_screenshot(imgSname)

            scrshot()

d = threading.local

start = time.time()

for i in range(file_len(domain_file)):
    t = threading.Thread(target = threadedloop, args=(d,))
    t.start()

t.join()

end = time.time()

print(end - start)

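The test file contains 4 domains. The problem is that each web page is not assigned to one single thread; instead, every page is processed by all 4 threads, which produces this output: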

Taking screenshot of google.com.
Taking screenshot of google.com.
Taking screenshot of google.com.
Taking screenshot of google.com.
Taking screenshot of reddit.com.
Taking screenshot of reddit.com.
Taking screenshot of reddit.com.
Taking screenshot of reddit.com.
Taking screenshot of facebook.com.
Taking screenshot of facebook.com.
Taking screenshot of facebook.com.
Taking screenshot of facebook.com.
Taking screenshot of facebook.com.
Taking screenshot of twitter.com.
Taking screenshot of twitter.com.
Taking screenshot of twitter.com.
Taking screenshot of twitter.com.

Any help is greatly appreciated.

Tags: python, python-multithreading

Solution

I went through your code and realized that you are not dividing the work into subtasks correctly.

def threadedloop(d):
    with open(domain_file, 'r') as f:
       for line in f:

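These two lines make the function threadedloop read every line of the file by itself. This means that every time the function is called, every URL is read and processed.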

Next, in the multithreading part:

for i in range(file_len(domain_file)):
    t = threading.Thread(target = threadedloop, args=(d,))
    t.start()

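Here you go over the lines of the file once more and start one thread per line, and each of those threads calls threadedloop. With 4 domains in the file, that gives 4 threads that each process all 4 URLs. I think you can see the problem.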
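To make the duplication concrete, here is a minimal sketch that mirrors the structure of your code without Selenium. The in-memory list `domains`, the `seen` list and the function name `threadedloop_like` are stand-ins I made up for illustration; they are not from your code.

```python
import threading

# Stand-in for the 4 lines of the domain file (assumption for illustration).
domains = ['google.com', 'reddit.com', 'facebook.com', 'twitter.com']

seen = []                      # every (thread, domain) visit is recorded here
seen_lock = threading.Lock()   # protects the shared list


def threadedloop_like():
    # Same shape as threadedloop: each thread walks the FULL list itself.
    for domain in domains:
        with seen_lock:
            seen.append(domain)


# One thread per domain, exactly like the loop over range(file_len(...)).
threads = [threading.Thread(target=threadedloop_like) for _ in domains]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen))  # 16: 4 threads, each visiting all 4 domains
```

Each domain ends up in `seen` four times, which matches the repeated lines in your output.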

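A better approach is to do the URL distribution only once, before creating the threads (the way your second snippet already loops once per line). Then pass each URL to the function through the args parameter, which you are currently using to pass threading.local.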
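A rough sketch of that idea, under some assumptions: the helper names `load_domains`, `screenshot_worker` and `run_all` are hypothetical, and the actual Selenium calls are replaced by a placeholder so the structure is easy to see. Treat it as one possible layout, not a finished implementation.

```python
import threading


def load_domains(path):
    """Read the domain file once and return its non-empty, stripped lines."""
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


def screenshot_worker(domain, results, lock):
    # Placeholder for the real work (assumption): this is where the
    # driver.get(...) / driver.save_screenshot(...) calls from scrshot()
    # would go, probably with a driver created inside the thread.
    print('Taking screenshot of {}.'.format(domain))
    with lock:
        results.append(domain)


def run_all(domains):
    """Start one thread per domain, each receiving exactly one domain."""
    results = []
    lock = threading.Lock()
    threads = []
    for domain in domains:
        # The domain is handed to the thread through args.
        t = threading.Thread(target=screenshot_worker,
                             args=(domain, results, lock))
        t.start()
        threads.append(t)
    # Wait for every thread, not only the last one that was started.
    for t in threads:
        t.join()
    return results
```

With your file this would be called as `run_all(load_domains(domain_file))`, and each domain should then be printed once.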

