
Problem Description

I am using the requests module to download the content of many websites, like this:

import requests
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)

This is simplified, of course, but the base URL is essentially always the same; say I just want to download an image each time. Because of the number of iterations this takes a long time. The internet connection is not the problem, it is the overhead of starting each request. So I was thinking about something like multiprocessing, since the task is basically the same every time and I can imagine it running in parallel.

Is this feasible somehow? Thanks in advance!

Tags: python, python-requests, multiprocessing

Solution


In this case I would suggest lightweight threads instead: downloading is I/O-bound, and the GIL is released while waiting on the network, so threads give you the concurrency without the start-up cost of new processes. When I ran the request against one URL 5 times, the results were:

Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)

Your implementation could look like this:

import concurrent.futures
import requests
from bs4 import BeautifulSoup
import time

def access_url(url, No):
    print(f"{No}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} :  {}".format(No, str(soup.title)[7:50])

if __name__ == "__main__":
    test_url = "http://example.com/"  # placeholder; substitute your own base URL
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start timing
    url_list = [base_url for i in range(5)]  # parameters as lists so map() can be used
    url_counter = [i for i in range(5)]
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # in this case threads are better
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()  # stop timing
    print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')

I think you will see much better performance in your case.
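As a minimal sketch of how your numbered-URL loop could be mapped onto `ThreadPoolExecutor` (the `BASE_URL`, the `.jpg` suffix, and the sleep-based `fetch_url` stub are placeholders for illustration; in real code the stub body would be a `requests.get` call):

```python
import concurrent.futures
import time

BASE_URL = "http://example.com/"  # hypothetical stand-in for your base_url

def fetch_url(url):
    # Stub standing in for requests.get(url); the sleep simulates network
    # latency so this sketch runs offline.
    time.sleep(0.05)
    return f"downloaded {url}"

# Build the URL list up front, mirroring the loop in the question
urls = [BASE_URL + f"{i}.jpg" for i in range(20)]

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # map() runs up to 10 fetches concurrently and preserves input order
    results = list(executor.map(fetch_url, urls))
elapsed = time.perf_counter() - start

print(f"Fetched {len(results)} URLs in {round(elapsed, 2)} second(s)")
```

In real code it is also worth creating a single `requests.Session` once and calling `session.get(url)` inside the worker; reusing the session keeps the underlying connection alive, which cuts down exactly the per-request start-up cost the question mentions.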

