Parallelizing a for loop - Python - unexpected problem

Problem description

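I am downloading a large number of files from a website and want the downloads to run in parallel because the files are large. Unfortunately, I can't share the website, because accessing the files requires a username and password that I can't share. Below is my code. I know it can't actually run without the website and my credentials, but I'm 99% sure I'm not allowed to share that information.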

import os
from os import makedirs
from os.path import join
import requests
from multiprocessing import Process

dataset="dataset_name"

################################
def down_file(dspath, file, savepath, ret):
    webfilename = dspath+file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading',file_base)
 
    req = requests.get(webfilename, cookies = ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])
    with open(file, 'wb') as outfile:
        chunk_size=1048576
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)

    return None

################################
##Download files
def download_files(filelist, c_DateNow):
    ## Authenticate    
    url = 'url'
    values = {'email' : 'email', 'passwd' : "password", 'action' : 'login'}
    ret = requests.post(url, data=values)

    ## Path to files
    dspath = 'datasetwebpath'
    
    # path_script is not defined in this snippet; it is assumed to be set elsewhere in the module
    savepath = join(path_script, dataset, c_DateNow)
    makedirs(savepath, exist_ok = True)

    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n"%(file) for file in filelist])
    
    # kick them off 
    for process in processes:
        print("\n", process)
        process.start()

    # now wait for them to finish
    for process in processes:
        process.join()

    #"""

    ####### This works and it's what i want to parallelize
    """
    ##Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""

################################
def main(c_DateNow, c_DateIni, c_DateFin):    
    ## Other code
    files=["list of web file addresses"] 
    print("   ...Files being downloaded\n     ", "\n      ".join(files), "\n")


    ## Download files
    download_files(files, c_DateNow)

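I want to download 25 files. When I run the code, every print statement that ran earlier in the program is printed again, even though the Process calls are nowhere near them. I also frequently get the following error: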

     An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

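I googled the error and couldn't figure out how to fix it. Are there not enough cores? Is there a way to limit the number of processes based on how many cores I have available? Or is it something else entirely?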

In another question here I read that Process has to be used under __main__, but this code is a module that gets imported by another script, so I run it like this:

import this_code 
import another1_code 
import another2_code 

#Step1
another1_code.main()

#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()

#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)

#step4
## More code

So I need the Process code to live in a function rather than under __main__.

I would appreciate any help or suggestions on how to properly parallelize the code above so that I can use it as a module from another script.

Tags: python, python-3.x, multiprocessing, python-multiprocessing

Solution
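The error text refers to the `if __name__ == "__main__":` idiom. On platforms that spawn child processes instead of forking them, each child re-imports the script that was launched, so any top-level calls in that script (Step 1 to Step 4 in the question) run again. That would explain both the repeated prints and the bootstrapping error. The guard therefore belongs in the top-level script, around those calls, while the Process code can stay inside the module's functions.

Because downloads are mostly I/O-bound, a thread pool is a possible alternative that avoids process start-up altogether and also caps how many downloads run at once. The following is a minimal sketch under that assumption, not the asker's method: `download_all` and `download_one` are hypothetical names, and capping at the CPU count is just one reasonable default.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(download_one, filelist, max_workers=None):
    """Call download_one(file) for every file, at most max_workers at a time.

    download_one is a hypothetical callable standing in for the question's
    down_file with dspath, savepath and ret already bound.
    Returns a dict mapping each file to the exception it raised, or None.
    """
    if max_workers is None:
        # Assumption: cap at the number of CPUs; for I/O-bound work
        # another limit may suit better.
        max_workers = os.cpu_count() or 1

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file; the pool never runs more than max_workers at once.
        futures = {pool.submit(download_one, f): f for f in filelist}
        for future in as_completed(futures):
            # exception() returns None when the call finished without error.
            results[futures[future]] = future.exception()
    return results
```

With the names from the question, the call inside download_files might look like `download_all(lambda f: down_file(dspath, f, savepath, ret), filelist, max_workers=8)`, where 8 is an arbitrary limit. Threads share the parent's memory, so nothing has to be pickled and the importing script needs no changes for this variant.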

