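python - Parallelizing a for loop - Python - Unexpected issues
Question
I am downloading a large number of files from a website and want the downloads to run in parallel because the files are large. Unfortunately I can't share the website, because accessing the files requires a username and password that I can't share. My code is below. I know it can't actually be run without the website and my credentials, but I'm 99% sure I'm not allowed to share that information.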
import os
import requests
from os import makedirs
from os.path import join
from multiprocessing import Process

dataset = "dataset_name"

################################
def down_file(dspath, file, savepath, ret):
    webfilename = dspath + file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading', file_base)
    req = requests.get(webfilename, cookies=ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])
    with open(file, 'wb') as outfile:
        chunk_size = 1048576
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)
    return None

################################
## Download files
def download_files(filelist, c_DateNow):
    ## Authenticate
    url = 'url'
    values = {'email': 'email', 'passwd': "password", 'action': 'login'}
    ret = requests.post(url, data=values)
    ## Path to files
    dspath = 'datasetwebpath'
    savepath = join(path_script, dataset, c_DateNow)  # path_script is not defined in this excerpt
    makedirs(savepath, exist_ok=True)
    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n" % (file) for file in filelist])
    # kick them off
    for process in processes:
        print("\n", process)
        process.start()
    # now wait for them to finish
    for process in processes:
        process.join()
    #"""
    ####### This works and it's what I want to parallelize
    """
    ##Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""

################################
def main(c_DateNow, c_DateIni, c_DateFin):
    ## Other code
    files = ["list of web file addresses"]
    print(" ...Files being downloaded\n ", "\n ".join(files), "\n")
    ## Download files
    download_files(files, c_DateNow)
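I want to download 25 files. When I run the code, every print statement that ran earlier in the program is printed again, even though the Process calls are nowhere near them. I also keep getting the following error: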
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
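I googled the error and don't know how to fix it. Does it mean there aren't enough cores? Is there a way to limit the number of processes based on how many cores I have available? Or is it something else entirely?
In another question here I read that Process has to be used inside the __main__ block, but this code is a module that gets imported into another script, so when I run it, I run it like this: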
import this_code
import another1_code
import another2_code
#Step1
another1_code.main()
#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()
#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)
#step4
## More code
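So I need the Process calls to live inside a function rather than in __main__.
I would appreciate any help or suggestions on how to parallelize the code above correctly so that I can use it as a module from another script.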
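A note on the error quoted above: the "proper idiom in the main module" it mentions is the if __name__ == "__main__": guard. With the spawn start method (the default on Windows), each child process re-imports the script that was launched, so any unguarded top-level code runs again. That would explain both the repeated prints and the bootstrapping error. The guard belongs in the script that is actually executed, not in the imported module, so the Process calls can stay inside an ordinary function. Below is a minimal sketch of that layout; work and run_all are hypothetical stand-ins for down_file and download_files, not the original code.

```python
import multiprocessing


def work(name):
    # Hypothetical stand-in for down_file(); it is defined at module level
    # so that child processes can find it when they import this file.
    print("working on", name)


def run_all(names):
    # Same start/join pattern as in the question, kept inside a plain function.
    processes = [multiprocessing.Process(target=work, args=(n,)) for n in names]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    # Only the script started by the user enters this branch. When a child
    # process re-imports the file (spawn start method), the branch is skipped,
    # so no new processes are started during bootstrapping.
    print(run_all(["a", "b", "c"]))
```

Applied to the layout in the question, this would presumably mean wrapping Step1 through Step4 of the top-level script in such a guard while leaving this_code itself unchanged.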
Solution
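One possible approach, offered as a sketch rather than a definitive answer: downloads spend most of their time waiting on the network, so a bounded pool of threads may be enough, and threads do not start child processes, so nothing gets re-imported. The helper name run_in_threads and the default of 4 workers are assumptions made for illustration. The worker count here limits concurrent connections and is not tied to the number of CPU cores.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_threads(task, items, max_workers=4):
    """Call task(item) for every item on at most max_workers threads.

    Returns a dict mapping each item to the value task returned for it.
    An exception raised inside task is re-raised here by future.result().
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool never runs more than
        # max_workers tasks at the same time.
        futures = {pool.submit(task, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Hypothetical use inside download_files() from the question, replacing
# the Process list and the two start/join loops:
#
#     run_in_threads(lambda f: down_file(dspath, f, savepath, ret), filelist)
```

Because the helper is an ordinary function, it can be called from an imported module without any guard in the calling script.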