Error: Too many open files in multiprocessing Python

Problem description

Input (a.txt) contains data as:
{person1: [www.person1links1.com]}

{person2: [www.person2links1.com,www.person2links2.com]}...(36000 lines of such data)

I am interested in extracting data from each person's links; my code is:

import urllib.request
import multiprocessing as mp

def get_bio(authr,urllist):
    author_data=[]
    for each in urllist:
        try:
            html = urllib.request.urlopen(each).read()
            author_data.append(html)
        except:
            continue
    f=open(authr+'.txt','w+')
    for each in author_data:
        f.write(str(each))
        f.write('\n')
        f.write('********************************************')
        f.write('\n')
    f.close()
if __name__ == '__main__':
    q=mp.Queue()
    processes=[]
    with open('a.txt') as f:
        for each in f:
            q.put(each)# dictionary
    while (q.qsize())!=0:
        for authr,urls in q.get().items():
            p=mp.Process(target=get_bio,args=(authr,urls))
            processes.append(p)
            p.start()
    for proc in processes:
        proc.join()

When I run this code I get the following error (I tried raising ulimit but hit the same error):

OSError: [Errno 24] Too many open files: 'personx.txt'
Traceback (most recent call last):
  File "perbio_mp.py", line 88, in <module>
    p.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
    parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files

Please point out where I went wrong and how I can correct it. Thanks.

Tags: python, multithreading, multiprocessing

Solution


urlopen returns a response object that wraps an open file. Your code never closes these files, hence the problem.

The response object is also a context manager, so instead of

    html = urllib.request.urlopen(each).read()
    author_data.append(html)

you can do

with urllib.request.urlopen(each) as response:
    author_data.append(response.read())

which makes sure the file is closed after reading.
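Applied to the question's `get_bio` (and with the bare `except:` narrowed to the network/file errors you actually expect), a minimal rewrite looks like this:

```python
import urllib.request

def get_bio(authr, urllist):
    author_data = []
    for each in urllist:
        try:
            # The with-block closes the response (and its socket)
            # even if read() raises.
            with urllib.request.urlopen(each) as response:
                author_data.append(response.read())
        except OSError:
            continue
    # Use a context manager for the output file as well, so it is
    # closed even if a write fails.
    with open(authr + '.txt', 'w') as f:
        for each in author_data:
            f.write(str(each))
            f.write('\n')
            f.write('********************************************')
            f.write('\n')
```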

Also, as folkol observed in the comments, you should limit the number of active processes to a reasonable value, since each process opens files at the operating-system level. In your version every line of a.txt spawns its own process before any finishes, and each `mp.Process` also needs an `os.pipe()` — which is exactly the call that runs out of descriptors in your traceback.

