Celery worker running TensorFlow fails to create CUDA event

Question

I am loading a TensorFlow model in a Celery worker, but when I try to run a task on the worker, it fails with the following error:

[2018-09-19 10:29:39,753: INFO/MainProcess] Received task: analyze_atom[f6bb76cc-aa16-4761-a7cf-0ed111886ff8]  
[2018-09-19 10:29:41,198: WARNING/ForkPoolWorker-2] paper checkpoint1 takes 1.433300495147705 senconds
2018-09-19 10:29:41.318467: E tensorflow/core/grappler/clusters/utils.cc:81] Failed to get device properties, error code: 3
2018-09-19 10:29:42.650529: E tensorflow/stream_executor/event.cc:40] could not create CUDA event: CUDA_ERROR_NOT_INITIALIZED
[2018-09-19 10:29:42,673: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:3782 exited with 'signal 11 (SIGSEGV)'
[2018-09-19 10:29:42,704: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).

This is a TensorFlow model. When Celery starts, the model is loaded onto the GPU successfully; here is the worker startup log:

totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-09-19 10:35:38.431559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:38.793007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:38.793054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:38.793063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:38.793487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-19 10:35:40.552010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:40.552073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:40.552080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:40.552085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:40.552327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-19 10:35:41.304281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:41.304336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:41.304344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:41.304348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:41.304574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-19 10:35:43.013963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:43.014025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:43.014033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:43.014038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:43.037554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-19 10:35:43.916442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:43.916500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:43.916507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:43.916512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:43.916752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-19 10:35:44.137238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-19 10:35:44.137296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-19 10:35:44.137304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-09-19 10:35:44.137308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-09-19 10:35:44.137563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
[2018-09-19 10:35:44,650: INFO/MainProcess] Connected to amqp://yjyx:**@118.178.129.156:5672/yjyx
[2018-09-19 10:35:44,667: INFO/MainProcess] mingle: searching for neighbors
[2018-09-19 10:35:45,716: INFO/MainProcess] mingle: sync with 1 nodes
[2018-09-19 10:35:45,717: INFO/MainProcess] mingle: sync complete
[2018-09-19 10:35:45,750: INFO/MainProcess] celery@yjyx-gpu-1 ready.

I can also see that GPU memory has been allocated (screenshot not reproduced here).

I am using supervisor to run Celery; here is the supervisor configuration:

[program:celeryworker_paperanalyzer]

process_name=%(process_num)02d
directory=/home/yjyx/yijiao_src/yijiao_main
command=celery worker -A project.celerytasks.celery_worker_init -Q paperanalyzer -c 2 --loglevel=INFO

user=yjyx
numprocs=1
stdout_logfile=/home/yjyx/log/celeryworker_paperanalyzer0.log
stderr_logfile=/home/yjyx/log/celeryworker_paperanalyzer1.log
stdout_logfile_maxbytes=50MB                           ; maximum size of logfile before rotation
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10                              ; number of backed up logfiles
stdout_logfile_backups=10

autostart=false
autorestart=false
startsecs=5

stopwaitsecs=8
killasgroup=true
priority=1000

Here is the Celery task code snippet:

import os
import traceback

from celery import shared_task

# Paper, target_path and logger are defined elsewhere in the project (not shown).

@shared_task(name="analyze_atom", queue="paperanalyzer")
def analyze_atom(image_urls, targetdir=target_path, studentuid=None):
    try:
        if targetdir is not None and os.path.exists(targetdir):
            os.chdir(targetdir)
        paper = Paper(image_urls, studentuid)
        for image_url in paper.image_urls:
            if isinstance(image_url, str):
                # TensorFlow inference is called inside paper.analyze
                paper.analyze(image_url)
            elif isinstance(image_url, dict):
                paper.analyze(image_url['url'], str(image_url['pn']), image_url.get('cormode', 0))
        return paper.data
    except Exception as e:
        # format_exc() returns the traceback as a string; print_exc() returns None
        logger.log(40, traceback.format_exc())  # 40 == logging.ERROR
        logger.log(40, e)
        return {}

I am confident the overall flow is fine. In fact, I previously used OpenCV inside paper.analyze to do this work and it ran well; the only change now is replacing OpenCV with TensorFlow.

Environment: Python 3.6.4; TensorFlow 1.8; Celery 4.0.2; OS: CentOS 7.2

Any help would be greatly appreciated. :-)

Thanks.

Wesley

Tags: python, tensorflow, celery

Answer
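
One possible cause, offered as an assumption since no answer is recorded here: the startup log shows the TensorFlow devices being created in the main process, while the failing task runs in ForkPoolWorker-2, a child forked by Celery's default prefork pool. A CUDA context initialized before fork() is generally not usable in the child, which would be consistent with CUDA_ERROR_NOT_INITIALIZED followed by SIGSEGV. A minimal sketch of a common workaround, loading the model lazily inside each worker process, is below. The names load_model and get_model are hypothetical, and load_model is only a stand-in for the real TensorFlow loading code, which the question does not show.

```python
import os

# Module-level cache. Each forked worker process ends up with its own copy.
_model = None
_model_pid = None


def load_model():
    """Hypothetical stand-in for the real TensorFlow model loading code."""
    return {"loaded_in_pid": os.getpid()}


def get_model():
    """Return a model created by the current process, loading it on first use.

    If the cached model was created in another process (for example, in the
    parent before a fork), it is discarded and loaded again here.
    """
    global _model, _model_pid
    pid = os.getpid()
    if _model is None or _model_pid != pid:
        _model = load_model()
        _model_pid = pid
    return _model
```

With this pattern, paper.analyze could call get_model() instead of using a model built at import time, so nothing touches the GPU before the pool forks. Each child process then holds its own copy of the model on the GPU. Starting the worker with --pool=solo, which runs tasks in the main process without forking, may also be worth trying, at the cost of concurrency.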

