Is there a way to update rather than overwrite worker_env for a Dask YarnCluster in a script?

Problem description

In my Dask-Yarn configuration file, i.e. ~/.config/dask/yarn.yaml, I set the worker environment variables as follows:

yarn:

  name: dask                 # Application name
  queue: default             # Yarn queue to deploy to
  deploy-mode: remote        # The deploy mode to use (either remote or local)
  environment: /dask_yarn.tar.gz          # Path to conda packed environment
  user: ''                     # The user to submit the application on behalf of

  worker:                   # Specifications of worker containers
    count: 0                # Number of workers to start on initialization
    restarts: -1            # Allowed number of restarts, -1 for unlimited
    env: {"ARROW_LIBHDFS_DIR": "/usr/hdp/lib"}                 # A map of environment variables to set on the worker

Now, in my script, I want to set an additional environment variable on the workers, one whose value is derived within the script itself, e.g.,

from dask_yarn import YarnCluster
cluster = YarnCluster(worker_env={"env_var": env_val})

where env_val is derived earlier in the script, before the statement above. However, this statement overwrites the configuration specified in ~/.config/dask/yarn.yaml. I don't want to hard-code ARROW_LIBHDFS_DIR in my script, and I can't put the new variable in ~/.config/dask/yarn.yaml either, because its value is only derived during script execution. So, is there a way to update the worker environment from the script without overwriting it?

Tags: dask

Solution

The constructor doesn't have an option for this, but you can do it by accessing dask's configuration:

import dask
from dask_yarn import YarnCluster

# Get the existing worker_env mapping (use `.copy()` so as not to mutate it)
worker_env = dask.config.get("yarn.worker.env", {}).copy()
# Add the environment variable derived in the script
worker_env["env_var"] = env_val
# Create your cluster with the merged environment
cluster = YarnCluster(worker_env=worker_env, ...)
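
A variant of the same idea (a sketch, not part of the original answer; it assumes dask-yarn falls back to the config value when worker_env is not passed, and env_var / env_val stand in for your actual names): merge the new variable into dask's configuration before constructing the cluster. dask.config.set expands a dotted key and merges it into the nested mapping, so ARROW_LIBHDFS_DIR from yarn.yaml is preserved:

import dask
from dask_yarn import YarnCluster

# Merge a single key into yarn.worker.env without replacing the whole map
dask.config.set({"yarn.worker.env.env_var": env_val})

# With no explicit worker_env, the cluster should use the merged config default
cluster = YarnCluster()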
