ML-Engine unable to access job_dir directory in bucket

Problem description

I am attempting to submit a job for training in ML-Engine using gcloud but am running into an error with service account permissions that I can't figure out. The model code exists on a Compute Engine instance from which I am running gcloud ml-engine jobs submit as part of a bash script. I have created a service account (ai-platform-developer@....iam.gserviceaccount.com) for gcloud authentication on the VM instance and have created a bucket for the job and model data. The service account has been granted Storage Object Viewer and Storage Object Creator roles for the bucket and the VM and bucket all belong to the same project.

When I try to submit a job per this tutorial, the following are executed:

time_stamp=`date +"%Y%m%d_%H%M"`
job_name='ObjectDetection_'${time_stamp}

gsutil cp object_detection/samples/configs/faster_rcnn_resnet50.config \
gs://[bucket-name]/training_configs/faster-rcnn-resnet50.${job_name}.config

gcloud ml-engine jobs submit training ${job_name} \
    --project [project-name] \
    --runtime-version 1.12 \
    --job-dir=gs://[bucket-name]/jobs/${job_name} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config object_detection/training-config.yml \
    -- \
    --model_dir=gs://[bucket-name]/output/${job_name} \
    --pipeline_config_path=gs://[bucket-name]/training_configs/faster-rcnn-resnet50.${job_name}.config

where [bucket-name] and [project-name] are placeholders for the bucket created above and the project it and the VM are contained in.

The config file is successfully uploaded to the bucket, I can confirm it exists in the cloud console. However, the job fails to submit with the following error:

ERROR: (gcloud.ml-engine.jobs.submit.training) User [ai-platform-developer@....iam.gserviceaccount.com] does not have permission to access project [project-name] (or it may not exist): Field: job_dir Error: You don't have the permission to access the provided directory 'gs://[bucket-name]/jobs/ObjectDetection_20190709_2001'
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: You don't have the permission to access the provided directory 'gs://[bucket-name]/jobs/ObjectDetection_20190709_2001'
    field: job_dir

If I look in the cloud console, the files specified by --packages exist in that location, and I've ensured the service account ai-platform-developer@....iam.gserviceaccount.com has been given the Storage Object Viewer and Storage Object Creator roles on the bucket, which has bucket-level permissions set. After ensuring the service account is activated and set as the default, I can also run

gsutil ls gs://[bucket-name]/jobs/ObjectDetection_20190709_2001

which successfully returns the contents of the folder without a permission error. In the project, there exists a managed service account service-[project-number]@cloud-ml.google.com.iam.gserviceaccount.com and I have also granted this account Storage Object Viewer and Storage Object Creator roles on the bucket.

To confirm this VM is able to submit a job, I am able to switch the gcloud user to my personal account and the script runs and submits a job without any error. However, since this exists in a shared VM, I would like to rely on service account authorization instead of my own user account.
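As an additional diagnostic, the bucket's IAM bindings and the currently active gcloud account can be compared directly from the VM. This is a minimal check using the same [bucket-name] placeholder as above; it assumes gsutil and gcloud are run from the same shell session as the failing job submission:

```shell
# Dump the bucket's IAM policy to see which roles each account holds
# at the bucket level.
gsutil iam get gs://[bucket-name]

# Confirm which account gcloud is actually using for API requests.
gcloud auth list
gcloud config get-value account
```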

Tags: google-cloud-storage, gcloud, google-cloud-ml

Solution


I had a similar problem with exactly the same error.

I found that the easiest way to track down these errors is to go to Logging and search for the text "PERMISSION DENIED".

In my case, the service account was missing the permission storage.buckets.get. You then need to find a role that has this permission. You can do this from IAM -> Roles; in that view you can filter roles by permission name. It turns out that only the following roles have the required permission:

  • Storage Admin
  • Storage Legacy Bucket Owner
  • Storage Legacy Bucket Reader
  • Storage Legacy Bucket Writer
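To double-check that a given role really carries the missing permission, the predefined roles can also be inspected from the CLI; storage.buckets.get should appear under includedPermissions:

```shell
# List the permissions bundled in the legacy bucket writer role.
gcloud iam roles describe roles/storage.legacyBucketWriter

# The same check for the broader admin role.
gcloud iam roles describe roles/storage.admin
```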

I added the "Storage Legacy Bucket Writer" role to the service account on the bucket, and was then able to submit the job.
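For reference, the same binding can be added from the CLI instead of the console. This is a sketch using the placeholder bucket name and the truncated service-account address from the question, and it assumes bucket-level (uniform) IAM as described there:

```shell
# Bind the Storage Legacy Bucket Writer role to the service account
# on the bucket only (not project-wide).
gsutil iam ch \
  serviceAccount:ai-platform-developer@....iam.gserviceaccount.com:roles/storage.legacyBucketWriter \
  gs://[bucket-name]
```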
