nni frameworkcontroller problem: BarrierUnknownFailed error

Problem description

https://nni.readthedocs.io/en/stable/TrainingService/FrameworkControllerMode.html

I followed this example to train my model with NNI on a Kubernetes cluster.

I have already set up FrameworkController (https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset), k8s-nvidia-plugin, and an NFS server.

On the command line, I ran "nnictl create --config frameworkConfig.yaml".

The contents of frameworkConfig.yaml are:

authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: ~/nni/examples/trials/mnist-tfv1
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}

This is the output of "kubectl describe pod":

Name:         nniexploq1yrw9trialblpky-worker-0
Namespace:    default
Priority:     0
Node:         mofl-c246-wu4/192.168.0.28
Start Time:   Fri, 03 Sep 2021 09:43:07 +0900
Labels:       FC_FRAMEWORK_NAME=nniexploq1yrw9trialblpky
              FC_TASKROLE_NAME=worker
              FC_TASK_INDEX=0
Annotations:  FC_CONFIGMAP_NAME: nniexploq1yrw9trialblpky-attempt
              FC_CONFIGMAP_UID: 07f61f6b-4073-480e-90a1-cb582b8221cf
              FC_FRAMEWORK_ATTEMPT_ID: 0
              FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_07f61f6b-4073-480e-90a1-cb582b8221cf
              FC_FRAMEWORK_NAME: nniexploq1yrw9trialblpky
              FC_FRAMEWORK_NAMESPACE: default
              FC_FRAMEWORK_UID: 0ecef503-aaa2-435c-8237-2b6cbb0ff897
              FC_POD_NAME: nniexploq1yrw9trialblpky-worker-0
              FC_TASKROLE_NAME: worker
              FC_TASKROLE_UID: e724f9ec-0c4f-11ec-b7f1-0242ac110006
              FC_TASK_ATTEMPT_ID: 0
              FC_TASK_INDEX: 0
              FC_TASK_UID: e725037b-0c4f-11ec-b7f1-0242ac110006
Status:       Pending
IP:           172.17.0.7
IPs:
  IP:           172.17.0.7
Controlled By:  ConfigMap/nniexploq1yrw9trialblpky-attempt
Init Containers:
  frameworkbarrier:
    Container ID:  docker://da951fb0d65c6e42f440c9e950e128dc246cc72ca8b280e8887c80e6931c7847
    Image:         frameworkcontroller/frameworkbarrier
    Image ID:      docker-pullable://frameworkcontroller/frameworkbarrier@sha256:4f56b0f70d060ab610bc72d994311432565143cd4bb2613916425f8f3e80c69f
    Port:          <none>
    Host Port:     <none>
    State:         Running
      Started:     Fri, 03 Sep 2021 09:53:22 +0900
    Last State:    Terminated
      Reason:      Error
      Message:     Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:52:56.964433       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:06.962820       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:16.963508       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:16.963990       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 00:53:16.963998       9 barrier.go:283] BarrierUnknownFailed: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 00:53:16.964013       9 barrier.go:470] ExitCode: 1: Exit with unknown failure to tell controller to retry within maxRetryCount.

      Exit Code:    1
      Started:      Fri, 03 Sep 2021 09:43:16 +0900
      Finished:     Fri, 03 Sep 2021 09:53:16 +0900
    Ready:          False
    Restart Count:  1
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexploq1yrw9trialblpky
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexploq1yrw9trialblpky-attempt
      FC_POD_NAME:                        nniexploq1yrw9trialblpky-worker-0
      FC_FRAMEWORK_UID:                   0ecef503-aaa2-435c-8237-2b6cbb0ff897
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_07f61f6b-4073-480e-90a1-cb582b8221cf
      FC_CONFIGMAP_UID:                   07f61f6b-4073-480e-90a1-cb582b8221cf
      FC_TASKROLE_UID:                    e724f9ec-0c4f-11ec-b7f1-0242ac110006
      FC_TASK_UID:                        e725037b-0c4f-11ec-b7f1-0242ac110006
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgrrf (ro)
Containers:
  framework:
    Container ID:
    Image:         msranni/nni:latest
    Image ID:
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/LOq1YRw9/BlpKy/run_worker.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexploq1yrw9trialblpky
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexploq1yrw9trialblpky-attempt
      FC_POD_NAME:                        nniexploq1yrw9trialblpky-worker-0
      FC_FRAMEWORK_UID:                   0ecef503-aaa2-435c-8237-2b6cbb0ff897
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_07f61f6b-4073-480e-90a1-cb582b8221cf
      FC_CONFIGMAP_UID:                   07f61f6b-4073-480e-90a1-cb582b8221cf
      FC_TASKROLE_UID:                    e724f9ec-0c4f-11ec-b7f1-0242ac110006
      FC_TASK_UID:                        e725037b-0c4f-11ec-b7f1-0242ac110006
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgrrf (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    <my nfs server ip>
    Path:      /another
    ReadOnly:  false
  frameworkbarrier-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-jgrrf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                  From               Message
  ----    ------     ----                 ----               -------
  Normal  Scheduled  12m                  default-scheduler  Successfully assigned default/nniexploq1yrw9trialblpky-worker-0 to mofl-c246-wu4
  Normal  Pulled     12m                  kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 2.897722411s
  Normal  Pulling    2m10s (x2 over 12m)  kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
  Normal  Pulled     2m8s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 2.790335708s
  Normal  Created    2m7s (x2 over 12m)   kubelet            Created container frameworkbarrier
  Normal  Started    2m6s (x2 over 12m)   kubelet            Started container frameworkbarrier

The Kubernetes NNI pod stays in the "Init" state indefinitely:

frameworkcontroller-0               1/1     Running    0          5m49s
nniexploq1yrw9trialblpky-worker-0   0/1     Init:0/1   0          42s

Tags: kubernetes, neural-network

Solution


github.com/microsoft/frameworkcontroller/issues/64

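Please refer to this link to resolve the problem. As the log above shows, the frameworkbarrier init container runs as the service account "system:serviceaccount:default:default", which is forbidden from getting "frameworks" resources in the API group "frameworkcontroller.microsoft.com". The barrier therefore never completes, and the pod stays in Init until that permission is granted through RBAC.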
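
The RBAC grant implied by the error message can be sketched as the manifest below. This is a minimal sketch under stated assumptions, not necessarily the exact fix described in the linked issue. The API group, resource, namespace, and service account are taken from the log above; the object name "frameworkbarrier" and the choice of the get/list/watch verbs are assumptions made for illustration.

```yaml
# Sketch only: grants read access on Framework objects to the service
# account named in the "forbidden" error (default/default).
# The name "frameworkbarrier" is an illustrative assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: frameworkbarrier
rules:
  - apiGroups: ["frameworkcontroller.microsoft.com"]  # API group from the error
    resources: ["frameworks"]                         # resource from the error
    verbs: ["get", "list", "watch"]                   # "get" is the denied verb
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: frameworkbarrier
subjects:
  - kind: ServiceAccount
    name: default        # service account from the error message
    namespace: default   # namespace from the error message
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: frameworkbarrier
```

After saving this to a file (for example the hypothetical name frameworkbarrier-rbac.yaml) and applying it with "kubectl apply -f", the permission can be checked with "kubectl auth can-i get frameworks.frameworkcontroller.microsoft.com --as=system:serviceaccount:default:default -n default". Binding to the default service account keeps frameworkConfig.yaml unchanged. A dedicated service account is the stricter alternative, but the trial pods then have to be configured to run under it, so check the NNI documentation for your version before relying on that.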