kubernetes - NNI frameworkcontroller issue: BarrierUnknownFailed error
Problem description
https://nni.readthedocs.io/en/stable/TrainingService/FrameworkControllerMode.html
I followed this example to train my model with NNI on a Kubernetes cluster.
I have already set up frameworkcontroller (https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset), k8s-nvidia-plugin, and an NFS server.
On the command line, I ran "nnictl create --config frameworkConfig.yaml".
Here is frameworkConfig.yaml:
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: ~/nni/examples/trials/mnist-tfv1
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
This is the output of "kubectl describe pod":
Name: nniexploq1yrw9trialblpky-worker-0
Namespace: default
Priority: 0
Node: mofl-c246-wu4/192.168.0.28
Start Time: Fri, 03 Sep 2021 09:43:07 +0900
Labels: FC_FRAMEWORK_NAME=nniexploq1yrw9trialblpky
FC_TASKROLE_NAME=worker
FC_TASK_INDEX=0
Annotations: FC_CONFIGMAP_NAME: nniexploq1yrw9trialblpky-attempt
FC_CONFIGMAP_UID: 07f61f6b-4073-480e-90a1-cb582b8221cf
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_07f61f6b-4073-480e-90a1-cb582b8221cf
FC_FRAMEWORK_NAME: nniexploq1yrw9trialblpky
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_UID: 0ecef503-aaa2-435c-8237-2b6cbb0ff897
FC_POD_NAME: nniexploq1yrw9trialblpky-worker-0
FC_TASKROLE_NAME: worker
FC_TASKROLE_UID: e724f9ec-0c4f-11ec-b7f1-0242ac110006
FC_TASK_ATTEMPT_ID: 0
FC_TASK_INDEX: 0
FC_TASK_UID: e725037b-0c4f-11ec-b7f1-0242ac110006
Status: Pending
IP: 172.17.0.7
IPs:
IP: 172.17.0.7
Controlled By: ConfigMap/nniexploq1yrw9trialblpky-attempt
Init Containers:
frameworkbarrier:
Container ID: docker://da951fb0d65c6e42f440c9e950e128dc246cc72ca8b280e8887c80e6931c7847
Image: frameworkcontroller/frameworkbarrier
Image ID: docker-pullable://frameworkcontroller/frameworkbarrier@sha256:4f56b0f70d060ab610bc72d994311432565143cd4bb2613916425f8f3e80c69f
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 03 Sep 2021 09:53:22 +0900
Last State: Terminated
Reason: Error
Message: Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:52:56.964433 9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:06.962820 9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:16.963508 9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 00:53:16.963990 9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 00:53:16.963998 9 barrier.go:283] BarrierUnknownFailed: frameworks.frameworkcontroller.microsoft.com "nniexploq1yrw9trialblpky" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 00:53:16.964013 9 barrier.go:470] ExitCode: 1: Exit with unknown failure to tell controller to retry within maxRetryCount.
Exit Code: 1
Started: Fri, 03 Sep 2021 09:43:16 +0900
Finished: Fri, 03 Sep 2021 09:53:16 +0900
Ready: False
Restart Count: 1
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexploq1yrw9trialblpky
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexploq1yrw9trialblpky-attempt
FC_POD_NAME: nniexploq1yrw9trialblpky-worker-0
FC_FRAMEWORK_UID: 0ecef503-aaa2-435c-8237-2b6cbb0ff897
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_07f61f6b-4073-480e-90a1-cb582b8221cf
FC_CONFIGMAP_UID: 07f61f6b-4073-480e-90a1-cb582b8221cf
FC_TASKROLE_UID: e724f9ec-0c4f-11ec-b7f1-0242ac110006
FC_TASK_UID: e725037b-0c4f-11ec-b7f1-0242ac110006
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgrrf (ro)
Containers:
framework:
Container ID:
Image: msranni/nni:latest
Image ID:
Port: 4000/TCP
Host Port: 0/TCP
Command:
sh
/tmp/mount/nni/LOq1YRw9/BlpKy/run_worker.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexploq1yrw9trialblpky
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexploq1yrw9trialblpky-attempt
FC_POD_NAME: nniexploq1yrw9trialblpky-worker-0
FC_FRAMEWORK_UID: 0ecef503-aaa2-435c-8237-2b6cbb0ff897
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_07f61f6b-4073-480e-90a1-cb582b8221cf
FC_CONFIGMAP_UID: 07f61f6b-4073-480e-90a1-cb582b8221cf
FC_TASKROLE_UID: e724f9ec-0c4f-11ec-b7f1-0242ac110006
FC_TASK_UID: e725037b-0c4f-11ec-b7f1-0242ac110006
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/tmp/mount from nni-vol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgrrf (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
nni-vol:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: <my nfs server ip>
Path: /another
ReadOnly: false
frameworkbarrier-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-jgrrf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type    Reason     Age                  From               Message
----    ------     ----                 ----               -------
Normal  Scheduled  12m                  default-scheduler  Successfully assigned default/nniexploq1yrw9trialblpky-worker-0 to mofl-c246-wu4
Normal  Pulled     12m                  kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 2.897722411s
Normal  Pulling    2m10s (x2 over 12m)  kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
Normal  Pulled     2m8s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 2.790335708s
Normal  Created    2m7s (x2 over 12m)   kubelet            Created container frameworkbarrier
Normal  Started    2m6s (x2 over 12m)   kubelet            Started container frameworkbarrier
The NNI pod on Kubernetes stays in the "Init" state forever:
frameworkcontroller-0 1/1 Running 0 5m49s
nniexploq1yrw9trialblpky-worker-0 0/1 Init:0/1 0 42s
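Solution
github.com/microsoft/frameworkcontroller/issues/64
The log shows that the pod's service account (system:serviceaccount:default:default) is not allowed to get "frameworks" resources in the API group frameworkcontroller.microsoft.com, so the frameworkbarrier init container keeps failing and the pod never leaves Init. Please refer to the link above to resolve this problem.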
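The "forbidden" message in the frameworkbarrier log is a Kubernetes RBAC denial, so one possible fix is to grant the service account named in the error read access to Framework objects. The manifest below is a minimal sketch under assumptions, not necessarily what the linked issue recommends: the name frameworkbarrier-reader is made up for illustration, the subject and namespace are copied from the error message, and the list and watch verbs are included as an assumption (the log only shows "get" being denied). Binding the role to a dedicated service account instead of "default" would be tighter, but then the trial pods have to be configured to run under that account.

```yaml
# Sketch only: "frameworkbarrier-reader" is a hypothetical name.
# Grants read access to Framework objects in the "default" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: frameworkbarrier-reader
  namespace: default
rules:
  - apiGroups: ["frameworkcontroller.microsoft.com"]  # API group from the error message
    resources: ["frameworks"]                          # resource from the error message
    verbs: ["get", "list", "watch"]                    # "get" is what was denied; list/watch are an assumption
---
# Binds the role to the service account the pod actually ran as,
# i.e. system:serviceaccount:default:default from the log.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: frameworkbarrier-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: frameworkbarrier-reader
```

After saving this to a file and applying it with "kubectl apply -f", the permission can be checked with "kubectl auth can-i get frameworks.frameworkcontroller.microsoft.com --as=system:serviceaccount:default:default -n default". Once that answers "yes", the init container should be able to fetch the Framework object on its next retry.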