google-kubernetes-engine - Pod with a default-class PV takes 30 minutes to upgrade while waiting for the disk to attach
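Problem description
I deployed a Helm chart (a StatefulSet) with 1 pod and 2 containers, one of which has a PV (ReadWriteOnce) attached. On upgrade, it takes 30 minutes (7 failed attempts) for the pod to come back up, so the service is down for 30 minutes.
Some context:
- The PV uses the default GKE storage class
- It is a regional GKE cluster with one node in each zone
- The pod comes back up on the same node even though this is not enforced (so no node transfer is involved, as far as I can see)
- I had a similar issue on Azure AKS: it also failed 7 times, but much faster, so the downtime was minimal, and that case did involve a node transfer
Relevant part of the YAML file: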
        volumeMounts:
        - mountPath: /app/data
          name: prod-data
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: prod-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: standard
      volumeMode: Filesystem
Error message:
Unable to mount volumes for pod "foo" timeout expired waiting for volumes to attach or mount for pod "foo". list of unmounted volumes=[foo] list of unattached volumes [foo default-token-foo]
Additional context: this is what happens after triggering the StatefulSet upgrade.
Nothing has changed in the PVC description:
Name:          prod-data-prod-0
Namespace:     prod
StorageClass:  standard
Status:        Bound
Volume:        pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:        app=prod
               release=prod
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      500Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    prod-0
Events:        <none>
Then the first error:
Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]
Still the same description:
Name:          prod-data-prod-0
Namespace:     prod
StorageClass:  standard
Status:        Bound
Volume:        pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:        app=prod
               release=prod
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      500Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    prod-0
Events:        <none>
After the second failed mount, this is the pod description:
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  vlapi-prod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prod-data-prod-0
    ReadOnly:   false
  default-token-4624v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4624v
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
FailedMount no. 3: no change in the PVC description. Events from the pod description:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m44s default-scheduler Successfully assigned prod/prod-0 to gke-vlgke-a-default-pool-312c60b0-p8lb
Warning FailedMount 2m8s (x3 over 6m41s) kubelet, gke-vlgke-a-default-pool-312c60b0-p8lb Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]
Warning FailedMount 48s (x4 over 7m38s)
Warning FailedMount 13s (x5 over 9m17s)
Name:              pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:            failure-domain.beta.kubernetes.io/region=europe-west1
                   failure-domain.beta.kubernetes.io/zone=europe-west1-d
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             prod/prod-data-prod-0
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          500Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/zone in [europe-west1-d]
                   failure-domain.beta.kubernetes.io/region in [europe-west1]
Message:
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-vlgke-a-0d42343f-d-pvc-16f49d12-f644-11e9-952a-4201ac100008
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
FailedMount 47s (x6 over 12m)
FailedMount 11s (x7 over 13m)
FailedMount 33s (x8 over 16m)
FailedMount 9s (x9 over 18m)
FailedMount 0s (x10 over 20m)
Roughly 2 minutes pass between FailedMount timeouts.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned prod/prod-0 to gke-vlgke-a-default-pool-312c60b0-p8lb
Warning FailedMount 2m4s (x10 over 22m) kubelet, gke-vlgke-a-default-pool-312c60b0-p8lb Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]
Normal Pulling 11s kubelet, gke-gke-default-pool-312c60b0-p8lb Pulling image "gcr.io/foo-251818/foo:2019-11-05"
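The roughly two-minute spacing can be checked from the event lines themselves. The sketch below is only an illustration: it assumes the Age column reads as "last seen (xCOUNT over first seen)", and the helper names `parse_age` and `mean_gap` are hypothetical. The input tuples are copied from the FailedMount lines above.

```python
import re

# Seconds per unit for kubectl-style ages such as "6m41s" or "22m".
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_age(text: str) -> int:
    """Convert an age string like '6m41s' into seconds."""
    pairs = re.findall(r"(\d+)([dhms])", text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in pairs)


def mean_gap(last_seen: str, count: int, first_seen: str) -> float:
    """Average seconds between consecutive events.

    The events span from `first_seen` ago to `last_seen` ago,
    and `count` events leave `count - 1` gaps between them.
    """
    span = parse_age(first_seen) - parse_age(last_seen)
    return span / (count - 1)


# (last seen, count, first seen), taken from the event lines in the question.
events = [
    ("2m8s", 3, "6m41s"),
    ("48s", 4, "7m38s"),
    ("13s", 5, "9m17s"),
    ("2m4s", 10, "22m"),
]

for last_seen, count, first_seen in events:
    print(f"x{count}: {mean_gap(last_seen, count, first_seen):.1f}s between timeouts")
```

Each line comes out between about 133 and 137 seconds. The last figure is less precise because kubectl rounds "22m" to whole minutes.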
On the 11th attempt the mount goes through, with nothing having changed that I can tell from the PVC description.
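Solution
One possibility is that your pod's spec.securityContext.runAsUser and spec.securityContext.fsGroup are set to something other than 0 (non-root). In that case k8s tries to change the file access permissions of every file on the volume, which takes some time. Try setting them in your pod definition to: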
spec:
  securityContext:
    runAsUser: 0
    fsGroup: 0
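To illustrate why such a permission change can be slow on a large volume, here is a minimal sketch of a recursive group-ownership walk. It only approximates the idea and is not the kubelet's actual code: the function name `apply_fs_group` is hypothetical, the permission bits are an assumption, and it expects a Unix-like system. The point is that every file and directory is visited once, so the time grows with the number of entries on the disk.

```python
import os
import stat
import tempfile


def apply_fs_group(root: str, gid: int) -> int:
    """Set group ownership and group permission bits on everything under root.

    Returns the number of filesystem entries that were touched.
    """
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Handle the directory itself, then the files it contains.
        entries = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in entries:
            if os.path.islink(path):
                continue  # symlinks are left alone in this sketch
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if os.path.isdir(path):
                # Directories: group rwx plus setgid so new files inherit the group.
                mode |= stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP | stat.S_ISGID
            else:
                # Regular files: group read and write.
                mode |= stat.S_IRGRP | stat.S_IWGRP
            os.chown(path, -1, gid)  # -1 leaves the owning user unchanged
            os.chmod(path, mode)
            touched += 1
    return touched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for i in range(100):
            with open(os.path.join(root, f"file-{i}"), "w") as handle:
                handle.write("x")
        # Uses the current group so the demo runs without root privileges.
        print(apply_fs_group(root, os.getgid()))  # prints 101: the directory plus 100 files
```

On a 500Gi disk holding many small files, a walk of this kind performs two metadata writes per entry, which is where the delay would come from if this is the cause.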
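Other possibilities include a mismatch between the attributes (access mode, capacity) of the PVC and the PV. Also, if you have defined only one such PV, bringing up multiple pods against an RWO PVC can create contention.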
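The attribute comparison can be sketched as follows. This is a simplified illustration of the general matching rules (same storage class, the volume offers every requested access mode, the volume is at least as large as the request), not the controller's code. The names `VolumeAttrs`, `to_bytes` and `mismatches` are hypothetical, and the sample values are copied from the claim template and the describe output in the question.

```python
from dataclasses import dataclass

# Binary suffixes used by Kubernetes quantities such as "500Gi".
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Parse a quantity like '500Gi' (only binary suffixes in this sketch)."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


@dataclass(frozen=True)
class VolumeAttrs:
    storage_class: str
    access_modes: frozenset
    capacity: str


def mismatches(claim: VolumeAttrs, volume: VolumeAttrs) -> list:
    """Return a description of each attribute where the volume cannot satisfy the claim."""
    problems = []
    if claim.storage_class != volume.storage_class:
        problems.append("storage class differs")
    if not claim.access_modes <= volume.access_modes:
        problems.append("volume lacks a requested access mode")
    if to_bytes(claim.capacity) > to_bytes(volume.capacity):
        problems.append("volume is smaller than the request")
    return problems


# Values as shown for prod-data-prod-0 and its bound PV in the question.
pvc = VolumeAttrs("standard", frozenset({"RWO"}), "500Gi")
pv = VolumeAttrs("standard", frozenset({"RWO"}), "500Gi")
print(mismatches(pvc, pv))  # prints []
```

An empty list means none of these three attributes differ for the values given.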