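kubernetes - k8s: Liveness and Readiness probes failing on a multi-container pod
Problem description
I have a multi-container pod running on AWS EKS: a web app container on port 80 and a Redis container on port 6379.
After deployment, manual curl requests to the pod's IP address:port from inside the cluster always get a good response.
Ingress through the service works fine too.
However, the kubelet's probes fail, which causes restarts, and I'm not sure how to reproduce the probe failure or how to fix it.
Thanks for reading!
Here are the events: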
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Normal Killing pod/app-7cddfb865b-gsvbg Container app failed liveness probe, will be restarted
0s Normal Pulling pod/app-7cddfb865b-gsvbg Pulling image "registry/app:latest"
0s Normal Pulled pod/app-7cddfb865b-gsvbg Successfully pulled image "registry/app:latest"
0s Normal Created pod/app-7cddfb865b-gsvbg Created container app
With the names made generic, this is my deployment yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
  creationTimestamp: "2021-05-26T22:01:19Z"
  generation: 19
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "234691173"
  selfLink: /apis/apps/v1/namespaces/default/deployments/app
  uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: app
      release: app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
      creationTimestamp: null
      labels:
        app: app
        environment: production
        owner: acme
        release: app
    spec:
      containers:
      - image: redis:5.0.6-alpine
        imagePullPolicy: IfNotPresent
        name: redis
        ports:
        - containerPort: 6379
          hostPort: 6379
          name: redis
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: SYSTEM_ENVIRONMENT
          value: production
        envFrom:
        - configMapRef:
            name: app-production
        - secretRef:
            name: app-production
        image: registry/app:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: app
        ports:
        - containerPort: 80
          hostPort: 80
          name: app
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 500Mi
          requests:
            cpu: "1"
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: critical-app
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-08-10T17:34:18Z"
    lastUpdateTime: "2021-08-10T17:34:18Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-05-26T22:01:19Z"
    lastUpdateTime: "2021-09-17T16:48:54Z"
    message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 19
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
This is my service yaml:
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-05T20:11:33Z"
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "163989104"
  selfLink: /api/v1/namespaces/default/services/app
  uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
  clusterIP: 172.20.184.161
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 32648
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: app
    release: app
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
Update, October 20, 2021:
I took the advice and changed the readiness probe to these generous settings:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
These are the events:
5m21s Normal Scheduled pod/app-686494b58b-6cjsq Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s Normal Created pod/app-686494b58b-6cjsq Created container redis
5m20s Normal Started pod/app-686494b58b-6cjsq Started container redis
5m20s Normal Pulling pod/app-686494b58b-6cjsq Pulling image "registry/app:latest"
5m20s Normal Pulled pod/app-686494b58b-6cjsq Successfully pulled image "registry/app:latest"
5m20s Normal Created pod/app-686494b58b-6cjsq Created container app
5m20s Normal Pulled pod/app-686494b58b-6cjsq Container image "redis:5.0.6-alpine" already present on machine
5m19s Normal Started pod/app-686494b58b-6cjsq Started container app
0s Warning Unhealthy pod/app-686494b58b-6cjsq Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
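Strangely, when I manually request the health check page (the root page), I see the readiness probe start to succeed. Even so, the probes are not failing because the containers are unhealthy (they are running fine); the cause lies somewhere else.
Solution
Let's go through your probes so you can understand what is happening and, hopefully, find a way to fix it: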
### Readiness probe - "waiting" for the container to be ready
### to get to work.
###
### Liveness and readiness probes run independently of each other;
### neither waits for the other to pass. A failed readiness probe
### only takes the pod out of service, while a failed liveness probe
### restarts the container, so you might want to start
### with the readinessProbe first.
livenessProbe:
  ### Number of consecutive failed probes before the container is
  ### restarted. Try increasing this number, and once the pod is
  ### stable reduce it back to a lower value.
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  ###
  ### Delay before the first probe is executed.
  ### As before: increase the delay, then reduce it
  ### once you have figured out the correct value.
  ###
  initialDelaySeconds: 90
  ### How often (in seconds) to perform the probe.
  periodSeconds: 20
  successThreshold: 1
  ### Number of seconds after which the probe times out.
  ### 1 is the default, so I assume that you did not change it.
  ### Same as before: increase the value and figure out how long
  ### the endpoint actually takes to respond.
  timeoutSeconds: 1

### Same comments as above, including `initialDelaySeconds`.
### Readiness is "waiting" for the container to be ready to
### get to work.
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  ### Again, nothing new here: increase the value and then reduce it
  ### until you figure out the desired value for this probe.
  initialDelaySeconds: 90
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
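To make the interaction between these fields concrete, here is a rough back-of-the-envelope sketch in Python. The function names are hypothetical, and the formulas are a simplification: they assume every probe attempt fails by hitting the timeout and ignore any scheduling jitter the kubelet adds, so treat the numbers as estimates rather than exact kubelet behavior.

```python
def failure_window(period_seconds, failure_threshold, timeout_seconds):
    """Approximate seconds from the start of the first failing probe until
    `failure_threshold` consecutive probes have failed (estimate only)."""
    return (failure_threshold - 1) * period_seconds + timeout_seconds


def earliest_restart(initial_delay_seconds, period_seconds,
                     failure_threshold, timeout_seconds):
    """Approximate seconds after container start at which a liveness probe
    that never succeeds would trigger a restart (estimate only)."""
    return initial_delay_seconds + failure_window(
        period_seconds, failure_threshold, timeout_seconds)


# Liveness settings from the question: 90s delay, 20s period, 3 failures, 1s timeout.
print(earliest_restart(90, 20, 3, 1))
# Readiness settings from the question: 10s period, 3 failures, 1s timeout.
print(failure_window(10, 3, 1))
```

With the values from the question, an application that answers more slowly than one second has only about 40 seconds of failing liveness probes before its container is restarted, which is why raising `timeoutSeconds` is usually the first thing to try.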
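Check the logs/events
- If you are not sure that the probes themselves are the root cause, check the logs and events to find the root cause of these failures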
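Since the question asks how to reproduce the failure, one option is to imitate the probe by hand with the same timeout. The following is a minimal sketch, not the kubelet's actual implementation: `http_probe` and `run_probes` are hypothetical helper names, and it assumes that an httpGet probe counts any status from 200 to 399 as success and anything else, including a timeout, as failure. The kubelet sends its probes from the node that hosts the pod, so running this on that node should be the closest match.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import List, Tuple


def http_probe(url: str, timeout_seconds: float = 1.0) -> Tuple[bool, float, str]:
    """Send one GET and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            status = resp.status
        ok, detail = 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; treat them as probe failures.
        ok, detail = False, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers timeouts, refused connections and malformed responses.
        ok, detail = False, f"error: {exc!r}"
    return ok, time.monotonic() - start, detail


def run_probes(url: str, count: int = 5, period_seconds: float = 10.0,
               timeout_seconds: float = 1.0) -> List[bool]:
    """Probe `url` `count` times, pausing `period_seconds` between attempts."""
    results = []
    for attempt in range(1, count + 1):
        ok, elapsed, detail = http_probe(url, timeout_seconds)
        print(f"attempt {attempt}: ok={ok} elapsed={elapsed:.3f}s {detail}")
        results.append(ok)
        if attempt < count:
            time.sleep(period_seconds)
    return results
```

For example, `run_probes("http://10.10.14.199:80/", timeout_seconds=1.0)` uses the pod address from the events above. If the reported elapsed times hover near or above one second, the one-second `timeoutSeconds` is a plausible explanation for the `Client.Timeout exceeded while awaiting headers` messages.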