k8s: Liveness and Readiness probes failing on a multi-container pod

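Problem description

I have a multi-container pod running on AWS EKS: a web app container on port 80 and a Redis container on port 6379.

After deployment, manual curl probes against the pod's IP address:port from within the cluster always return a good response.
Ingress to the service is fine as well.

However, the kubelet's probes fail, which leads to restarts, and I'm not sure how to reproduce the probe failure or how to fix it.

Thanks for reading!

Here are the events: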

0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Normal    Killing                  pod/app-7cddfb865b-gsvbg                                   Container app failed liveness probe, will be restarted
0s          Normal    Pulling                  pod/app-7cddfb865b-gsvbg                                   Pulling image "registry/app:latest"
0s          Normal    Pulled                   pod/app-7cddfb865b-gsvbg                                   Successfully pulled image "registry/app:latest"
0s          Normal    Created                  pod/app-7cddfb865b-gsvbg                                   Created container app

With the names made generic, this is my deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
  creationTimestamp: "2021-05-26T22:01:19Z"
  generation: 19
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "234691173"
  selfLink: /apis/apps/v1/namespaces/default/deployments/app
  uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: app
      release: app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
      creationTimestamp: null
      labels:
        app: app
        environment: production
        owner: acme
        release: app
    spec:
      containers:
        - image: redis:5.0.6-alpine
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - containerPort: 6379
              hostPort: 6379
              name: redis
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
            - name: SYSTEM_ENVIRONMENT
              value: production
          envFrom:
            - configMapRef:
                name: app-production
            - secretRef:
                name: app-production
          image: registry/app:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 1
          name: app
          ports:
            - containerPort: 80
              hostPort: 80
              name: app
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "1"
              memory: 500Mi
            requests:
              cpu: "1"
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: critical-app
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: "2021-08-10T17:34:18Z"
      lastUpdateTime: "2021-08-10T17:34:18Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-05-26T22:01:19Z"
      lastUpdateTime: "2021-09-17T16:48:54Z"
      message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
  observedGeneration: 19
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

This is my service yaml:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-05T20:11:33Z"
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "163989104"
  selfLink: /api/v1/namespaces/default/services/app
  uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
  clusterIP: 172.20.184.161
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 32648
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: app
    release: app
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Update, October 20, 2021:

So I took the advice and modified the readiness probe with these generous settings:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10

These are the events:

5m21s       Normal    Scheduled                pod/app-686494b58b-6cjsq                                   Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container redis
5m20s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container redis
5m20s       Normal    Pulling                  pod/app-686494b58b-6cjsq                                   Pulling image "registry/app:latest"
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Successfully pulled image "registry/app:latest"
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container app
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Container image "redis:5.0.6-alpine" already present on machine
5m19s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container app
0s          Warning   Unhealthy                pod/app-686494b58b-6cjsq                                   Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

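Strangely, I did see the readiness probe start to pass when I requested the health check page (the root page) manually. Be that as it may, the probes are not failing because the containers are unhealthy (they are running fine); the cause lies somewhere else.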

Tags: kubernetes, amazon-eks

Solution

Let's go over your probes so you can understand what is going on and perhaps find a way to fix it:


### Readiness probe: "waits" for the container to be ready to get
### to work. While it fails, the pod is taken out of the Service
### endpoints, but the container is not restarted.

### Liveness probe: decides whether the container must be restarted.
### It runs independently of the readinessProbe and starts after its
### own initialDelaySeconds; it does not wait for readiness to pass.
### Both probes here hit the same URL, so the comments below apply to both.


livenessProbe:

  ### Number of consecutive failures tolerated before the container
  ### is restarted. Try increasing this number and, once the pod is
  ### stable, reduce it back to a lower value.
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP

  ### Delay before the first probe is executed.
  ### As before, try increasing the delay and reduce it
  ### again once you have figured out the correct value.
  initialDelaySeconds: 90

  ### How often (in seconds) to perform the probe.
  periodSeconds: 20
  successThreshold: 1

  ### Number of seconds after which the probe times out.
  ### Since the value is 1 (the default), I assume you did not change it.
  ### Same as before: increase the value to find out how long the
  ### endpoint actually takes to respond.
  timeoutSeconds: 1


### Same comments as above, including `initialDelaySeconds`.
### Readiness is "waiting" for the container to be ready to
### get to work.

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP

  ### Again, nothing new here: increase the values and then reduce
  ### them until you find the desired values for this probe.
  initialDelaySeconds: 90
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
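
To see how these four settings combine, here is a small arithmetic sketch in Python. It is only an approximation: it assumes the first probe fires exactly at initialDelaySeconds and every probe runs into the full timeout, and it ignores any scheduling jitter in the kubelet, so treat the result as a rough lower bound. The names `ProbeSettings` and `earliest_threshold_seconds` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeSettings:
    """Hypothetical container for the probe fields used in the manifests above."""
    initial_delay_seconds: int
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int


def earliest_threshold_seconds(probe: ProbeSettings) -> int:
    """Rough lower bound, in seconds after container start, for reaching failureThreshold.

    Assumes the first probe starts at initialDelaySeconds, later probes start
    one period apart, and each failing probe takes the full timeout.
    """
    last_probe_start = probe.initial_delay_seconds + (probe.failure_threshold - 1) * probe.period_seconds
    return last_probe_start + probe.timeout_seconds


# Values copied from the deployment in the question.
liveness = ProbeSettings(initial_delay_seconds=90, period_seconds=20, timeout_seconds=1, failure_threshold=3)
readiness = ProbeSettings(initial_delay_seconds=90, period_seconds=10, timeout_seconds=1, failure_threshold=3)

print("liveness threshold reached after about", earliest_threshold_seconds(liveness), "s")
print("readiness threshold reached after about", earliest_threshold_seconds(readiness), "s")
```

Under these assumptions, a container that never answers within the timeout would be restarted a little over two minutes after it starts, which is why raising timeoutSeconds and failureThreshold buys time to investigate.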


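Check the logs/events

  • If you are not sure that the probes are the root cause, check the logs and events to find the root cause of these failures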
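
To try to reproduce the timeout outside the kubelet, the following Python sketch imitates an httpGet probe: it sends a GET with a short timeout and counts any 2xx or 3xx answer as success. It is not the kubelet's implementation. The function names are hypothetical, `urllib` follows redirects on its own, and its timeout applies per socket operation rather than to the whole request, so the elapsed time is checked separately. The kubelet probes from the node that hosts the pod, so running the script on that node should come closest to what the probe sees.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_once(url, timeout_seconds=1.0):
    """Send one GET and return (ok, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers arrive as exceptions; keep the status code.
        status = err.code
    except (OSError, http.client.HTTPException) as err:
        # Timeouts, refused connections, resets, malformed responses.
        return False, f"error: {err}", time.monotonic() - start
    elapsed = time.monotonic() - start
    if elapsed > timeout_seconds:
        # The answer came back, but later than the overall budget.
        return False, f"HTTP {status} (too slow)", elapsed
    return 200 <= status < 400, f"HTTP {status}", elapsed


def run_probe(url, timeout_seconds=1.0, period_seconds=20.0, attempts=10):
    """Probe `url` repeatedly, print each result, and return the number of failures."""
    failures = 0
    for attempt in range(1, attempts + 1):
        ok, detail, elapsed = probe_once(url, timeout_seconds)
        if not ok:
            failures += 1
        print(f"{attempt:3d} {'ok  ' if ok else 'FAIL'} {elapsed:6.3f}s {detail}")
        if attempt < attempts:
            time.sleep(period_seconds)
    return failures
```

For example, `run_probe("http://10.10.14.199:80/", timeout_seconds=1.0, period_seconds=20)` uses the pod IP from the events above together with the liveness settings; substitute the current pod IP. If the elapsed times sit close to or above one second, the 1-second timeoutSeconds is the likely culprit.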
