Kubernetes Pod fails with CrashLoopBackOff even though the exit code is 0 in Airflow 2.0

Problem description

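I am upgrading Airflow from version 1.10 to 2.1.0. My project uses KubernetesPodOperator and KubernetesExecutor. Everything worked fine in Airflow 1.10. After upgrading to Airflow 2.1.0, however, the Pod is able to run the task, and once the task completes successfully the Pod restarts with a CrashLoopBackOff status. I have checked the livenessProbe, and it works as expected. I have also checked the other logs, but I could not find any problem in any of the containers or pods listed below.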

deployment.yaml file:

# Airflows
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  selector:
    matchLabels:
      app: airflow
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow
    spec:
      hostAliases:
      - ip: "xx.xx.xx.xx"
        hostnames:
        - "xxx.xxx.xxx"
      initContainers:
        - name: init-db
          image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
          imagePullPolicy: Always
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/usr/local/bin/bootstrap.sh"
          env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              secretKeyRef:
                key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                name: airflow-secrets
          - name: AFPW
            valueFrom:
              secretKeyRef:
                key: AFPW
                name: airflow-secrets
      containers:
      - name: web
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        ports:
        - name: web
          containerPort: 8080
        command:
          - "airflow"
        args:
          - "webserver"
        livenessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 240
          periodSeconds: 60
        env:
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              name: airflow-secrets
## The following values have been created as part of production setup
      - name: scheduler
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        command:
          - "airflow"
        args:
          - "scheduler"
        env:
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              name: airflow-secrets

Pod description (kubectl describe pod):

Name:         airflow-66776dc57c-z98vd
Namespace:    default
Priority:     0
Node:         gke-gke-xxxxx-de-nodes-xxxxx--ccb62dc3-24us/xxx.xx.xx.xx
Start Time:   Sat, 19 Jun 2021 17:49:16 +0000
Labels:       app=airflow
              pod-template-hash=66776dc57c
Annotations:  <none>
Status:       Running
IP:           xxx.xx.xx.xx
IPs:
  IP:           xxx.xx.xx.xx
Controlled By:  ReplicaSet/airflow-66776dc57c
Init Containers:
  init-db:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      /usr/local/bin/bootstrap.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jun 2021 17:50:04 +0000
      Finished:     Sat, 19 Jun 2021 17:50:23 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Containers:
  web:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      airflow
    Args:
      webserver
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:24 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8080/ delay=240s timeout=1s period=60s #success=1 #failure=3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
  scheduler:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      airflow
    Args:
      scheduler
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:25 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-kw529:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw529
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

Worker pod list and logs

Tags: kubernetes, airflow, kubernetes-helm, kubernetes-pod, airflow-2.x

Solution


restartPolicy: Always

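Always means that the container will be restarted even if it exits with a zero exit code (that is, successfully). You can explicitly specify restartPolicy: Never. If you do not set it, the policy defaults to Always.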
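
As a rough illustration of where that setting lives, here is a minimal sketch of a worker Pod spec with the policy set explicitly. It assumes the worker Pods are created from a Pod template of your own; the name airflow-worker and the image placeholder are hypothetical and do not come from the question, and the container name base follows the convention Airflow's documentation describes for pod template files. Note that the Pod template inside a Deployment only accepts restartPolicy: Always, so this belongs on the worker Pod spec, not in the deployment.yaml shown in the question.

```yaml
# Hypothetical worker Pod template (names and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker            # placeholder name, not from the question
spec:
  # Do not restart the container once the task process exits,
  # whether it exited with code 0 or with an error.
  restartPolicy: Never
  containers:
    - name: base                  # main task container (assumed naming)
      image: "<your-dags-image>:<tag>"   # placeholder image reference
```

With Never, a Pod whose container exits with code 0 ends in the Succeeded phase instead of being restarted. OnFailure is the third allowed value: it restarts the container only after a non-zero exit.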

See "Why starting daskdev/dask into a Pod fails?" for almost the same issue.

