Why is this Kubernetes pod not triggering our autoscaler to scale up?

Problem description

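We run a Kubernetes cluster with an autoscaler that, as far as I can tell, works perfectly most of the time. When we change the replica count of a given deployment so that it exceeds the cluster's resources, the autoscaler catches it and scales up. Likewise, it scales down when we need fewer resources.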

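That was until today, when some pods of our Airflow deployment stopped working because they could not get the resources they need. Instead of triggering a cluster scale-up, the pods fail immediately or are evicted for trying to request or use more resources than are available. See the YAML output of one of the failing pods below. The pods also never show up as Pending: they jump straight from launch to a failed state.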

Is there something I'm missing, in terms of some kind of retry tolerance, that would make the pod go Pending and thus wait for a scale-up?

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2019-12-02T22:41:19Z"
  name: ingest-customer-ff06ae4d
  namespace: airflow
  resourceVersion: "32545690"
  selfLink: /api/v1/namespaces/airflow/pods/ingest-customer-ff06ae4d
  uid: dba8b4c1-1554-11ea-ac6b-12ff56d05229
spec:
  affinity: {}
  containers:
  - args:
    - scripts/fetch_and_run.sh
    env:
    - name: COMPANY
      value: acme
    - name: ENVIRONMENT
      value: production
    - name: ELASTIC_BUCKET
      value: customer
    - name: ELASTICSEARCH_HOST
      value: <redacted>
    - name: PATH_TO_EXEC
      value: tools/storage/store_elastic.py
    - name: PYTHONWARNINGS
      value: ignore:Unverified HTTPS request
    - name: PATH_TO_REQUIREMENTS
      value: tools/requirements.txt
    - name: GIT_REPO_URL
      value: <redacted>
    - name: GIT_COMMIT
      value: <redacted>
    - name: SPARK
      value: "true"
    image: dkr.ecr.us-east-1.amazonaws.com/spark-runner:dev
    imagePullPolicy: IfNotPresent
    name: base
    resources:
      limits:
        memory: 28Gi
      requests:
        memory: 28Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/ssd
      name: tmp-disk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-cgpcc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp-disk
  - name: default-token-cgpcc
    secret:
      defaultMode: 420
      secretName: default-token-cgpcc
status:
  conditions:
  - lastProbeTime: "2019-12-02T22:41:19Z"
    lastTransitionTime: "2019-12-02T22:41:19Z"
    message: '0/9 nodes are available: 9 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
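
To read a dump like this without scanning it by eye, here is a minimal Python sketch that pulls the scheduler's verdict out of the status block. It assumes the pod has already been loaded into a plain dict (for example from `kubectl get pod -o json`). The helper name `unschedulable_reason` is made up for illustration and is not part of kubectl or any Kubernetes client library. The sample data is copied from the status section of the dump above.

```python
# Hypothetical helper for illustration; not a kubectl or client-library API.
def unschedulable_reason(pod):
    """Return the scheduler message if the pod is marked Unschedulable, else None."""
    status = pod.get("status", {})
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "")
    return None


# Status block copied from the pod dump in this question.
pod = {
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/9 nodes are available: 9 Insufficient memory.",
        }],
    }
}

# Print the recorded phase next to the scheduler's explanation.
print(pod["status"]["phase"], "-", unschedulable_reason(pod))
```

For this particular dump the recorded phase is Pending and the PodScheduled condition is False with reason Unschedulable, so the helper returns the "Insufficient memory" message. A pod that was scheduled normally would return `None`.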

Tags: kubernetes, airflow

Solution

