kubernetes - Why isn't this Kubernetes pod triggering our autoscaler to scale up?
Problem description
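We run a Kubernetes cluster with an autoscaler that, as far as I can tell, works perfectly most of the time. When we change the replica count of a given deployment so that it exceeds the cluster's resources, the autoscaler catches it and scales up. Likewise, it scales down when we need fewer resources.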
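That was until today, when some pods of our Airflow deployment stopped working because they could not get the resources they need. Rather than triggering a cluster scale-up, the pods either fail immediately or are evicted for trying to request or use more resources than are available. See the YAML output of one of the failing pods below. The pods also never show up as Pending: they jump straight from launch to a failed state.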
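Is there something I'm missing, in terms of some kind of retry tolerance, that would make the pod go Pending and therefore wait for a scale-up?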
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2019-12-02T22:41:19Z"
  name: ingest-customer-ff06ae4d
  namespace: airflow
  resourceVersion: "32545690"
  selfLink: /api/v1/namespaces/airflow/pods/ingest-customer-ff06ae4d
  uid: dba8b4c1-1554-11ea-ac6b-12ff56d05229
spec:
  affinity: {}
  containers:
  - args:
    - scripts/fetch_and_run.sh
    env:
    - name: COMPANY
      value: acme
    - name: ENVIRONMENT
      value: production
    - name: ELASTIC_BUCKET
      value: customer
    - name: ELASTICSEARCH_HOST
      value: <redacted>
    - name: PATH_TO_EXEC
      value: tools/storage/store_elastic.py
    - name: PYTHONWARNINGS
      value: ignore:Unverified HTTPS request
    - name: PATH_TO_REQUIREMENTS
      value: tools/requirements.txt
    - name: GIT_REPO_URL
      value: <redacted>
    - name: GIT_COMMIT
      value: <redacted>
    - name: SPARK
      value: "true"
    image: dkr.ecr.us-east-1.amazonaws.com/spark-runner:dev
    imagePullPolicy: IfNotPresent
    name: base
    resources:
      limits:
        memory: 28Gi
      requests:
        memory: 28Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/ssd
      name: tmp-disk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-cgpcc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp-disk
  - name: default-token-cgpcc
    secret:
      defaultMode: 420
      secretName: default-token-cgpcc
status:
  conditions:
  - lastProbeTime: "2019-12-02T22:41:19Z"
    lastTransitionTime: "2019-12-02T22:41:19Z"
    message: '0/9 nodes are available: 9 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
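To make the "9 Insufficient memory" condition in the status block concrete, here is a minimal Python sketch of the comparison involved: does a 28Gi memory request fit into the free memory of any node? The node names and free-memory figures are hypothetical, made up for illustration, and `parse_memory` handles only a subset of the Kubernetes quantity suffixes. It is not how the scheduler or the autoscaler is actually implemented.

```python
# Subset of the suffixes Kubernetes accepts for memory quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a memory quantity such as '28Gi' to a number of bytes."""
    # Try two-letter suffixes first so 'Gi' is matched before 'G'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain byte count, no suffix


def nodes_that_fit(request, free_by_node):
    """Return the names of nodes whose free memory covers the request."""
    need = parse_memory(request)
    return [name for name, free in free_by_node.items()
            if parse_memory(free) >= need]


# Hypothetical free memory per node (not taken from the real cluster).
free_memory = {"node-a": "12Gi", "node-b": "20Gi", "node-c": "6Gi"}

# The pod above requests 28Gi of memory.
candidates = nodes_that_fit("28Gi", free_memory)
print(candidates or "no node has enough free memory")
```

With these made-up numbers no node qualifies, which corresponds to the `Unschedulable` reason reported in the pod's `PodScheduled` condition.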
Solution