kubernetes - 在 Google Kubernetes Engine 上使用 Horizontal Pod Autoscaler 失败并显示:无法读取所有指标
问题描述
我正在尝试设置 Horizontal Pod Autoscaler 以根据 CPU 使用率自动扩展和缩减我的 api 服务器 pod。
我目前为我的 API 运行了 12 个 pod,但它们使用的 CPU 约为 0%。
kubectl get pods
NAME READY STATUS RESTARTS AGE
api-server-deployment-578f8d8649-4cbtc 2/2 Running 2 12h
api-server-deployment-578f8d8649-8cv77 2/2 Running 2 12h
api-server-deployment-578f8d8649-c8tv2 2/2 Running 1 12h
api-server-deployment-578f8d8649-d8c6r 2/2 Running 2 12h
api-server-deployment-578f8d8649-lvbgn 2/2 Running 1 12h
api-server-deployment-578f8d8649-lzjmj 2/2 Running 2 12h
api-server-deployment-578f8d8649-nztck 2/2 Running 1 12h
api-server-deployment-578f8d8649-q25xb 2/2 Running 2 12h
api-server-deployment-578f8d8649-tx75t 2/2 Running 1 12h
api-server-deployment-578f8d8649-wbzzh 2/2 Running 2 12h
api-server-deployment-578f8d8649-wtddv 2/2 Running 1 12h
api-server-deployment-578f8d8649-x95gq 2/2 Running 2 12h
model-server-deployment-76d466dffc-4g2nd 1/1 Running 0 23h
model-server-deployment-76d466dffc-9pqw5 1/1 Running 0 23h
model-server-deployment-76d466dffc-d29fx 1/1 Running 0 23h
model-server-deployment-76d466dffc-frrgn 1/1 Running 0 23h
model-server-deployment-76d466dffc-sfh45 1/1 Running 0 23h
model-server-deployment-76d466dffc-w2hqj 1/1 Running 0 23h
我的 api_hpa.yaml 看起来像:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server-deployment
minReplicas: 4
maxReplicas: 12
targetCPUUtilizationPercentage: 50
现在已经 24 小时了,即使没有看到 CPU 使用情况,HPA 仍然没有将我的 pod 缩减到 4 个。
当我查看 GKE 部署详细信息仪表板时,我看到警告无法读取所有指标
这是否会导致自动缩放器无法缩小我的 pod?
我该如何解决?
据我了解,GKE 会自动运行一个指标服务器:
kubectl get deployment --namespace=kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
event-exporter-gke 1/1 1 1 18d
kube-dns 2/2 2 2 18d
kube-dns-autoscaler 1/1 1 1 18d
l7-default-backend 1/1 1 1 18d
metrics-server-v0.3.6 1/1 1 1 18d
stackdriver-metadata-agent-cluster-level 1/1 1 1 18d
这是该指标服务器的配置:
Name: metrics-server-v0.3.6
Namespace: kube-system
CreationTimestamp: Sun, 21 Feb 2021 11:20:55 -0800
Labels: addonmanager.kubernetes.io/mode=Reconcile
k8s-app=metrics-server
kubernetes.io/cluster-service=true
version=v0.3.6
Annotations: deployment.kubernetes.io/revision: 14
Selector: k8s-app=metrics-server,version=v0.3.6
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: k8s-app=metrics-server
version=v0.3.6
Annotations: seccomp.security.alpha.kubernetes.io/pod: docker/default
Service Account: metrics-server
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server-amd64:v0.3.6
Port: 443/TCP
Host Port: 0/TCP
Command:
/metrics-server
--metric-resolution=30s
--kubelet-port=10255
--deprecated-kubelet-completely-insecure=true
--kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
Limits:
cpu: 48m
memory: 95Mi
Requests:
cpu: 48m
memory: 95Mi
Environment: <none>
Mounts: <none>
metrics-server-nanny:
Image: gke.gcr.io/addon-resizer:1.8.10-gke.0
Port: <none>
Host Port: <none>
Command:
/pod_nanny
--config-dir=/etc/config
--cpu=40m
--extra-cpu=0.5m
--memory=35Mi
--extra-memory=4Mi
--threshold=5
--deployment=metrics-server-v0.3.6
--container=metrics-server
--poll-period=300000
--estimator=exponential
--scale-down-delay=24h
--minClusterSize=5
--use-metrics=true
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 5m
memory: 50Mi
Environment:
MY_POD_NAME: (v1:metadata.name)
MY_POD_NAMESPACE: (v1:metadata.namespace)
Mounts:
/etc/config from metrics-server-config-volume (rw)
Volumes:
metrics-server-config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: metrics-server-config
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: metrics-server-v0.3.6-787886f769 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 3m10s (x2 over 5m39s) deployment-controller Scaled up replica set metrics-server-v0.3.6-7c9d64c44 to 1
Normal ScalingReplicaSet 2m54s (x2 over 5m23s) deployment-controller Scaled down replica set metrics-server-v0.3.6-787886f769 to 0
Normal ScalingReplicaSet 2m50s (x2 over 4m49s) deployment-controller Scaled up replica set metrics-server-v0.3.6-787886f769 to 1
Normal ScalingReplicaSet 2m33s (x2 over 4m34s) deployment-controller Scaled down replica set metrics-server-v0.3.6-7c9d64c44 to 0
编辑:2021-03-13
这是 api 服务器部署的配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server-deployment
spec:
replicas: 12
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
spec:
serviceAccountName: api-kubernetes-service-account
nodeSelector:
#<labelname>:value
cloud.google.com/gke-nodepool: api-nodepool
containers:
- name: api-server
image: gcr.io/questions-279902/taskserver:latest
imagePullPolicy: "Always"
ports:
- containerPort: 80
#- containerPort: 443
args:
- --disable_https
- --db_ip_address=127.0.0.1
- --modelserver_address=http://10.128.0.18:8501 # kubectl get service model-service --output yaml
resources:
# You must specify requests for CPU to autoscale
# based on CPU utilization
requests:
cpu: "250m"
- name: cloud-sql-proxy
...
解决方案
我没有看到任何“<a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-types" rel="nofollow noreferrer">resources:”字段(例如cpu,mem等)分配,这应该是根本原因。请注意,在 HPA(Horizontal Pod Autoscaler)上设置资源是一项要求,在Kubernetes 官方文档中进行了解释
请注意,如果某些 Pod 的容器没有设置相关的资源请求,则不会定义 Pod 的 CPU 利用率,并且自动缩放器不会对该指标采取任何操作。
这肯定会导致消息无法读取目标部署上的所有指标。
推荐阅读
- mysql - 空表上的左连接在 MySQL 中返回错误
- elixir - 为什么进程 ID 返回未定义?
- apache-spark - 在 PySPARK 中创建具有从所有其他列创建的值作为 JSON 的列
- c# - => 和 get {} 之间的区别
- r - 获取与值匹配的 data.table 列的索引
- python-3.x - 仅使用 Pymongo 获取关于(时间或 ID)的最新文档
- javascript - 带有装饰器的 JS 验证
- spring - 如何在 Spring 集成 ServiceActivator 上触发多个输出通道?
- android - DatagramSocket 在 Android 9 上没有接收到一些数据包
- css - 试图向我的 CSS 添加代码但它不起作用