kube-state-metrics error "No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1)"

Problem description

For kube-state-metrics, I am getting the error "No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1)". Please guide me on how to resolve this issue.

admin@ip-172-20-58-79:~/kubernetes-prometheus$ kubectl describe po -n kube-system kube-state-metrics-747bcc4d7d-kfn7t

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  3s (x20 over 4m)  default-scheduler  No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1).

Is this issue related to memory on the nodes? If so, how can I confirm it? I checked all the nodes: only one appears to be above 80%, and the rest sit between 45% and 70% memory usage.
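For what it's worth, the two predicate counts point at scheduling constraints rather than memory pressure: MatchNodeSelector (7) means seven nodes fail the pod's nodeSelector, and PodToleratesNodeTaints (1) means one node (often the master) carries a taint the pod does not tolerate. Standard kubectl commands can confirm both, plus memory, as sketched below (the grep patterns are just one way to slice the output):

$ kubectl get nodes --show-labels                           # do the nodes carry the kubernetes.io/os=linux label?
$ kubectl describe nodes | grep -i taints                   # which node is tainted, and with what?
$ kubectl describe nodes | grep -A 5 "Allocated resources"  # memory requests vs. allocatable per node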

Node with 44% memory usage:

Total cluster memory usage:

The following screenshot shows kube-state-metrics (0/1 up):

[screenshot]

Also, Prometheus shows kubernetes-pods (0/0 up). Is that because kube-state-metrics is not working, or is it something else? And why is kubernetes-apiservers (0/1 up), as seen in the screenshot above, not up? How do I fix it?

[screenshot]

admin@ip-172-20-58-79:~/kubernetes-prometheus$ sudo tail -f /var/log/kube-apiserver.log | grep error

I0110 10:15:37.153827       7 logs.go:41] http: TLS handshake error from 172.20.44.75:60828: remote error: tls: bad certificate
I0110 10:15:42.153543       7 logs.go:41] http: TLS handshake error from 172.20.44.75:60854: remote error: tls: bad certificate
I0110 10:15:47.153699       7 logs.go:41] http: TLS handshake error from 172.20.44.75:60898: remote error: tls: bad certificate
I0110 10:15:52.153788       7 logs.go:41] http: TLS handshake error from 172.20.44.75:60936: remote error: tls: bad certificate
I0110 10:15:57.154014       7 logs.go:41] http: TLS handshake error from 172.20.44.75:60992: remote error: tls: bad certificate
E0110 10:15:58.929167       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58104: write: connection reset by peer
E0110 10:15:58.931574       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58098: write: connection reset by peer
E0110 10:15:58.933864       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58088: write: connection reset by peer
E0110 10:16:00.842018       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58064: write: connection reset by peer
E0110 10:16:00.844301       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58058: write: connection reset by peer
E0110 10:18:17.275590       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37402: write: connection reset by peer
E0110 10:18:17.275705       7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.276401       7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.277808       7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37392: write: connection reset by peer

Update after MaggieO's reply:

admin@ip-172-20-58-79:~/kubernetes-prometheus/kube-state-metrics-configs$ cat   deployment.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.8.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.8.0
    spec:
      containers:
      - image: quay.io/coreos/kube-state-metrics:v1.8.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics

Also, I want to add the following command block to the deployment.yaml above, but I get an indentation error. Please show me where I should add it.

command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
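In a Deployment manifest, command: must be nested inside the container entry, at the same indentation level as image: and name:. A minimal sketch of the container section above with the block spliced in (note that these particular flags are metrics-server flags, which the solution below applies to metrics-server-deployment.yaml rather than to kube-state-metrics):

spec:
  containers:
  - image: quay.io/coreos/kube-state-metrics:v1.8.0
    # command: sits at the same indentation level as image: and name:
    command:
    - /metrics-server
    - --kubelet-insecure-tls
    - --kubelet-preferred-address-types=InternalIP
    name: kube-state-metrics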

Update 2: @MaggieO Even after adding the command/args, it still shows the same error and the pod stays in Pending state:

Updated deployment.yaml:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"},"name":"kube-state-metrics","namespace":"kube-system"},"spec":{"replicas":1,"selector":{"matchLabels":{"app.kubernetes.io/name":"kube-state-metrics"}},"template":{"metadata":{"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"}},"spec":{"containers":[{"args":["--kubelet-insecure-tls","--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"],"image":"quay.io/coreos/kube-state-metrics:v1.8.0","imagePullPolicy":"Always","livenessProbe":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":5,"timeoutSeconds":5},"name":"kube-state-metrics","ports":[{"containerPort":8080,"name":"http-metrics"},{"containerPort":8081,"name":"telemetry"}],"readinessProbe":{"httpGet":{"path":"/","port":8081},"initialDelaySeconds":5,"timeoutSeconds":5}}],"nodeSelector":{"kubernetes.io/os":"linux"},"serviceAccountName":"kube-state-metrics"}}}}
  creationTimestamp: 2020-01-10T05:33:13Z
  generation: 4
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.8.0
  name: kube-state-metrics
  namespace: kube-system
  resourceVersion: "178851301"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-state-metrics
  uid: b20aa645-336a-11ea-9618-0607d7cb72ed
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.8.0
    spec:
      containers:
      - args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        image: quay.io/coreos/kube-state-metrics:v1.8.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 8081
          name: telemetry
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: 2020-01-10T05:33:13Z
    lastUpdateTime: 2020-01-10T05:33:13Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2020-01-15T07:24:27Z
    lastUpdateTime: 2020-01-15T07:29:12Z
    message: ReplicaSet "kube-state-metrics-7f8c9c6c8d" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 4
  replicas: 2
  unavailableReplicas: 2
  updatedReplicas: 1

Update 3: As shown in the screenshot below, it is unable to fetch nodes. Please tell me how to fix this.

[screenshot]
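Since the screenshot is the only evidence here, the same failure can usually be read straight from the metrics-server logs (standard kubectl; deploy/metrics-server assumes the Deployment keeps its default name):

$ kubectl -n kube-system logs deploy/metrics-server --tail=20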

Tags: kubernetes, prometheus, kubernetes-pod, kube-apiserver, kube-state-metrics

Solution


The error on the kubernetes-apiservers target, Get https:// ...: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 172.20.58.79, means that the control-plane nodes are placed randomly and the apiEndpoint only changes when a node is removed from the cluster, so the certificate mismatch is not immediately noticeable; it only surfaces as the nodes in the cluster change.
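To see which names the serving certificate actually covers, you can inspect its subject alternative names directly (the path assumes the kube-apiserver.pem mentioned below sits under /srv/kubernetes; adjust it to your layout):

$ openssl x509 -in /srv/kubernetes/kube-apiserver.pem -noout -text | grep -A 1 "Subject Alternative Name"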

Workaround fix: manually sync kube-apiserver.pem between the master nodes and restart the kube-apiserver containers.
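A rough sketch of that workaround, assuming the certificate path above and that the apiserver runs as a docker container on each master (the hostname is a placeholder):

$ scp /srv/kubernetes/kube-apiserver.pem admin@<other-master>:/srv/kubernetes/
$ docker ps | grep kube-apiserver   # find the apiserver container on each master
$ docker restart <container-id>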

You can also delete the apiserver.* and apiserver-kubelet-client.* certificates and recreate them with the following commands:

$ kubeadm init phase certs apiserver --config=/etc/kubernetes/kubeadm-config.yaml
$ kubeadm init phase certs apiserver-kubelet-client --config=/etc/kubernetes/kubeadm-config.yaml
$ systemctl stop kubelet
# delete the docker container with kubelet
$ systemctl restart kubelet
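Note that kubeadm init phase certs skips generation when the certificate files already exist, so the stale pair has to be removed first; a sketch assuming the default kubeadm certificate directory /etc/kubernetes/pki:

$ rm /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.key
$ rm /etc/kubernetes/pki/apiserver-kubelet-client.crt /etc/kubernetes/pki/apiserver-kubelet-client.key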

Similar issues: x509 certificate, kubelet-x509

Then address the metrics-server issue.

Edit the metrics-server-deployment.yaml file and set the following parameters:

command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP

The metrics server is now able to communicate with the nodes (it was failing before because it could not resolve the nodes' hostnames).
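Once the change is applied, a quick sanity check is whether the metrics API starts answering (standard kubectl):

$ kubectl apply -f metrics-server-deployment.yaml
$ kubectl top nodes   # should list per-node CPU/memory once metrics-server reaches the kubelets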

You can find more information here: metrics-server-issue

