GCP: Installing Kafka on GKE - zookeeper not starting

Problem description

I created a Kubernetes cluster on GCP (GKE) and am trying to install Kafka on it (reference - https://snourian.com/kafka-kubernetes-strimzi-part-1-creating-deploying-strimzi-kafka/).

When I deploy the Kafka cluster, Zookeeper does not start:

karan@cloudshell:~/strimzi-0.26.0 (versa-kafka-poc)$ kubectl get pv,pvc,pods -n kafka
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS   REASON   AGE
persistentvolume/pvc-96957b25-f49b-4598-869c-a73b32325bc7   2Gi        RWO            Delete           Bound    kafka/data-my-cluster-zookeeper-0   standard                6m17s

NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/data-my-cluster-zookeeper-0   Bound    pvc-96957b25-f49b-4598-869c-a73b32325bc7   2Gi        RWO            standard       6m20s

NAME                                         READY   STATUS    RESTARTS   AGE
pod/my-cluster-zookeeper-0                   0/1     Pending   0          6m18s
pod/strimzi-cluster-operator-85bb4c6-cfl4p   1/1     Running   0          8m29s


karan@cloudshell:~/strimzi-0.26.0 (versa-kafka-poc)$ kc describe pod my-cluster-zookeeper-0 -n kafka
Name:           my-cluster-zookeeper-0
Namespace:      kafka
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/instance=my-cluster
                app.kubernetes.io/managed-by=strimzi-cluster-operator
                app.kubernetes.io/name=zookeeper
                app.kubernetes.io/part-of=strimzi-my-cluster
                controller-revision-hash=my-cluster-zookeeper-867c478fc4
                statefulset.kubernetes.io/pod-name=my-cluster-zookeeper-0
                strimzi.io/cluster=my-cluster
                strimzi.io/kind=Kafka
                strimzi.io/name=my-cluster-zookeeper
Annotations:    strimzi.io/cluster-ca-cert-generation: 0
                strimzi.io/generation: 0
                strimzi.io/logging-hash: 0f057cb0003c78f02978b83e4fabad5bd508680c
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/my-cluster-zookeeper
Containers:
  zookeeper:
    Image:       quay.io/strimzi/kafka:0.26.0-kafka-3.0.0
    Ports:       2888/TCP, 3888/TCP, 2181/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /opt/kafka/zookeeper_run.sh
    Limits:
      cpu:     1500m
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   1Gi
    Liveness:   exec [/opt/kafka/zookeeper_healthcheck.sh] delay=15s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/opt/kafka/zookeeper_healthcheck.sh] delay=15s timeout=5s period=10s #success=1 #failure=3
    Environment:
      ZOOKEEPER_METRICS_ENABLED:         false
      ZOOKEEPER_SNAPSHOT_CHECK_ENABLED:  true
      STRIMZI_KAFKA_GC_LOG_ENABLED:      false
      DYNAMIC_HEAP_FRACTION:             0.75
      DYNAMIC_HEAP_MAX:                  2147483648
      ZOOKEEPER_CONFIGURATION:           tickTime=2000
                                         initLimit=5
                                         syncLimit=2
                                         autopurge.purgeInterval=1

    Mounts:
      /opt/kafka/cluster-ca-certs/ from cluster-ca-certs (rw)
      /opt/kafka/custom-config/ from zookeeper-metrics-and-logging (rw)
      /opt/kafka/zookeeper-node-certs/ from zookeeper-nodes (rw)
      /tmp from strimzi-tmp (rw)
      /var/lib/zookeeper from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cgm22 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-my-cluster-zookeeper-0
    ReadOnly:   false
  strimzi-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Mi
  zookeeper-metrics-and-logging:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-cluster-zookeeper-config
    Optional:  false
  zookeeper-nodes:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-zookeeper-nodes
    Optional:    false
  cluster-ca-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-cluster-ca-cert
    Optional:    false
  kube-api-access-cgm22:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   10m                 default-scheduler   0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling   40s (x10 over 10m)  default-scheduler   0/3 nodes are available: 3 Insufficient cpu.
  Normal   NotTriggerScaleUp  37s (x61 over 10m)  cluster-autoscaler  pod didn't trigger scale-up:

Here is the yaml file used to create the cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster #1
spec:
  kafka:
    version: 3.0.0
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      log.message.format.version: "3.0"
      inter.broker.protocol.version: "3.0"
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 2Gi
        deleteClaim: false
    logging: #9
      type: inline
      loggers:
        kafka.root.logger.level: "INFO"
  zookeeper:
    replicas: 1
    storage:
      type: persistent-claim
      size: 2Gi
      deleteClaim: false
    resources:
      requests:
        memory: 1Gi
        cpu: "1"
      limits:
        memory: 2Gi
        cpu: "1.5"
    logging:
      type: inline
      loggers:
        zookeeper.root.logger: "INFO"
  entityOperator: #11
    topicOperator: {}
    userOperator: {}

The PersistentVolume shows as Bound to the PersistentVolumeClaim, but zookeeper does not start and the scheduler reports insufficient CPU on the nodes.

Any pointers on what needs to be done?

On 2 of the 3 nodes, the allocated CPU limits are 0%:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests     Limits
  --------                   --------     ------
  cpu                        483m (51%)   0 (0%)
  memory                     410Mi (14%)  890Mi (31%)
  ephemeral-storage          0 (0%)       0 (0%)
  hugepages-1Gi              0 (0%)       0 (0%)
  hugepages-2Mi              0 (0%)       0 (0%)


3rd node:

  Resource                   Requests         Limits
  --------                   --------         ------
  cpu                        511m (54%)       1143m (121%)
  memory                     868783744 (29%)  1419Mi (50%)


kc describe pod my-cluster-zookeeper-0 -n kafka


karan@cloudshell:~ (versa-kafka-poc)$ kc describe pod my-cluster-zookeeper-0 -n kafka
Name:           my-cluster-zookeeper-0
Namespace:      kafka
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/instance=my-cluster
                app.kubernetes.io/managed-by=strimzi-cluster-operator
                app.kubernetes.io/name=zookeeper
                app.kubernetes.io/part-of=strimzi-my-cluster
                controller-revision-hash=my-cluster-zookeeper-867c478fc4
                statefulset.kubernetes.io/pod-name=my-cluster-zookeeper-0
                strimzi.io/cluster=my-cluster
                strimzi.io/kind=Kafka
                strimzi.io/name=my-cluster-zookeeper
Annotations:    strimzi.io/cluster-ca-cert-generation: 0
                strimzi.io/generation: 0
                strimzi.io/logging-hash: 0f057cb0003c78f02978b83e4fabad5bd508680c
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/my-cluster-zookeeper
Containers:
  zookeeper:
    Image:       quay.io/strimzi/kafka:0.26.0-kafka-3.0.0
    Ports:       2888/TCP, 3888/TCP, 2181/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /opt/kafka/zookeeper_run.sh
    Limits:
      cpu:     1500m
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   1Gi
    Liveness:   exec [/opt/kafka/zookeeper_healthcheck.sh] delay=15s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/opt/kafka/zookeeper_healthcheck.sh] delay=15s timeout=5s period=10s #success=1 #failure=3
    Environment:
      ZOOKEEPER_METRICS_ENABLED:         false
      ZOOKEEPER_SNAPSHOT_CHECK_ENABLED:  true
      STRIMZI_KAFKA_GC_LOG_ENABLED:      false
      DYNAMIC_HEAP_FRACTION:             0.75
      DYNAMIC_HEAP_MAX:                  2147483648
      ZOOKEEPER_CONFIGURATION:           tickTime=2000
                                         initLimit=5
                                         syncLimit=2
                                         autopurge.purgeInterval=1

    Mounts:
      /opt/kafka/cluster-ca-certs/ from cluster-ca-certs (rw)
      /opt/kafka/custom-config/ from zookeeper-metrics-and-logging (rw)
      /opt/kafka/zookeeper-node-certs/ from zookeeper-nodes (rw)
      /tmp from strimzi-tmp (rw)
      /var/lib/zookeeper from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cgm22 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-my-cluster-zookeeper-0
    ReadOnly:   false
  strimzi-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Mi
  zookeeper-metrics-and-logging:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-cluster-zookeeper-config
    Optional:  false
  zookeeper-nodes:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-zookeeper-nodes
    Optional:    false
  cluster-ca-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-cluster-ca-cert
    Optional:    false
  kube-api-access-cgm22:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   5h27m                   default-scheduler   0/3 nodes are available: 3 Insufficient cpu.
  Normal   NotTriggerScaleUp  28m (x1771 over 5h26m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added):
  Normal   NotTriggerScaleUp  4m17s (x91 over 19m)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached
  Warning  FailedScheduling   80s (x19 over 20m)      default-scheduler   0/3 nodes are available: 3 Insufficient cpu.

Tags: kubernetes, apache-kafka, google-kubernetes-engine, strimzi

Solution


A pod cannot be scheduled when it requests more CPU than any node in the cluster can still allocate. If the existing pods have already consumed the nodes' allocatable CPU, no further pods can be scheduled until some of the existing ones are removed. When sizing with the Horizontal Pod Autoscaler (HPA), a simple rule of thumb is: RESOURCE REQUEST CPU * HPA MAX PODS <= Total Kubernetes CPU
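As a rough sanity check against the numbers in the question (the per-node allocatable CPU is inferred from the percentages shown, so treat ~940m as an approximation):

  Nodes 1-2:  requests 483m = 51%  =>  allocatable ≈ 940m,  free ≈ 940m - 483m ≈ 457m
  Node 3:     requests 511m = 54%  =>  allocatable ≈ 940m,  free ≈ 940m - 511m ≈ 429m
  Zookeeper:  requests cpu: 1 (= 1000m)

  1000m exceeds the free CPU on every node, hence "0/3 nodes are available: 3 Insufficient cpu."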

Check each node with kubectl describe node xxxx. You will likely find that the CPU already requested on each node is high relative to its allocatable capacity (just over 50% in your case, on small nodes), leaving less than the 1 CPU your zookeeper pod requests. You may need to free up resources on a node (for example, by deleting any unneeded, unused pods) so that the new pod can be scheduled onto it. For more on the Insufficient cpu error, see the linked reference.
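For example, the following standard kubectl commands (with <node-name> as a placeholder) show what is consuming CPU on a node:

  # Allocatable CPU plus the per-pod requests/limits already scheduled on the node
  kubectl describe node <node-name>

  # All pods running on that node, across namespaces
  kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

  # Actual usage per node, if metrics-server is installed
  kubectl top nodes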

For the "pod has unbound immediate PersistentVolumeClaims" warning, see Fix - pod has unbound immediate PersistentVolumeClaims and the related Stack post.
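Alternatively (not part of the original answer, just a sketch consistent with the numbers above): if zookeeper does not actually need a full CPU, you could lower its request in the Kafka custom resource so that it fits within a node's free allocatable CPU, for example:

  zookeeper:
    replicas: 1
    storage:
      type: persistent-claim
      size: 2Gi
      deleteClaim: false
    resources:
      requests:
        memory: 1Gi
        cpu: "400m"   # example value; must fit within a node's free allocatable CPU
      limits:
        memory: 2Gi
        cpu: "1"

The other direction, suggested by the NotTriggerScaleUp events above ("it wouldn't fit if a new node is added", "max node group size reached"), would be a node pool with larger machine types or a higher maximum node count.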

