RabbitMQ pod on Kubernetes is stuck in the pod initializing state

Problem description

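I am running a 3-node RabbitMQ cluster on Kubernetes. The Kubernetes cluster runs on AWS Spot instances, and one of the Kubernetes nodes, which had one of the RabbitMQ pods running on it, was terminated unexpectedly. The pod then got scheduled on another node, and since then my RabbitMQ pod has been stuck in the pod initializing state.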

The Kubernetes events show "FailedPostStartHook".

Logs:

9m46s       Warning   FailedPostStartHook      pod/rabbitmq-0   Exec lifecycle hook ([/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_devops(c96c1a6e-bf9a-450d-828d-ed0e8a0ad949)" failed - error: command '/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running
In addition to the diagnostics info below:
 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local']
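
A side note on the "exited with 137" part of the event: by shell convention, an exit status above 128 usually means the process was terminated by a signal, and the signal number is the status minus 128. The sketch below decodes it. The function name signal_from_exit_code is made up for illustration and is not part of any tool.

```shell
# Decode a shell exit status into the signal that ended the process.
# signal_from_exit_code is a hypothetical helper name.
signal_from_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    # kill -l <number> prints the signal name for that number (without "SIG").
    kill -l "$((code - 128))"
  else
    # Statuses up to 128 are ordinary exit codes, not signals.
    echo "none"
  fi
}

signal_from_exit_code 137   # prints KILL
```

So the hook's shell was most likely killed from outside with SIGKILL while it was still looping on node_health_check, rather than exiting by itself.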

Kubernetes StatefulSet manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: devops
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: rabbitmq
  serviceName: rabbitmq-service
  template:
    metadata:
      annotations:
      labels:
        app: rabbitmq
      name: rabbitmq
    spec:
      containers:
      - env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_BASIC_AUTH
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_NODENAME
          value: rabbit@$(HOSTNAME).rabbitmq-service.$(NAMESPACE).svc.cluster.local
        - name: K8S_SERVICE_NAME
          value: rabbitmq-service
        - name: RABBITMQ_DEFAULT_USER
          value: admin
        - name: RABBITMQ_DEFAULT_PASS
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_ERLANG_COOKIE
          value: some-cookie
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
        livenessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: rabbitmq
        ports:
        - containerPort: 4369
          protocol: TCP
        - containerPort: 5672
          protocol: TCP
        - containerPort: 5671
          protocol: TCP
        - containerPort: 25672
          protocol: TCP
        - containerPort: 15672
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        volumeMounts:
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-data
        - mountPath: /etc/rabbitmq
          name: config
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/bash
        - -euc
        - |
          rm -f /var/lib/rabbitmq/.erlang.cookie
          cp /rabbitmqconfig/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
          cp /rabbitmqconfig/enabled_plugins /etc/rabbitmq/enabled_plugins
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: Always
        name: copy-rabbitmq-config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /rabbitmqconfig
          name: rabbitmq-configmap
        - mountPath: /etc/rabbitmq
          name: config
        - mountPath: /var/lib/rabbitmq
          name: rabbitmq-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rabbitmq
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
          name: rabbitmq-configmap
        name: rabbitmq-configmap
      - emptyDir: {}
        name: config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: rabbitmq-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: gp2
      volumeMode: Filesystem

Things I have tried:

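  1. Logged in to the stuck pod and executed the following (the command just hung, with no response):

rabbitmqctl stop_app

  2. Tried to force-delete the pod, but with no luck.

  3. Logged in to the stuck pod and executed:

rabbitmqctl reset

  4. Logged in to the stuck pod and executed:

rabbitmqctl force_boot

  5. Logged in to the stuck pod and executed:

rm /var/log/rabbitmq/*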

None of the above helped.

Note that the other 2 RabbitMQ nodes are running fine and serving traffic, and they report the failed node as up:

rabbitmq-2 rabbitmq 2021-07-04 12:19:07.233 [info] <0.490.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
rabbitmq-1 rabbitmq 2021-07-04 12:19:07.208 [info] <0.494.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up 

Tags: docker, kubernetes, rabbitmq, queue, stateful

Solution


Running a rollout restart of the StatefulSet worked for me.

kubectl rollout restart statefulset rabbitmq -n devops

After this command, the RabbitMQ cluster came up and all three nodes joined the cluster without any problems.

After doing this, the applications connected to this RabbitMQ cluster need to be restarted.
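
The restart can be followed by a check that the rollout finished and that the nodes rejoined. This is a minimal sketch, assuming the namespace devops, the StatefulSet name rabbitmq and the container name rabbitmq from the manifest in the question. The function name restart_and_verify and the 10 minute timeout are arbitrary choices for illustration.

```shell
# Defaults assumed from the manifest in the question; override via environment.
NAMESPACE="${NAMESPACE:-devops}"
STATEFULSET="${STATEFULSET:-rabbitmq}"

# restart_and_verify is a hypothetical helper name.
restart_and_verify() {
  # Trigger a rolling restart of all pods in the StatefulSet.
  kubectl rollout restart statefulset "$STATEFULSET" -n "$NAMESPACE" || return 1
  # Block until every replica is updated and ready, or the timeout expires.
  kubectl rollout status statefulset "$STATEFULSET" -n "$NAMESPACE" --timeout=10m || return 1
  # Ask one node for its view of the cluster membership.
  kubectl exec "${STATEFULSET}-0" -n "$NAMESPACE" -c rabbitmq -- rabbitmqctl cluster_status
}
```

Calling restart_and_verify with no arguments runs the three steps in order and stops at the first failure. In the cluster_status output, all three nodes should appear among the running nodes.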

