Readiness probe failure on the mgmtproxy pod

Problem Description

I am trying to set up SQL Server BDC on AKS, but the process does not seem to get past a certain point. The AKS cluster is a 3-node cluster built on a Standard_E8_v3 VM scale set.
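
For anyone reproducing this, a cluster like the one described can be created along these lines (the resource group, cluster name, and region here are illustrative):

# Create a 3-node AKS cluster on a Standard_E8_v3 VM scale set
az aks create --resource-group bdc-rg --name bdc-aks --location eastus --node-count 3 --node-vm-size Standard_E8_v3 --vm-set-type VirtualMachineScaleSets --generate-ssh-keys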

Here is the pod list:

C:\Users\rgn>kubectl get pods -n mssql-cluster

NAME              READY   STATUS    RESTARTS   AGE
control-qm754     3/3     Running   0          35m
controldb-0       2/2     Running   0          35m
controlwd-wxrlg   1/1     Running   0          32m
logsdb-0          1/1     Running   0          32m
logsui-mqfcv      1/1     Running   0          32m
metricsdb-0       1/1     Running   0          32m
metricsdc-9frbb   1/1     Running   0          32m
metricsdc-jr5hk   1/1     Running   0          32m
metricsdc-ls7mf   1/1     Running   0          32m
metricsui-pn9qf   1/1     Running   0          32m
mgmtproxy-x4ctb   2/2     Running   0          32m

When I run describe against the mgmtproxy-x4ctb pod, I see the following. Even though the status says it is Running, it is not actually ready (the readiness probe is failing). I believe this is why the deployment is not progressing.
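
For reference, the events below come from a describe call like this:

kubectl describe pod mgmtproxy-x4ctb -n mssql-cluster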

Events:
  Type     Reason     Age                From                                        Message
  ----     ------     ----               ----                                        -------
  Normal   Scheduled  11m                default-scheduler                           Successfully assigned mssql-cluster/mgmtproxy-x4ctb to aks-agentpool-34156060-vmss000002
  Normal   Pulling    11m                kubelet, aks-agentpool-34156060-vmss000002  Pulling image "mcr.microsoft.com/mssql/bdc/mssql-service-proxy:2019-CU4-ubuntu-16.04"
  Normal   Pulled     11m                kubelet, aks-agentpool-34156060-vmss000002  Successfully pulled image "mcr.microsoft.com/mssql/bdc/mssql-service-proxy:2019-CU4-ubuntu-16.04"
  Normal   Created    11m                kubelet, aks-agentpool-34156060-vmss000002  Created container service-proxy
  Normal   Started    11m                kubelet, aks-agentpool-34156060-vmss000002  Started container service-proxy
  Normal   Pulling    11m                kubelet, aks-agentpool-34156060-vmss000002  Pulling image "mcr.microsoft.com/mssql/bdc/mssql-monitor-fluentbit:2019-CU4-ubuntu-16.04"
  Normal   Pulled     11m                kubelet, aks-agentpool-34156060-vmss000002  Successfully pulled image "mcr.microsoft.com/mssql/bdc/mssql-monitor-fluentbit:2019-CU4-ubuntu-16.04"
  Normal   Created    11m                kubelet, aks-agentpool-34156060-vmss000002  Created container fluentbit
  Normal   Started    11m                kubelet, aks-agentpool-34156060-vmss000002  Started container fluentbit
  Warning  Unhealthy  10m (x6 over 11m)  kubelet, aks-agentpool-34156060-vmss000002  Readiness probe failed: cat: /var/run/container.ready: No such file or directory
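
The failure message suggests an exec probe that cats a sentinel file the agent writes once startup completes. Two ways to look closer (the container name service-proxy is a guess based on the events above; try fluentbit as well):

# Dump the pod spec and inspect the readinessProbe stanza for each container
kubectl get pod mgmtproxy-x4ctb -n mssql-cluster -o yaml

# Check whether the sentinel file the probe looks for exists yet
kubectl exec mgmtproxy-x4ctb -n mssql-cluster -c service-proxy -- ls -l /var/run/container.ready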

I have tried twice, and both times it could not get past this point. Judging from the linked threads, this problem seems to have existed only since last month. Can someone point me in the right direction?

Log listing from the proxy pod:

2020/06/13 16:25:35 Setting the directories for 'agent:agent' owner with '-rwxrwxr-x' mode: [/var/opt /var/log /var/run/secrets /var/run/secrets/keytabs /var/run/secrets/certificates /var/run/secrets/credentials /var/opt/agent /var/log/agent /var/run/agent]
2020/06/13 16:25:35 Setting the directories for 'agent:agent' owner with '-rwxrwx---' mode: [/var/opt/agent /var/log/agent /var/run/agent]
2020/06/13 16:25:35 Searching agent configuration file at /opt/agent/conf/mgmtproxy.json
2020/06/13 16:25:35 Searching agent configuration file at /opt/agent/conf/agent.json
2020/06/13 16:25:35.777955 Changed the container umask from '-----w--w-' to '--------w-'
2020/06/13 16:25:35.778031 Setting the directories for 'supervisor:supervisor' owner with '-rwxrwx---' mode: [/var/log/supervisor/log /var/opt/supervisor /var/log/supervisor /var/run/supervisor]
2020/06/13 16:25:35.778170 Setting the directories for 'fluentbit:fluentbit' owner with '-rwxrwx---' mode: [/var/opt/fluentbit /var/log/fluentbit /var/run/fluentbit]
2020/06/13 16:25:35.778411 Agent configuration: {"PodType":"mgmtproxy","ContainerName":"fluentbit","GrpcPort":8311,"HttpsPort":8411,"ScaledSetKind":"ReplicaSet","securityPolicy":"certificate","dnsServicesToWaitFor":null,"cronJobs":null,"serviceJobs":null,"healthModules":null,"logRotation":{"agentLogMaxSize":500,"agentLogRotateCount":3,"serviceLogRotateCount":10},"fileMap":{"fluentbit-certificate.pem":"/var/run/secrets/certificates/fluentbit/fluentbit-certificate.pem","fluentbit-privatekey.pem":"/var/run/secrets/certificates/fluentbit/fluentbit-privatekey.pem","krb5.conf":"/etc/krb5.conf","nsswitch.conf":"/etc/nsswitch.conf","resolv.conf":"/etc/resolv.conf","smb.conf":"/etc/samba/smb.conf"},"userPermissions":{"agent":{"user":"agent","group":"agent","mode":"0770","modeSetgid":false,"directories":[]},"fluentbit":{"user":"fluentbit","group":"","mode":"","modeSetgid":false,"directories":[]},"fundamental":{"user":"agent","group":"agent","mode":"0775","modeSetgid":false,"directories":["/var/opt","/var/log","/var/run/secrets","/var/run/secrets/keytabs","/var/run/secrets/certificates","/var/run/secrets/credentials"]},"supervisor":{"user":"supervisor","group":"supervisor","mode":"0770","modeSetgid":false,"directories":["/var/log/supervisor/log"]}},"fileIgnoreList":["agent-certificate.pem","agent-privatekey.pem"],"InstanceId":"t4KLx1m5vDsHCHc038KgKHH5HOcQVR0Z","ContainerId":"","StartServicesImmediately":false,"DisableFileDownloads":false,"DisableHealthChecks":false,"serviceFencingEnabled":false,"isPrivileged":true,"IsConfigurationManagerEnabled":false,"LWriter":{"filename":"/var/log/agent/agent.log","maxsize":500,"maxage":0,"maxbackups":10,"localtime":true,"compress":false}}
2020/06/13 16:25:36.316209 Attempting to join cluster...
2020/06/13 16:25:36.316301 Source directory /var/opt/secrets/certificates/ca does not exist
2020/06/13 16:25:36.316520 [Reaper] Starting the signal loop for reaper
2020/06/13 16:25:40.642164 [Reaper] Received SIGCHLD signal. Starting process reaper.
2020/06/13 16:25:40.652703 Starting secure gRPC listener on 0.0.0.0:8311
2020/06/13 16:25:40.943805 Cluster join successful.
2020/06/13 16:25:40.943846 Stopping gRPC listener on 0.0.0.0:8311
2020/06/13 16:25:40.944704 Getting manifest from controller...
2020/06/13 16:25:40.964774 Downloading '/config/scaledsets/mgmtproxy/containers/fluentbit/files/fluentbit-certificate.pem' from controller...
2020/06/13 16:25:40.964816 Downloading '/config/scaledsets/mgmtproxy/containers/fluentbit/files/fluentbit-privatekey.pem' from controller...
2020/06/13 16:25:40.987309 Stored 1206 bytes to /var/run/secrets/certificates/fluentbit/fluentbit-certificate.pem
2020/06/13 16:25:40.992108 Stored 1694 bytes to /var/run/secrets/certificates/fluentbit/fluentbit-privatekey.pem
2020/06/13 16:25:40.992235 Agent is ready.
2020/06/13 16:25:40.992348 Starting supervisord with command: '[supervisord --nodaemon -c /etc/supervisord.conf]'
2020/06/13 16:25:40.992719 Started supervisord with pid=1437
2020/06/13 16:25:40.993030 Starting secure gRPC listener on 0.0.0.0:8311
2020/06/13 16:25:40.996580 Starting HTTPS listener on 0.0.0.0:8411
2020/06/13 16:25:41.998667 [READINESS] Not all supervisord processes are ready. Attempts: 1, Max attempts: 250
2020/06/13 16:25:41.999567 Loading go plugin plugins/bdc.so
2020/06/13 16:25:41.999588 Loading go plugin plugins/platform.so
2020/06/13 16:25:41.999600 Starting the health monitoring, number of modules: 2, services: ["fluentbit","agent"]
2020/06/13 16:25:41.999605 Starting the health service
2020/06/13 16:25:41.999609 Starting the health durable store
2020/06/13 16:25:41.999614 Loading existing health properties from /var/opt/agent/health/health-properties-main.gob
2020/06/13 16:25:41.999642 No existing file path for file: /var/opt/agent/health/health-properties-main.gob
2020/06/13 16:25:42.640719 Adding a new plugin plugins/bdc.so 
2020/06/13 16:25:43.302872 Adding a new plugin plugins/platform.so 
2020/06/13 16:25:43.302932 Created a health module watcher for service 'fluentbit'
2020/06/13 16:25:43.302948 Starting a new watcher for health module: fluentbit 
2020/06/13 16:25:43.302983 Starting a new watcher for health module: agent 
2020/06/13 16:25:43.302992 Health monitoring started
2020/06/13 16:25:53.000908 [READINESS] All services marked as ready.
2020/06/13 16:25:53.000966 [READINESS] Container is now ready.
2020/06/13 16:26:01.995093 [MONITOR] Service states: map[fluentbit:RUNNING]
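
Interestingly, the agent log ends with the container reporting itself ready at 16:25:53, so the probe failures in the events look like startup-time noise rather than the real blocker. One way to see where the deployment as a whole is stuck is to ask the controller directly (assuming azdata is installed and the controller endpoint is reachable):

azdata login --namespace mssql-cluster
azdata bdc status show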

Tags: deployment, azure-aks, sql-server-2019

Solution


All,

Finally got it resolved.

There were several issues with our Azure policy and network policy.

(1) It was not allowing new IP addresses to be assigned to the load balancer.
(2) The gateway proxy was not getting new IP addresses because we had exhausted our quota of a maximum of 10 allowed public IPs.
(3) The desktop from which I started the deployment was not able to reach the controller service IP address and port (see the checks sketched below).
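
For anyone hitting the same wall, these are the kinds of checks that would have surfaced the three issues earlier. The region, resource group, and cluster names are illustrative; controller-svc-external and port 30080 are, to my knowledge, the BDC defaults:

# (1)/(2) Compare allocated public IPs against the regional quota
az network list-usages --location eastus --output table

# Find the cluster's managed (node) resource group, then list its public IPs
az aks show --resource-group bdc-rg --name bdc-aks --query nodeResourceGroup --output tsv
az network public-ip list --resource-group <node-resource-group> --output table

# (3) Find the controller's external endpoint, then test it from the desktop (PowerShell)
kubectl get svc controller-svc-external -n mssql-cluster
Test-NetConnection <external-ip> -Port 30080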

We resolved the above issues one by one, and we are now in the final stage.

The IP addresses are expected to be static, but they are generated dynamically, so they cannot be configured ahead of time. How have others handled this with their network/Azure infrastructure teams?
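
Not an authoritative answer, but one common pattern on AKS is to create the public IP yourself as a static resource and pin the LoadBalancer service to it, so the network team can whitelist the address before anything is deployed. A minimal sketch, assuming a Standard load balancer and the default external controller service name:

# Pre-create a static public IP in the cluster's node resource group
az network public-ip create --resource-group <node-resource-group> --name bdc-controller-ip --sku Standard --allocation-method Static

# Pin the existing service to the pre-created address
kubectl patch svc controller-svc-external -n mssql-cluster -p "{\"spec\":{\"loadBalancerIP\":\"<static-ip>\"}}"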

Thanks, rgn

