What triggers a SyncLoop DELETE api call in k8s?

Problem Description

I have an nginx-ingress replicaset running in my cluster with two instances. Two days ago both pods were deleted at the same time (a few milliseconds apart) and two new instances were created in the same replicaset. I don't know what triggered the deletion. In the kubelet logs I can see the following:

kubelet[13317]: I0207 22:01:36.843804 13317 kubelet.go:1918] SyncLoop (DELETE, "api"): "nginx-ingress-public-controller-6bf8d59c4c

Later in the logs a failed liveness probe is listed:

kubelet[13317]: I0207 22:01:42.596603 13317 prober.go:116] Liveness probe for "nginx-ingress-public-controller-6bf8d59c4c (60c3f9e5-e228-44c8-abd5-b0a4a8507b5c):nginx-ingress-controller" failed (failure): HTTP probe failed with statuscode: 500

In theory this could explain the pod deletion, but I'm confused about the order of events. Did this liveness probe fail because the delete command had already killed the underlying docker container, or was it what triggered the deletion?

Tags: kubernetes, kubelet

Solution


It's hard to guess what exactly caused the deletion of your nginx pod without the full logs. Also, as you mention it's a customer environment, there might be many reasons. As I asked in the comments, it might be HPA or Cluster Autoscaler, preemptible nodes, temporary network issues, etc.

Regarding the second part about pod deletion and liveness: the liveness probe failed because the nginx pod was already in the deletion process.

One of Kubernetes' default settings is a grace period of 30 seconds. In short, this means the Pod will stay in Terminating status for up to 30 seconds, and after this time it will be removed.
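If you want to see this grace period yourself, a minimal sketch (replace <namespace>/<pod> with your own pod; these are just placeholders):

# Delete in the background and watch the pod sit in Terminating for up to the grace period:
$ kubectl -n <namespace> delete pod <pod> --wait=false
$ kubectl -n <namespace> get pod <pod> -w

# The effective value comes from spec.terminationGracePeriodSeconds (defaults to 30):
$ kubectl -n <namespace> get pod <pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}'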

Tests

If you would like to verify it yourself, you can run a quick test. It requires a kubeadm master and a change of kubelet verbosity. You can do this by editing the /var/lib/kubelet/kubeadm-flags.env file (you must have root rights) and adding --v=X, where X is a number from 0 to 9. Details of which level shows which logs can be found here. A sketch of these steps follows the list below.

  • Set the verbosity level to at least level=5; I tested with level=8
  • Deploy the Nginx Ingress Controller
  • Delete the Nginx Ingress Controller pod manually
  • Check the logs using $ journalctl -u kubelet; you can use grep to narrow the output and save it to a file ($ journalctl -u kubelet | grep ingress-nginx-controller-s2kfr > nginx.log)
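For reference, one rough way to script these steps on a kubeadm node (assuming the standard kubeadm-flags.env format; the verbosity level and pod name are just the ones from my test):

# Append --v=8 to the kubelet flags and restart kubelet (requires root):
$ sudo sed -i 's/^KUBELET_KUBEADM_ARGS="/&--v=8 /' /var/lib/kubelet/kubeadm-flags.env
$ sudo systemctl restart kubelet

# Delete the controller pod and collect its kubelet log entries:
$ kubectl -n ingress-nginx delete pod ingress-nginx-controller-s2kfr
$ journalctl -u kubelet | grep ingress-nginx-controller-s2kfr > nginx.log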

Below are examples from my tests:

# Liveness and Readiness probes working properly:
Feb 24 14:18:35 kubeadm kubelet[11922]: I0224 14:18:35.399156   11922 prober.go:126] Readiness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" succeeded
Feb 24 14:18:40 kubeadm kubelet[11922]: I0224 14:18:40.587129   11922 prober.go:126] Liveness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" succeeded

# Once the deletion process starts, you can find the DELETE api call and other information:

Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.900957   11922 kubelet.go:1931] SyncLoop (DELETE, "api"): "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)"
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.901057   11922 kubelet_pods.go:1482] Generating status for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)"
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.901914   11922 round_trippers.go:422] GET https://10.154.15.225:6443/api/v1/namespaces/ingress-nginx/pods/ingress-nginx-controller-s2kfr
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.909123   11922 event.go:291] "Event occurred" object="ingress-nginx/ingress-nginx-controller-s2kfr" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container controller"

# This entry occurs because the default grace period was kept
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.947193   11922 kubelet_pods.go:952] Pod "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)" is terminated, but some containers are still running

# As the Pod was being deleted, the probes failed.
Feb 24 14:18:50 kubeadm kubelet[11922]: I0224 14:18:50.584208   11922 prober.go:117] Liveness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" failed (failure): HTTP probe failed with statuscode: 500
Feb 24 14:18:50 kubeadm kubelet[11922]: I0224 14:18:50.584338   11922 event.go:291] "Event occurred" object="ingress-nginx/ingress-nginx-controller-s2kfr" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Liveness probe failed: HTTP probe failed with statuscode: 500"
Feb 24 14:18:52 kubeadm kubelet[11922]: I0224 14:18:52.045155   11922 kubelet_pods.go:952] Pod "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)" is terminated, but some containers are still running
Feb 24 14:18:55 kubeadm kubelet[11922]: I0224 14:18:55.398025   11922 prober.go:117] Readiness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" failed (failure): HTTP probe failed with statuscode: 500

In these logs, the time between SyncLoop (DELETE, "api") and the failed liveness probe is 4 seconds. In other test runs the difference was also a few seconds (4-7 seconds).

If you would like to perform your own test, you can change the readiness and liveness probe period to 1 second (not 10, as is set by default); you would then see probe failures in the same second as the DELETE api call. A sketch of such a change is shown below.
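For example, a hedged sketch using kubectl patch (the deployment name ingress-nginx-controller and container index 0 are assumptions based on a standard ingress-nginx install; adjust the namespace, name and index to your setup):

# Set both probe periods on the first container to 1 second (assumes the probes are already defined in the spec):
$ kubectl -n ingress-nginx patch deployment ingress-nginx-controller --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 1},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 1}
]'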

Feb 24 15:09:40 kubeadm kubelet[11922]: I0224 15:09:40.865718   11922 prober.go:126] Liveness probe for "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6):controller" succeeded
Feb 24 15:09:41 kubeadm kubelet[11922]: I0224 15:09:41.488819   11922 kubelet.go:1931] SyncLoop (DELETE, "api"): "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6)"
...
Feb 24 15:09:41 kubeadm kubelet[11922]: I0224 15:09:41.865422   11922 prober.go:117] Liveness probe for "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6):controller" failed (failure): HTTP probe failed with statuscode: 500

A good explanation of syncLoop can be found in the Alibaba docs:

As indicated in the comments, the syncLoop function is the major cycle of Kubelet. This function listens on the updates, obtains the latest Pod configurations, and synchronizes the running state and desired state. In this way, all Pods on the local node can run in the expected states. Actually, syncLoop only encapsulates syncLoopIteration, while the synchronization operation is carried out by syncLoopIteration.

Conclusion

If you don't have additional logging that saves pod output before termination, it's hard to determine the root cause this long after the event.

In the setup you have described, the liveness probe failed because the nginx-ingress pod was already in the termination process. The liveness probe failure did not trigger the pod deletion; it was the result of that deletion.

In addition, you can also check the Kubelet and Prober source code.
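The file names in the log lines above (kubelet.go, prober.go) point you to the right places in the kubernetes/kubernetes repository; for example:

# syncLoop / syncLoopIteration and the prober live in the kubelet package:
$ git clone --depth 1 https://github.com/kubernetes/kubernetes.git
$ less kubernetes/pkg/kubelet/kubelet.go          # syncLoop, syncLoopIteration
$ less kubernetes/pkg/kubelet/prober/prober.go    # liveness/readiness prober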

