首页 > 解决方案 > Kubernetes集群间歇性丢失

问题描述

我一直在尝试诊断几天前刚刚开始的问题。

正在运行kubeletkubeadm版本 1.13.1。集群有 5 个节点,并且在上周末之前已经好几个月了。在具有足够免费资源的 RHEL 7.x 机器上运行它。

有一个奇怪的问题是集群资源(api、调度程序、etcd)变得不可用。这最终会自行纠正,并且集群会再次恢复一段时间。

如果我做一个sudo systemctl restart kubelet集群内的一切工作再次正常,直到间歇性的奇怪发生。

我正在监视journactl日志以查看发生这种情况时发生了什么,突出的部分是:

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.762004 I | etcdserver: skipped leadership transfer for single member cluster
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763648       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1beta1.Event ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.762788       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.762910       1 reflector.go:270] storage/cacher.go:/podsecuritypolicy: watch of *policy.PodSecurityPolicy ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763149       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763232       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763439       1 reflector.go:270] storage/cacher.go:/apiregistration.k8s.io/apiservices: watch of *apiregistration.APIService ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763719       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763786       1 reflector.go:270] storage/cacher.go:/daemonsets: watch of *apps.DaemonSet ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763937       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764016       1 reflector.go:270] storage/cacher.go:/cronjobs: watch of *batch.CronJob ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764250       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764324       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764386       1 reflector.go:270] storage/cacher.go:/services/endpoints: watch of *core.Endpoints ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764440       1 reflector.go:270] storage/cacher.go:/deployments: watch of *apps.Deployment ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: WARNING: 2018/12/26 21:28:06 grpc: addrConn.transportMonitor exits due to: context canceled
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765201 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765384 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")

...

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784805       1 reflector.go:270] storage/cacher.go:/controllerrevisions: watch of *apps.ControllerRevision ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784871       1 reflector.go:270] storage/cacher.go:/pods: watch of *core.Pod ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.786587       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.786700       1 reflector.go:270] storage/cacher.go:/horizontalpodautoscalers: watch of *autoscaling.HorizontalPodAutoscaler ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.788274       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.788385       1 reflector.go:270] storage/cacher.go:/crd.projectcalico.org/clusterinformations: watch of *unstructured.Unstructured ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain oci-systemd-hook[9353]: systemdhook <debug>: 02cb55687848: Skipping as container command is etcd, not init or systemd
Dec 26 15:28:06 thalia0.domain oci-umount[9355]: umounthook <debug>: 02cb55687848: only runs in prestart stage, ignoring
Dec 26 15:28:07 thalia0.domain dockerd-current[1609]: time="2018-12-26T15:28:07.003175741-06:00" level=warning msg="02cb556878485b24e4705dd0efe1051c02f3e3bbbe7b8a7ab23ea71bd6d82b2f cleanup: failed to unmount secrets: invalid argument"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.006714   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.040361   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"

为了减少原木中的噪音,我封锁了其他节点。

如前所述,如果我重新启动kubelet服务,一段时间内一切正常,然后出现间歇性行为。

任何建议都将受到欢迎。我正在与我们的系统管理员合作,他说似乎etcd正在频繁重启。我认为当CrashLoopBackOff事情开始发生时,麻烦就开始了。

标签: dockerkubernetesrhelkubeadmkubelet

解决方案


实际上似乎是一个 RHEL/Docker 错误。请参阅错误 1655214 - docker exec 不适用于 registry.access.redhat.com/rhel7:7.3

我们已经应用了修复程序,到目前为止情况似乎很稳定。


推荐阅读