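kubernetes - Why does Redis in K8s keep restarting?
Problem description
The Redis pod restarts like crazy. How can I find out the reason for this behavior?
I figured out that the resource quota should be raised, but I don't know the best CPU/RAM ratio. And why are there no crash events or logs?
Here is the pod: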
> kubectl get pods
redis-master-5d9cfb54f8-8pbgq 1/1 Running 33 3d16h
Here are the logs:
> kubectl logs --follow redis-master-5d9cfb54f8-8pbgq
[1] 08 Sep 07:02:12.152 # Server started, Redis version 2.8.19
[1] 08 Sep 07:02:12.153 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[1] 08 Sep 07:02:12.153 * The server is now ready to accept connections on port 6379
[1] 08 Sep 07:03:13.085 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 07:03:13.085 * Background saving started by pid 8
[8] 08 Sep 07:03:13.101 * DB saved on disk
[8] 08 Sep 07:03:13.101 * RDB: 0 MB of memory used by copy-on-write
[1] 08 Sep 07:03:13.185 * Background saving terminated with success
[1] 08 Sep 07:04:14.018 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 07:04:14.018 * Background saving started by pid 9
...
[93] 08 Sep 08:38:30.160 * DB saved on disk
[93] 08 Sep 08:38:30.164 * RDB: 2 MB of memory used by copy-on-write
[1] 08 Sep 08:38:30.259 * Background saving terminated with success
[1] 08 Sep 08:39:31.072 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 08:39:31.074 * Background saving started by pid 94
Here are the previous logs of the same pod:
> kubectl logs --previous --follow redis-master-5d9cfb54f8-8pbgq
[1] 08 Sep 09:41:46.057 * Background saving terminated with success
[1] 08 Sep 09:42:47.073 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 09:42:47.076 * Background saving started by pid 140
[140] 08 Sep 09:43:14.398 * DB saved on disk
[140] 08 Sep 09:43:14.457 * RDB: 1 MB of memory used by copy-on-write
[1] 08 Sep 09:43:14.556 * Background saving terminated with success
[1] 08 Sep 09:44:15.073 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 09:44:15.077 * Background saving started by pid 141
[1 | signal handler] (1599558267) Received SIGTERM scheduling shutdown...
[1] 08 Sep 09:44:28.052 # User requested shutdown...
[1] 08 Sep 09:44:28.052 # There is a child saving an .rdb. Killing it!
[1] 08 Sep 09:44:28.052 * Saving the final RDB snapshot before exiting.
[1] 08 Sep 09:44:49.592 * DB saved on disk
[1] 08 Sep 09:44:49.592 # Redis is now ready to exit, bye bye...
Here is the description of the pod. As you can see, the limit is 100Mi, but I can't see the threshold after which the pod restarts.
> kubectl describe pod redis-master-5d9cfb54f8-8pbgq
Name: redis-master-5d9cfb54f8-8pbgq
Namespace: cryptoman
Priority: 0
Node: gke-my-cluster-default-pool-818613a8-smmc/10.172.0.28
Start Time: Fri, 04 Sep 2020 18:52:17 +0300
Labels: app=redis
pod-template-hash=5d9cfb54f8
role=master
tier=backend
Annotations: <none>
Status: Running
IP: 10.36.2.124
IPs: <none>
Controlled By: ReplicaSet/redis-master-5d9cfb54f8
Containers:
master:
Container ID: docker://3479276666a41df502f1f9eb9bb2ff9cfa592f08a33e656e44179042b6233c6f
Image: k8s.gcr.io/redis:e2e
Image ID: docker-pullable://k8s.gcr.io/redis@sha256:f066bcf26497fbc55b9bf0769cb13a35c0afa2aa42e737cc46b7fb04b23a2f25
Port: 6379/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 09 Sep 2020 10:27:56 +0300
Last State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Wed, 09 Sep 2020 07:34:18 +0300
Finished: Wed, 09 Sep 2020 10:27:55 +0300
Ready: True
Restart Count: 42
Limits:
cpu: 100m
memory: 250Mi
Requests:
cpu: 100m
memory: 250Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5tds9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-5tds9:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5tds9
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Pod sandbox changed, it will be killed and re-created.
Normal Killing 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Stopping container master
Normal Created 52m (x43 over 4d16h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Created container master
Normal Started 52m (x43 over 4d16h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Started container master
Normal Pulled 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Container image "k8s.gcr.io/redis:e2e" already present on machine
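Solution
These limits are what make it restart. When the container hits the CPU limit it is only throttled, but when it exceeds the memory limit it is OOM-killed.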
Limits:
cpu: 100m
memory: 250Mi
Reason: OOMKilled
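To answer the "how can I find out the reason" part without reading the full `kubectl describe` output, the same information can be pulled from the pod's JSON. The sketch below is illustrative only: it assumes the pod object has been fetched with `kubectl get pod <name> -o json`, and the helper name `last_termination_reasons` and the trimmed sample document are made up for this example. It reads the `status.containerStatuses[].lastState.terminated.reason` field, which is where `OOMKilled` shows up.

```python
import json


def last_termination_reasons(pod: dict) -> dict:
    """Map each container name to the reason of its last termination (or None)."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        reasons[status["name"]] = terminated.get("reason") if terminated else None
    return reasons


# Hand-written, heavily trimmed stand-in for `kubectl get pod <name> -o json`
# (an assumption for illustration, not real cluster output).
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "master",
        "restartCount": 42,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 0}}
      }
    ]
  }
}
""")

print(last_termination_reasons(sample_pod))  # {'master': 'OOMKilled'}
```

A container that has never been terminated simply maps to `None`, so the same helper can be run across many pods to spot the ones that were OOM-killed.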
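- Remove the requests and limits.
- Run the pod and make sure it does not restart.
- If you already have Prometheus, run the VPA Recommender to check how many resources it needs. Or just use any monitoring stack (GKE Prometheus, prometheus-operator, DataDog, etc.) to check the actual resource consumption and adjust the limits accordingly.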
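Once real usage numbers are available, turning them into a limit is simple arithmetic. The following is a minimal sketch under stated assumptions, not a recommendation from the answer: the function name `suggest_memory_limit_mi`, the 30% headroom, and the sample values are all hypothetical. It takes the peak observed memory usage and adds a safety margin, which for Redis also has to cover the extra copy-on-write memory used while a background save is running.

```python
def suggest_memory_limit_mi(samples_mi, headroom_percent=30):
    """Return a memory limit in Mi: peak observed usage plus headroom, rounded up.

    The 30% default is an arbitrary illustrative margin, not a Kubernetes
    or Redis rule.
    """
    if not samples_mi:
        raise ValueError("need at least one usage sample")
    peak = max(samples_mi)
    # Integer ceiling division avoids floating-point rounding surprises.
    return (peak * (100 + headroom_percent) + 99) // 100


# Hypothetical memory usage samples in Mi, e.g. read from a monitoring dashboard.
observed_mi = [180, 210, 240, 260]

print(f"{suggest_memory_limit_mi(observed_mi)}Mi")  # prints 338Mi
```

The resulting value would go into the container's `resources.limits.memory`; setting the request to the same value keeps the pod in the Guaranteed QoS class shown in the description above.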