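redis-cluster - Redis Sentinel does not escalate SDOWN to ODOWN
Problem description
I need help understanding what is going wrong.
I have deployed Redis in a Kubernetes environment with 1 master, 2 replicas (slaves), and 3 Sentinels. I am using the Redis 6.2.3 alpine image. Each Redis and Sentinel instance runs in its own pod.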
NAME         READY   STATUS    RESTARTS   AGE   IP              NODE   NOMINATED NODE   READINESS GATES
redis-0      1/1     Running   0          31m   10.233.64.143   vm1    <none>           <none>
redis-1      1/1     Running   0          34m   10.233.64.90    vm1    <none>           <none>
redis-2      1/1     Running   0          34m   10.233.64.40    vm1    <none>           <none>
sentinel-0   1/1     Running   0          34m   10.233.64.93    vm1    <none>           <none>
sentinel-1   1/1     Running   0          34m   10.233.64.35    vm1    <none>           <none>
sentinel-2   1/1     Running   0          34m   10.233.64.34    vm1    <none>           <none>
I have also created headless services for the Redis and Sentinel pods, so that I can reach a specific pod behind each service.
[root@master-1 ~]# kubectl describe svc sentinel -n ankit
Name: sentinel
Namespace: ankit
Labels: <none>
Annotations: <none>
Selector: app=sentinel
Type: ClusterIP
IP: None
Port: sentinel 5000/TCP
TargetPort: 5000/TCP
Endpoints: 10.233.64.34:5000,10.233.64.35:5000,10.233.64.93:5000
Session Affinity: None
[root@master-1 ~]# kubectl describe svc redis -n ankit
Name: redis
Namespace: ankit
Labels: <none>
Annotations: <none>
Selector: app=redis
Type: ClusterIP
IP: None
Port: redis 6379/TCP
TargetPort: 6379/TCP
Endpoints: 10.233.64.143:6379,10.233.64.40:6379,10.233.64.90:6379
Session Affinity: None
Events: <none>
[root@master-1 ~]#
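For reference, the hostname used later in the Sentinel configuration follows the usual per-pod DNS pattern for a StatefulSet behind a headless service. The helper below is only a sketch of how that name is composed; `pod_fqdn` is a hypothetical function, and the `cluster.local` cluster domain is assumed from the hostname that appears in the logs.

```python
def pod_fqdn(pod, service, namespace, cluster_domain="cluster.local"):
    """Build the per-pod DNS name exposed through a headless service.

    Assumes the conventional <pod>.<service>.<namespace>.svc.<cluster-domain> layout.
    """
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


# Names for the three Redis pods in the "ankit" namespace (pod names taken from the listing above).
redis_hosts = [pod_fqdn(f"redis-{i}", "redis", "ankit") for i in range(3)]
print(redis_hosts[0])
```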
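When deploying the Redis StatefulSet pods, I added logic to the init container in the Redis YAML so that the redis-0 pod becomes the master by default. All pods start and run fine, and all three Sentinels can connect to the master and to the other Sentinels. However, when I delete the Redis master pod, all three Sentinels log an SDOWN event, but it is never escalated to an ODOWN event, so no failover takes place. When redis-0 then comes back up as a replica, the Sentinels cannot elect a new master, and the cluster is left in an error state with no master.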
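To make the SDOWN/ODOWN distinction concrete, here is a deliberately simplified model of the quorum check. It is not Sentinel's actual implementation: `PeerReply` and `is_objectively_down` are hypothetical names, and the real check also requires that peer replies are recent, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerReply:
    """One peer's answer to SENTINEL is-master-down-by-addr (simplified)."""
    sentinel_id: str
    master_down: bool


def is_objectively_down(local_sdown: bool, peer_replies: List[PeerReply], quorum: int) -> bool:
    """Return True when enough Sentinels agree that the master is down.

    The local Sentinel's own SDOWN view counts as one vote; every peer that
    answered "down" adds one more. ODOWN needs at least `quorum` votes.
    """
    if not local_sdown:
        return False
    votes = 1 + sum(1 for reply in peer_replies if reply.master_down)
    return votes >= quorum


# quorum is 2 in this setup ("sentinel monitor mymaster ... 2")
print(is_objectively_down(True, [PeerReply("s1", True), PeerReply("s2", False)], quorum=2))
print(is_objectively_down(True, [], quorum=2))
```

In this model, a Sentinel that sees the master as down but receives no usable confirmation from its peers stays at SDOWN, which is the state each of the three Sentinels reports here.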
sentinel-0 log after deleting the Redis master pod:
1:X 15 Oct 2021 02:13:52.155 * +fix-slave-config slave 10.233.64.40:6379 10.233.64.40 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:52.322 * +fix-slave-config slave 10.233.64.90:6379 10.233.64.90 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:53.194 * +fix-slave-config slave 10.233.64.40:6379 10.233.64.40 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:53.338 * +fix-slave-config slave 10.233.64.90:6379 10.233.64.90 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:54.203 * +fix-slave-config slave 10.233.64.40:6379 10.233.64.40 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:54.399 * +fix-slave-config slave 10.233.64.90:6379 10.233.64.90 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:54.635 # +sdown master mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:14:00.040 - Accepted 10.233.64.143:33288
1:X 15 Oct 2021 02:14:00.047 - Client closed connection
sentinel-1 log after deleting the Redis master pod:
1:X 15 Oct 2021 02:11:10.200 . Rewritten config file (/etc/redis/sentinel.conf) successfully
1:X 15 Oct 2021 02:13:54.600 # +sdown master mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:14:00.054 - Accepted 10.233.64.143:48550
1:X 15 Oct 2021 02:14:00.055 - Client closed connection
sentinel-2 log after deleting the Redis master pod:
1:X 15 Oct 2021 02:11:09.858 . Rewritten config file (/etc/redis/sentinel.conf) successfully
1:X 15 Oct 2021 02:11:10.244 - Accepted 10.233.64.93:35181
1:X 15 Oct 2021 02:11:10.264 - Accepted 10.233.64.35:56403
1:X 15 Oct 2021 02:13:54.636 # +sdown master mymaster redis-0.redis.ankit.svc.cluster.local 6379
As the logs show, the SDOWN event is never escalated to ODOWN, so no failover follows.
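When comparing logs from several Sentinels, it can help to count the events mechanically rather than by eye. The snippet below is a small sketch that tallies Sentinel event tokens such as `+sdown` and `+odown`; the regular expression is an assumption based on the line format shown in the logs above, and `count_events` is a hypothetical helper.

```python
import re
from collections import Counter

# Line shape assumed from the logs above:
#   <pid>:<role> <day> <month> <year> <time> <level-marker> <+event|-event> ...
EVENT_RE = re.compile(
    r"^\d+:[A-Z] \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3} \S (?P<event>[+-][a-z-]+) "
)

SAMPLE_LOG = """\
1:X 15 Oct 2021 02:13:54.203 * +fix-slave-config slave 10.233.64.40:6379 10.233.64.40 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:54.399 * +fix-slave-config slave 10.233.64.90:6379 10.233.64.90 6379 @ mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:13:54.635 # +sdown master mymaster redis-0.redis.ankit.svc.cluster.local 6379
1:X 15 Oct 2021 02:14:00.040 - Accepted 10.233.64.143:33288
1:X 15 Oct 2021 02:14:00.047 - Client closed connection
"""


def count_events(log_text):
    """Count Sentinel event tokens; lines without an event token are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            counts[match.group("event")] += 1
    return counts


events = count_events(SAMPLE_LOG)
print("sdown:", events["+sdown"], "odown:", events["+odown"])
```

Running the same function over each Sentinel's full log gives a quick way to confirm that `+odown` is absent on all three.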
The Redis and Sentinel configuration files are attached below.
Redis configuration file:
masterauth password
requirepass password
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 511
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
tcp-keepalive 300
daemonize no
supervised no
pidfile "/var/run/redis_6379.pid"
loglevel debug
logfile ""
databases 16
always-show-logo yes
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
rdb-del-sync-files no
dir "/data"
replica-serve-stale-data yes
replica-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-diskless-load disabled
repl-disable-tcp-nodelay no
replica-priority 100
acllog-max-len 128
maxclients 9000
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no
lazyfree-lazy-user-del no
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble yes
lua-time-limit 5000
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4kb
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
# Jemalloc background thread for purging will be enabled by default
jemalloc-bg-thread yes
slaveof redis-0.redis.ankit.svc.cluster.local 6379
Sentinel configuration file:
port 5000
daemonize no
protected-mode no
bind 0.0.0.0
acllog-max-len 128
sentinel deny-scripts-reconfig yes
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis-0.redis.ankit.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 4000
sentinel failover-timeout mymaster 2000
sentinel auth-pass mymaster password
maxclients 9000
loglevel debug
# Generated by CONFIG REWRITE
user default on nopass ~* &* +@all
dir "/data"
sentinel myid c7d1f666d94b7ab0a05701c83ccd1246d2628ca1
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
sentinel known-replica mymaster 10.233.64.90 6379
sentinel known-replica mymaster 10.233.64.40 6379
sentinel known-sentinel mymaster 10.233.64.34 5000 6e5e0ecf8551c21b543815c966a19a54809677c4
sentinel known-sentinel mymaster 10.233.64.35 5000 3f4493c38c5514d76f2eb698aed9c0b6ba550be9
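The rewritten Sentinel configuration can also be checked programmatically, for example to confirm that the quorum is reachable with the number of Sentinels the instance knows about. This is a minimal sketch that only understands the three directives it needs; `summarize_sentinel_conf` is a hypothetical helper, and the sample text is copied from the configuration above.

```python
SAMPLE_CONF = """\
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis-0.redis.ankit.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 4000
# Generated by CONFIG REWRITE
sentinel known-replica mymaster 10.233.64.90 6379
sentinel known-replica mymaster 10.233.64.40 6379
sentinel known-sentinel mymaster 10.233.64.34 5000 6e5e0ecf8551c21b543815c966a19a54809677c4
sentinel known-sentinel mymaster 10.233.64.35 5000 3f4493c38c5514d76f2eb698aed9c0b6ba550be9
"""


def summarize_sentinel_conf(conf_text):
    """Extract the monitored master, quorum, and known peers from sentinel.conf text."""
    summary = {"master": None, "quorum": None, "known_sentinels": [], "known_replicas": []}
    for raw in conf_text.splitlines():
        parts = raw.split()
        # Skip blank lines, comments, and non-sentinel directives.
        if len(parts) < 2 or parts[0] != "sentinel":
            continue
        directive, args = parts[1], parts[2:]
        if directive == "monitor" and len(args) == 4:
            name, host, port, quorum = args
            summary["master"] = (name, host, int(port))
            summary["quorum"] = int(quorum)
        elif directive == "known-sentinel" and len(args) == 4:
            summary["known_sentinels"].append((args[1], int(args[2])))
        elif directive == "known-replica" and len(args) == 3:
            summary["known_replicas"].append((args[1], int(args[2])))
    return summary


info = summarize_sentinel_conf(SAMPLE_CONF)
total_sentinels = len(info["known_sentinels"]) + 1  # peers plus this instance
print(info["master"], "quorum", info["quorum"], "of", total_sentinels)
```

For this file the quorum of 2 is reachable with 3 Sentinels. The summary also makes it easy to see that the master is recorded by hostname while the known replicas and Sentinels are recorded by IP address.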