Intermittent 502 Bad Gateway errors in Kubernetes pods

Problem description

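We run Kubernetes on AWS, deployed with kops. We use Nginx as our ingress controller, and it worked fine for almost 2 years. Recently, however, we started getting random 502 Bad Gateway errors across multiple pods.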

The ingress logs show 502:

[23/Sep/2021:10:53:43 +0000] "GET /service HTTP/2.0" 502 559 "https://mydomain/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36" 4691 0.040 [default-myservice-80] 100.96.13.157:80, 100.96.13.157:80, 100.96.13.157:80 0, 0, 0 0.000, 0.000, 0.000 502, 502, 502 258a09eaaddef85cae2a0c2f706ce06b
..
[error] 1050#1050: *1352377 connect() failed (111: Connection refused) while connecting to upstream, client: CLIENT_IP_HERE , server: my.domain.com , request: "GET /index.html HTTP/2.0", upstream: "http://POD_IP:8080/index.html", host: "my.domain.com", referrer: "https://my.domain/index.html"
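
The access log line above lists three upstream attempts, all to the same pod address and all ending in 502. A rough sketch of how those trailing fields could be pulled out of many such lines follows. It assumes the field order of the default ingress-nginx access log format (upstream name, upstream addresses, response lengths, response times, statuses, request ID); `upstream_attempts` is a hypothetical helper name, not part of any tool.

```python
import re

# One comma-separated list of values, e.g. "502, 502, 502".
_LIST = r"[^\s,]+(?:, [^\s,]+)*"

# Assumed field order (default ingress-nginx access log format):
# [upstream name] upstream_addr upstream_response_length
# upstream_response_time upstream_status request_id
UPSTREAM_RE = re.compile(
    r"\[(?P<name>[^\]\s]+)\] "
    r"(?:\[[^\]]*\] )?"  # some versions log an extra bracketed field here
    rf"(?P<addr>{_LIST}) (?P<length>{_LIST}) "
    rf"(?P<time>{_LIST}) (?P<status>{_LIST}) (?P<req_id>\S+)\s*$"
)


def upstream_attempts(line):
    """Return (upstream_name, [(addr, status), ...]) or None if no match."""
    match = UPSTREAM_RE.search(line)
    if match is None:
        return None
    addrs = match.group("addr").split(", ")
    statuses = match.group("status").split(", ")
    return match.group("name"), list(zip(addrs, statuses))
```

Run over a larger log sample, this could help show whether the 502s cluster on particular pod IPs or are spread across all backends of the service.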

From the ingress pod, we tried connecting to the IP of the pod that returns 502:

www-data@nginx-ingress-controller-664f488479-7cp57:/etc/nginx$ curl 100.96.13.157
curl: (7) Failed to connect to 100.96.13.157 port 80: Connection refused

It shows connection refused.
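
Because the failure is intermittent, a single curl only captures one moment. A minimal sketch of a repeated TCP connect check is shown below, using only the Python standard library. The helper names `probe` and `watch` are hypothetical, and the sketch assumes it runs somewhere that can reach the pod network, such as the ingress pod or the node.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Attempt one TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def watch(host, port, attempts=10, interval=1.0):
    """Probe repeatedly and return a list of (timestamp, outcome) pairs."""
    results = []
    for i in range(attempts):
        results.append((time.strftime("%H:%M:%S"), probe(host, port)))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Something like `watch("100.96.13.157", 80, attempts=60)` would then give a timeline of when the pod accepts connections and when it refuses them, which can be lined up against pod restarts or readiness changes.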

We captured traffic with tcpdump on the node hosting the pod that returns 502:

root@node-ip:/home/admin# tcpdump -i cbr0 dst 100.96.13.157
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:39:16.779950 ARP, Request who-has 100.96.13.157 tell 100.96.13.22, length 28
17:39:16.780207 IP 100.96.13.22.57610 > 100.96.13.157.http: Flags [S], seq 2263585697, win 26883, options [mss 8961,sackOK,TS val 1581767928 ecr 0,nop,wscale 9], length 0
17:39:21.932839 ARP, Reply 100.96.13.22 is-at 0a:58:64:60:0d:16 (oui Unknown), length 28


root@node-ip:/home/admin# ping 100.96.13.157
PING 100.96.13.157 (100.96.13.157) 56(84) bytes of data.
64 bytes from 100.96.13.157: icmp_seq=1 ttl=64 time=0.309 ms
64 bytes from 100.96.13.157: icmp_seq=2 ttl=64 time=0.042 ms
64 bytes from 100.96.13.157: icmp_seq=3 ttl=64 time=0.044 ms

It looks like the pods can reach each other, and ping works:

root@node-ip:/home/admin# tcpdump -i cbr0 src 100.96.13.157
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:39:16.780076 ARP, Reply 100.96.13.157 is-at 0a:58:64:60:0d:9d (oui Unknown), length 28
17:39:16.780175 ARP, Reply 100.96.13.157 is-at 0a:58:64:60:0d:9d (oui Unknown), length 28
17:39:16.780238 IP 100.96.13.157.http > 100.96.13.22.57610: Flags [R.], seq 0, ack 2263585698, win 0, length 0
17:39:21.932808 ARP, Request who-has 100.96.13.22 tell 100.96.13.157, length 28

Here the ingress sends its request, but the connection is reset (Flags [R.] in the tcpdump output means RST-ACK) and the HTTP request is lost.
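
To make the flag reading concrete, here is a small sketch that decodes the `Flags [...]` field of a tcpdump line and checks whether a RST-ACK acknowledges a given SYN. The letter-to-flag mapping follows the tcpdump man page; `decode_flags` and `rst_answers_syn` are hypothetical helper names. A RST-ACK sent straight back to a SYN usually means nothing was accepting connections on that port at that moment, which would match the "Connection refused" seen from curl, although that reading is an assumption rather than something the capture proves.

```python
import re

# Flag letters as tcpdump prints them inside "Flags [...]".
TCPDUMP_FLAGS = {
    "S": "SYN", "F": "FIN", "P": "PSH", "R": "RST",
    "U": "URG", "W": "CWR", "E": "ECE", ".": "ACK",
}

_FLAGS_RE = re.compile(r"Flags \[([^\]]*)\]")
_SEQ_RE = re.compile(r"\bseq (\d+)")
_ACK_RE = re.compile(r"\back (\d+)")


def decode_flags(line):
    """Return the TCP flag names found in one tcpdump output line."""
    match = _FLAGS_RE.search(line)
    if match is None or match.group(1) == "none":
        return []
    return [TCPDUMP_FLAGS.get(ch, "?" + ch) for ch in match.group(1)]


def rst_answers_syn(syn_line, rst_line):
    """True if rst_line is a RST-ACK acknowledging the SYN in syn_line."""
    if decode_flags(syn_line) != ["SYN"]:
        return False
    if set(decode_flags(rst_line)) != {"RST", "ACK"}:
        return False
    seq = _SEQ_RE.search(syn_line)
    ack = _ACK_RE.search(rst_line)
    if seq is None or ack is None:
        return False
    # A SYN consumes one sequence number, so the reply acknowledges seq + 1.
    return int(ack.group(1)) == (int(seq.group(1)) + 1) % 2**32
```

Applied to the two captures above, the SYN with seq 2263585697 and the reset with ack 2263585698 pair up, so the reset is the direct reply to the ingress controller's connection attempt.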

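We don't know where this connection is being lost. We checked our service and pod labels, and everything is configured correctly. Most of the time my.domain.com is reachable, and the problem appears to be intermittent. Is there anywhere else we should look for logs? Or has anyone run into the same issue? Thanks in advance.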

Tags: kubernetes, kube-proxy

Solution

