RabbitMQ node dies behind an Nginx proxy server

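Problem description

We have a RabbitMQ cluster that was set up by someone who is no longer here, and I know too little about it to troubleshoot it. The only way I know it is failing is through the error alerts from the upstart services that use the queues.

This is what I know so far:

On the proxy server (Nginx 1.10.3, Ubuntu 16.04):

In /etc/nginx/nginx.conf, I have: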

stream {

    # debug|info|notice|warn|error|crit|alert|emerg
    error_log  /var/log/nginx/stream_error.log info;

    server {
        listen 192.168.70.11:5672 so_keepalive=on;
        proxy_pass rabbitmq_backend;

    }

    upstream rabbitmq_backend {
        server services-01:5672;
        #server services-00:5672;   <--- Commented this out 
    }

}

On the server that runs the upstart services, in one of the logs, e.g. /var/log/upstart/<service-name>.log:

2021-08-21 05:33:10NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
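
When a log like the one above repeats the same error many times, it can help to collapse it into one line per affected node, queue, and vhost. The following is only a minimal sketch, assuming the log lines keep exactly the wording shown above; the function name summarize_home_node_errors is hypothetical and not part of RabbitMQ or upstart.

```shell
# Hypothetical helper: summarize "home node ... is down or inaccessible"
# errors read from stdin. Prints "<node> <queue> <vhost> <count>" for each
# distinct combination; lines that do not match the pattern are ignored.
summarize_home_node_errors() {
    sed -n "s/.*home node '\([^']*\)' of durable queue '\([^']*\)' in vhost '\([^']*\)' is down or inaccessible.*/\1 \2 \3/p" \
        | sort \
        | uniq -c \
        | awk '{print $2, $3, $4, $1}'
}
```

It would be fed the service log on stdin, for example by redirecting the file under /var/log/upstart/ into the function.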

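On the RabbitMQ dashboard, I see:

RabbitMQ node not running

At the top of the dashboard, I see:

RabbitMQ network partition detected

The two cluster nodes run in docker containers on two separate servers (services-00 and services-01). On the node reporting the problem, inside the container, I found the command rabbitmqctl cluster_status. It gave me: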

root@rabbit-services-00:/# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rabbit-services-00' ...
[{nodes,[{disc,['rabbit@rabbit-services-00','rabbit@rabbit-services-01']}]},
 {running_nodes,['rabbit@rabbit-services-00']},
 {cluster_name,<<"rabbit@rabbit-services-01">>},
 {partitions,[{'rabbit@rabbit-services-00',['rabbit@rabbit-services-01']}]},
 {alarms,[{'rabbit@rabbit-services-00',[]}]}]
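
One way to read the output above: nodes lists both cluster members, running_nodes lists only rabbit@rabbit-services-00, and the partitions entry says that rabbit@rabbit-services-00 has seen a partition from rabbit@rabbit-services-01. In other words, each side considers the other one down. If cert_collect_info is a non-mirrored queue (an assumption; the question does not say), it lives only on its home node, which would explain why pointing Nginx at services-01 alone did not help. The sketch below pulls the "member but not running" nodes out of that output. It assumes the Erlang-term format shown above, and the function name down_nodes is hypothetical.

```shell
# Hypothetical helper: print cluster members that are missing from
# running_nodes. Reads the Erlang-term style output of
# `rabbitmqctl cluster_status` (the format shown above) on stdin.
# The body runs in a subshell so its variables do not leak.
down_nodes() (
    # Join wrapped lines so each {key,[...]} term sits on one line.
    status=$(tr -d '\n')

    # Quoted node names inside {nodes,[...]}.
    all=$(printf '%s\n' "$status" \
        | sed -n 's/.*{nodes,\[\(.*\)\]},[[:space:]]*{running_nodes.*/\1/p' \
        | grep -o "'[^']*'" | sort -u)

    # Quoted node names inside {running_nodes,[...]}.
    running=$(printf '%s\n' "$status" \
        | sed -n 's/.*{running_nodes,\[\([^]]*\)\]}.*/\1/p' \
        | grep -o "'[^']*'" | sort -u)

    # Keep members that are not listed as running, then strip the quotes.
    printf '%s\n' "$all" | grep -vxF -e "$running" | tr -d "'"
)
```

It could be fed from the container, for example by piping the output of docker exec rabbit-services-00 rabbitmqctl cluster_status into down_nodes.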

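But I'm not sure how to interpret it, and I have to try to get this running, at least for the short term. Any help would be appreciated. I have been working on this since 1 AM and I'm out of options. I thought commenting out the problem node in the Nginx config would help, but it didn't. The upstart services keep dying when I restart them.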

Tags: docker, nginx, rabbitmq, upstart

Solution

I traced it to a network switch blip (thanks, Nagios). I ran the following against the container:

docker exec rabbit-services-00 rabbitmqctl stop_app

docker exec rabbit-services-00 rabbitmqctl start_app

Then I rechecked the queues, and they were working again. It looks like it was just a brief network outage.
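
The stop_app / start_app pair restarts only the RabbitMQ application inside the running Erlang node, which lets the node rejoin its peer once the network is back. The same sequence can be wrapped in a small script that stops at the first failing step and prints the cluster status afterwards. This is a sketch, not a hardened recovery tool: the function names rabbit_ctl and restart_rabbit_app are hypothetical, and it assumes docker exec can reach the container by name as in the commands above.

```shell
# Hypothetical helper: run rabbitmqctl inside the named container.
# The body runs in a subshell so `container` and `shift` stay local.
rabbit_ctl() (
    container=$1
    shift
    docker exec "$container" rabbitmqctl "$@"
)

# Hypothetical helper: restart only the RabbitMQ application (the Erlang
# VM keeps running), stop at the first failing step, then print the
# cluster status so running_nodes and partitions can be checked by eye.
restart_rabbit_app() {
    rabbit_ctl "$1" stop_app &&
    rabbit_ctl "$1" start_app &&
    rabbit_ctl "$1" cluster_status
}
```

For the node in this question the call would be restart_rabbit_app rabbit-services-00. After a successful recovery, running_nodes should list both nodes and partitions should be empty.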

