Nagios event handler ignores check interval

Problem description

I recently created an event handler for a service check that restarts Tomcat on 3 different boxes.

The check is set up as follows:

5 check attempts

Checked every 2 minutes while OK

Checked every 5 minutes otherwise

In the event handler script, I have:

# What state is the iOS PN in?
case "$1" in
OK)
        # The service is ok, so don't do anything...
        ;;
WARNING)
        # Is this a "soft" or a "hard" state?
        case "$2" in
                SOFT)
                        case "$3" in
                                #Check number
                                2)
                                        echo "`date` Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)..." >> /tmp/iOSPN.log
                                ;;
                                3)
                                        echo "`date` Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)..." >> /tmp/iOSPN.log
                                ;;
                                4)
                                        echo "`date` Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)..." >> /tmp/iOSPN.log
                                ;;
                        esac
                        ;;
                HARD)
                        # Do nothing let Nagios send alert
                        ;;
                esac
        ;;
CRITICAL)
        # In theory nothing should reach this point...
        ;;
esac
exit 0
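
The handler above only writes to the log on soft attempts 2 to 4, so the log cannot show every call Nagios makes. As a minimal debugging sketch, a helper like the one below could record each invocation with the three macro values it received. The function name `log_invocation` and the message format are hypothetical and not part of the original script.

```shell
# Hypothetical helper: append one line per handler invocation so that every
# call Nagios makes is visible, not only soft attempts 2-4.
# Arguments: state, state type, attempt number, log file path.
log_invocation() {
    echo "$(date) handler called: state=$1 type=$2 attempt=$3" >> "$4"
}

# Example call near the top of the handler script:
# log_invocation "$1" "$2" "$3" /tmp/iOSPN.log
```

Comparing these entries with the Nagios log would show whether the handler is really being called once per scheduled check.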

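So the event handler should restart Tomcat on node 1 after the second warning check, wait 5 minutes before checking again, restart node 2 if the problem persists, then wait another 5 minutes, check again, and restart node 3 if the problem still persists.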
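
The expected timing can be written out as simple arithmetic. This is only a sketch of the schedule I expect, assuming `interval_length` in nagios.cfg is at its default of 60 seconds (so intervals are in minutes) and that every retry is spaced by the 5-minute retry interval. The function name `expected_offset_min` is made up for illustration.

```shell
# Offset, in minutes after the first non-OK result, at which a given check
# attempt is expected to run.
# Assumes interval_length=60 and a constant retry interval between attempts.
expected_offset_min() {
    attempt=$1
    retry_interval=$2
    echo $(( (attempt - 1) * retry_interval ))
}

# Expected schedule for attempts 2-5 with a 5-minute retry interval.
for attempt in 2 3 4 5; do
    echo "attempt $attempt: +$(expected_offset_min "$attempt" 5) min"
done
```

Under those assumptions, attempts 2, 3 and 4 should run 5, 10 and 15 minutes after the first warning, and attempt 5 at 20 minutes would be the hard state.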

However, when I check the log file, I see the following:

Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...

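As you can see, it restarts each box after 10 seconds rather than 5 minutes. I have removed the lines that actually call the Tomcat restart, because a restart cannot complete in such a short time.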
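
To measure the gaps rather than read them by eye, a small sketch like the following prints the number of seconds between consecutive log entries. It assumes the fourth whitespace-separated field of each line is an HH:MM:SS time, as in the log lines shown, and that all entries fall on the same day. The function name `print_gaps` is hypothetical.

```shell
# Print the seconds elapsed between consecutive entries in the handler log.
# Assumes field 4 of every line is HH:MM:SS and all entries share one day.
print_gaps() {
    awk '{
        split($4, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) print now - prev
        prev = now
    }' "$1"
}

# Example: print_gaps /tmp/iOSPN.log
```

For the three log lines shown, this prints 10 twice.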

I can't see anything in the Nagios log that explains why it runs the next check so quickly, so any help would be appreciated.

Additional information:

Here is the service definition:

define service{
        use                     5check-service
        host_name               ACTIVEMQ1
        contact_groups          tyrell-admins-non-critical
        service_description     ActiveMQ - iOS PushNotification Queue Pending Items
        event_handler           restartRemote_Tomcat!$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
        check_command           check_activemq_queue_item2!http://activemq1:8161/admin/xml/queues.jsp!IosPushNotificationQueue!100!300
        }

define service{
        name                            5check-service      ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        failure_prediction_enabled      1                       ; Failure prediction is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
        normal_check_interval           2                       ; Check the service every 2 minutes under normal conditions
        retry_check_interval            5                       ; Re-check the service every 5 minutes until a hard state can be determined
        contact_groups                  support                 ; Notifications get sent out to everyone in the 'support' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           5                       ; Re-notify about service problems every 5 mins
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

Tags: event-handling, nagios

Solution

