event-handling - Nagios event handler ignores check interval
Problem description
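I recently created an event handler for a service check that restarts Tomcat on 3 different boxes.
The check is set up with:
5 check attempts
a check every 2 minutes when OK
a check every 5 minutes otherwise
In the event handler script I have: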
#!/bin/bash
# Arguments, as passed by the service's event_handler definition:
# $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$

# What state is the iOS PN in?
case "$1" in
    OK)
        # The service is OK, so don't do anything...
        ;;
    WARNING)
        # Is this a "soft" or a "hard" state?
        case "$2" in
            SOFT)
                # Check (attempt) number
                case "$3" in
                    2)
                        echo "$(date) Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)..." >> /tmp/iOSPN.log
                        ;;
                    3)
                        echo "$(date) Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)..." >> /tmp/iOSPN.log
                        ;;
                    4)
                        echo "$(date) Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)..." >> /tmp/iOSPN.log
                        ;;
                esac
                ;;
            HARD)
                # Do nothing; let Nagios send the alert
                ;;
        esac
        ;;
    CRITICAL)
        # In theory nothing should reach this point...
        ;;
esac
exit 0
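So the event handler should restart Tomcat on node 1 after the second warning check, wait 5 minutes before checking again, restart node 2 if the problem persists, then wait another 5 minutes, check again, and restart node 3 if the problem is still there.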
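To make the expected spacing concrete, here is a small sketch that prints when each SOFT retry should run. It assumes attempt 1 (the first non-OK result) happened at a known time, that every retry runs exactly retry_seconds after the previous one, and it ignores check latency. With retry_check_interval 5 and Nagios' default interval_length of 60 seconds, that spacing would be 300 seconds. The function name expected_attempt_times is hypothetical.

```shell
#!/bin/sh
# Print "<attempt> <expected time in seconds>" for SOFT retry attempts 2..max_attempts.
# Assumes attempt 1 happened at first_failure and that each retry runs exactly
# retry_seconds after the previous one (check latency is ignored).
expected_attempt_times() {
    first_failure="$1"
    retry_seconds="$2"
    max_attempts="$3"
    attempt=2
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "$attempt $((first_failure + (attempt - 1) * retry_seconds))"
        attempt=$((attempt + 1))
    done
}

# Attempt 1 at t=0, 300-second retries, max_check_attempts 5.
expected_attempt_times 0 300 5
```

Under these assumptions the handler's attempt 2, 3 and 4 branches would fire at roughly t=300, 600 and 900 seconds, and attempt 5 would be the HARD state.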
However, when I check the log file, I see the following:
Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...
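As you can see, it restarts each box 10 seconds apart instead of 5 minutes apart. I have removed the lines that actually call the Tomcat restart, because a restart cannot complete in such a short time.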
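To quantify the spacing across more entries than the three above, a minimal sketch that prints the gap in seconds between consecutive lines of the handler log. It assumes each line starts with date's default output, so the fourth whitespace-separated field is HH:MM:SS, and that all entries fall on the same day; log_gaps is a hypothetical helper name.

```shell
#!/bin/sh
# Print the number of seconds between consecutive entries of a handler log.
# Assumes field 4 of every line is HH:MM:SS (default `date` output) and that
# all entries fall on the same day (midnight rollover is not handled).
log_gaps() {
    awk '{
        split($4, t, ":")
        s = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) print s - prev
        prev = s
    }' "$1"
}

# Example: log_gaps /tmp/iOSPN.log
```

For the three log lines shown above it prints 10 twice, against an expected gap of about 300 seconds.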
I can't see anything in the Nagios log that explains why it runs the next check so quickly, so any help would be appreciated.
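While the cause of the fast retries is unclear, one defensive option is to make the handler itself refuse a second restart within a minimum gap. This is a minimal sketch, assuming a timestamp file under /tmp is acceptable; guarded_restart, restart_allowed, STAMP_FILE, LOG_FILE and MIN_GAP are hypothetical names, not Nagios settings, and it does not fix the scheduling itself.

```shell
#!/bin/sh
# Hypothetical guard against restarting nodes back to back.
STAMP_FILE="${STAMP_FILE:-/tmp/iOSPN.last_restart}"
LOG_FILE="${LOG_FILE:-/tmp/iOSPN.log}"
MIN_GAP="${MIN_GAP:-300}"   # seconds that must pass between two restarts

# Succeeds if no restart has been recorded within the last MIN_GAP seconds.
restart_allowed() {
    [ -f "$STAMP_FILE" ] || return 0
    last=$(cat "$STAMP_FILE")
    [ -n "$last" ] || return 0
    now=$(date +%s)
    [ $((now - last)) -ge "$MIN_GAP" ]
}

# Logs and records a restart of the given node, or logs a skip and fails.
guarded_restart() {
    node="$1"
    if restart_allowed; then
        echo "$(date) Restarting Tomcat on $node for iOS PN..." >> "$LOG_FILE"
        # The real restart command for "$node" would go here.
        date +%s > "$STAMP_FILE"
    else
        echo "$(date) Skipping restart on $node: last restart was under ${MIN_GAP}s ago" >> "$LOG_FILE"
        return 1
    fi
}
```

In the handler's 2), 3) and 4) branches the echo lines would become calls such as guarded_restart "Node 1". The trade-off is that if the attempts really do arrive 10 seconds apart, nodes 2 and 3 are simply skipped, so this only limits the damage while the scheduling issue is investigated.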
Additional information:
Here is the service definition and the template it uses:
define service{
    use                     5check-service
    host_name               ACTIVEMQ1
    contact_groups          tyrell-admins-non-critical
    service_description     ActiveMQ - iOS PushNotification Queue Pending Items
    event_handler           restartRemote_Tomcat!$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    check_command           check_activemq_queue_item2!http://activemq1:8161/admin/xml/queues.jsp!IosPushNotificationQueue!100!300
}
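The event_handler line passes all three macros inside a single !-delimited argument ($ARG1$), so the script only sees them as $1, $2 and $3 if the command definition (not shown here) expands $ARG1$ unquoted, for example a hypothetical command_line /path/to/restartRemote_Tomcat.sh $ARG1$. This sketch simulates that whitespace splitting; show_args is a hypothetical stand-in for the real handler.

```shell
#!/bin/sh
# Simulate how one macro-expanded argument reaches the handler script.
# show_args only reports what it received.
show_args() {
    echo "state=$1 type=$2 attempt=$3 count=$#"
}

ARG1="WARNING SOFT 2"   # what $ARG1$ might expand to at attempt 2

# Unquoted: split on whitespace into three positional parameters.
show_args $ARG1      # state=WARNING type=SOFT attempt=2 count=3

# Quoted: arrives as one parameter, so $2 and $3 are empty.
show_args "$ARG1"    # state=WARNING SOFT 2 type= attempt= count=1
```

An alternative that does not rely on word splitting is to pass each macro as its own argument (restartRemote_Tomcat!$SERVICESTATE$!$SERVICESTATETYPE$!$SERVICEATTEMPT$) and reference $ARG1$ $ARG2$ $ARG3$ in the command definition.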
define service{
    name                            5check-service  ; The 'name' of this service template
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 0               ; Default is to NOT check service 'freshness'
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           1               ; Service event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    failure_prediction_enabled      1               ; Failure prediction is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0               ; The service is not volatile
    check_period                    24x7            ; The service can be checked at any time of the day
    max_check_attempts              5               ; Re-check the service up to 5 times in order to determine its final (hard) state
    normal_check_interval           2               ; Check the service every 2 minutes under normal conditions
    retry_check_interval            5               ; Re-check the service every 5 minutes until a hard state can be determined
    contact_groups                  support         ; Notifications get sent out to everyone in the 'support' group
    notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           5               ; Re-notify about service problems every 5 minutes
    notification_period             24x7            ; Notifications can be sent out at any time
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}
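Nagios interprets normal_check_interval and retry_check_interval in units of interval_length, which defaults to 60 seconds in the main nagios.cfg, so retry_check_interval 5 means 300 seconds only if that setting is left at its default. A minimal sketch that reads interval_length from a main config file and converts a value to seconds; interval_seconds is a hypothetical helper and the nagios.cfg path in the example is only an assumption about a typical install.

```shell
#!/bin/sh
# Convert a Nagios interval (in "time units") to seconds, using the
# interval_length set in the given main config file. Falls back to the
# Nagios default of 60 seconds if the file or setting is missing.
interval_seconds() {
    units="$1"
    cfg="$2"
    length=""
    if [ -r "$cfg" ]; then
        # Take the last uncommented interval_length=N line, if any.
        length=$(sed -n 's/^[[:space:]]*interval_length[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1)
    fi
    [ -n "$length" ] || length=60
    echo $((units * length))
}

# Example (path is an assumption): interval_seconds 5 /usr/local/nagios/etc/nagios.cfg
```

If this reports something other than 300 for the retry interval on the Nagios server, the scheduling difference would come from the global time unit rather than from the service definition.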