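bash - Nagios event handler script in bash to restart services; do not restart the next service until the previous one is up
Problem description
Hi Stack Overflow community,
I need help with a bash script, as I am new to this. Here is what I want to accomplish: we have a Windows server that sometimes reaches 90% memory usage, and whenever Nagios catches that, we want to restart these services through NRPE. The services must be restarted in order: the first service has to be up before the next restart begins, and so on for each service in turn.
The other option is to stop all 4 services and then start them one after another.
Here is the script I wrote: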
case "$1" in
OK)
;;
WARNING)
;;
UNKNOWN)
;;
CRITICAL) ## DECISION ENGINE RESTART
echo -n "Restarting Decision Engine_1"
cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | mail -s "Restarting DE services" email@someteam.com -r Nagios@ATL-NM-01
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_1;
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_1 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_2"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_2
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_2 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_3"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_3
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_3 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_4"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_4
else
echo " Restart is complete"
fi
;;
esac
exit 0
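I am not sure where I made a mistake, and I would appreciate any feedback.
Thanks!
Solution
All comments are in the code. Double-check the StopService function: you did not say how you stop the services, so I modeled it on the restart call.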
#!/bin/bash
SERVICESTATE=$1        # Common check state (OK, WARNING, CRITICAL or UNKNOWN)
Host=$2                # Host name or IP
SERVICESTATETYPE=$3    # State type: HARD or SOFT
TimeOut=3              # Time (seconds) to wait for a service to start/stop
                       # before processing the next service.
                       # Do not make TimeOut unbounded: Nagios will kill
                       # this handler if it runs too long.
# Services is an array of service names
Services=(DecisionEngine_1 DecisionEngine_2 DecisionEngine_3 DecisionEngine_4)
# Add the Nagios plugins directory to PATH
PATH=$PATH:/usr/local/nagios/libexec
RestartService() {
    # Restarts a service via NRPE.
    # Usage: RestartService ServiceName
    echo -n " Restarting $1;"
    check_nrpe -H "${Host}" -p 5666 -c restart_service -a "$1" >/dev/null 2>&1
    return $?
}
StopService() {
    # Stops a service via NRPE.
    # Usage: StopService ServiceName
    echo -n " Stopping $1;"
    check_nrpe -H "${Host}" -p 5666 -c stop_service -a "$1" >/dev/null 2>&1
    return $?
}
ServiceWait() {
    # Repeatedly checks a service via NRPE until the wanted result
    # is seen or TimeOut is reached.
    # Usage: ServiceWait ServiceName {start|stop}
    # The start option waits for a successful check.
    # The stop option waits for a failed check.
    Logic=""
    [ "$2" == "start" ] && Logic="-eq"   # RC for a start check should be 0
    [ "$2" == "stop" ] && Logic="-ne"    # RC for a stop check should NOT be 0
    [ -z "$Logic" ] && { echo "ServiceWait function usage error"; exit 19; }
    t=${TimeOut}
    while [ "$t" -ge 0 ]; do
        check_nrpe -H "${Host}" -p 5666 -t 30 \
            -c check_service -a "$1" 'crit=not state_is_ok()' >/dev/null 2>&1
        RC=$?
        # Wanted state reached, no need to wait any longer
        [ "$RC" $Logic 0 ] && { echo -n "CheckRC=$RC;"; return $RC; }
        let t--
        sleep 1
    done
    echo -n "TimeOut; "
    return 3
}
# Make sure $1, $2 and $3 were all supplied
if [ -z "${SERVICESTATE}" ] || [ -z "${Host}" ] || [ -z "${SERVICESTATETYPE}" ]; then
    echo "Usage: $0 {OK|WARNING|UNKNOWN|CRITICAL} Hostname {SOFT|HARD}"
    exit 1
fi
case "${SERVICESTATE}" in
OK)
    ;;
WARNING)
    ;;
UNKNOWN)
    ;;
CRITICAL) ## DECISION ENGINE RESTART
    # Uncomment if you need the e-mail notification
    #cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | \
    # mail -s "Restarting DE services" email@someteam.com -r Nagios@ATL-NM-01
    RC=0
    if [ "$SERVICESTATETYPE" == "SOFT" ] ; then
        for (( i=0; i<${#Services[*]}; i++ )); do
            RestartService "${Services[$i]}"
            ServiceWait "${Services[$i]}" start
            RC=$?
            # If the check failed, do not attempt any further restarts
            [ "$RC" -ne 0 ] && break
            SuccessRestart+=("${Services[$i]}")
        done
        echo "Restart is complete. ${SuccessRestart[*]} Return Code is ${RC}"
    elif [ "$SERVICESTATETYPE" == "HARD" ] ; then
        # Stop all services sequentially.
        for (( i=0; i<${#Services[*]}; i++ )); do
            StopService "${Services[$i]}"
            # Experiment with what to wait for here. It may be better
            # to simply sleep for N seconds while the service stops
            # rather than polling the service state.
            ServiceWait "${Services[$i]}" stop
            #sleep $TimeOut
        done
        # Start all services sequentially.
        for (( i=0; i<${#Services[*]}; i++ )); do
            RestartService "${Services[$i]}"
            ServiceWait "${Services[$i]}" start
            RC=$?
            # If the check failed, do not attempt any further restarts
            [ "$RC" -ne 0 ] && break
            SuccessRestart+=("${Services[$i]}")
        done
    else
        echo "Unknown SERVICESTATETYPE $SERVICESTATETYPE option"
        exit 20
    fi
    ;;
esac
exit 0
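A note on the original script, with a small sketch. In the question, `if check_nrpe ... > OK:` does not compare the plugin output with the text OK: the `>` is a redirection, so the output is written to a file named `OK:` and `if` only tests the command's exit status. The nested `if` blocks are also missing their closing `fi`. ServiceWait above relies on the same exit-status idea, and the sketch below isolates it so it can be tried without a Nagios host. The names `wait_for_state` and `fake_check` are hypothetical and not part of the answer; `fake_check` is a stub standing in for the `check_nrpe` call. Unlike ServiceWait, this sketch returns 0 whenever the wanted state is reached, including for `down`.

```shell
#!/bin/bash
# Sketch: poll a command until it reports the wanted state or attempts run out.

# wait_for_state WANT ATTEMPTS DELAY CMD [ARGS...]
#   WANT=up   -> succeed as soon as CMD exits 0
#   WANT=down -> succeed as soon as CMD exits non-zero
# Returns 0 when the wanted state is seen, 3 on timeout, 2 on bad usage.
wait_for_state() {
    local want=$1 attempts=$2 delay=$3
    shift 3
    case "$want" in
        up|down) ;;
        *) echo "usage: wait_for_state {up|down} attempts delay cmd..." >&2
           return 2 ;;
    esac
    local i rc
    for (( i = 0; i < attempts; i++ )); do
        rc=0
        "$@" >/dev/null 2>&1 || rc=$?   # only the exit status matters
        if [ "$want" = up ] && [ "$rc" -eq 0 ]; then return 0; fi
        if [ "$want" = down ] && [ "$rc" -ne 0 ]; then return 0; fi
        sleep "$delay"
    done
    return 3
}

# Hypothetical stub: fails twice, then succeeds, imitating a service
# that needs a moment to come up. Replace with the real check command.
calls=0
fake_check() {
    calls=$(( calls + 1 ))
    [ "$calls" -ge 3 ]
}

wait_for_state up 5 0 fake_check
echo "rc=$? after $calls checks"   # prints: rc=0 after 3 checks
```

In a real handler the stub would be replaced by the `check_nrpe ... -c check_service` call used in ServiceWait, and the attempt count and delay would need to stay well inside the time Nagios allows an event handler to run.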