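kubernetes - Which Weave Net metrics should we alert on?
Question
Do the following alerts look right to alert on? At which values of these metrics should we raise an alert in order to monitor the health of Weave Net?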
- WeaveNoFastDP weave_flows[5m] > 0
- WeaveIPAMUnreachable weave_ipam_unreachable_percentage > 0
- WeaveIPAMPendingAllocates weave_ipam_pending_allocates > 0
- WeavePendingClaims weave_ipam_pending_claims > 0
- WeaveConnecTerm weave_connection_terminations_total > 300
Answer
I built Grafana dashboards on top of the Weave Net metrics. Here are the dashboards:
- WeaveNet https://grafana.com/grafana/dashboards/11789
- WeaveNet (Cluster) https://grafana.com/grafana/dashboards/11804
Below are the useful metrics to monitor for Weave Net. The alerts below are in JSON format.
{
"groups": [
{
"name": "nodeagent",
"rules": [
{
"alert": "UnhealthyNodes",
"expr": "changes(central_nodeagent:node_route_unhealthy_count[3m]) > 0",
"for": "1m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "Unhealthy nodes in the cluster. Go to prometheus the below prometheus link for details.",
"description": "Actionable: Find why the node(s) are unhealthy and fix it."
}
}
]
},
{
"name": "weave-net",
"rules": [
{
"alert": "WeaveNetIPAMSPlitBrain",
"expr": "max(weave_ipam_unreachable_percentage) - min(weave_ipam_unreachable_percentage) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNetIPAM has a split brain. Go to the below prometheus link for details.",
"description": "Actionable: Every node should see same unreachability percentage. Please check and fix why it is not so."
}
},
{
"alert": "WeaveNetIPAMUnreachable",
"expr": "weave_ipam_unreachable_percentage[10m] > 25",
"for": "10m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNetIPAM unreachability percentage is above threshold. Go to the below prometheus link for details.",
"description": "Actionable: Find why the unreachability threshold have increased from threshold and fix it. WeaveNet is responsible to keep it under control. Weave rm peer deployment can help clean things."
}
},
{
"alert": "WeaveNetIPAMPendingAllocates",
"expr": "sum(weave_ipam_pending_allocates) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet IPAM has pending allocates. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for IPAM allocates to be in pending state and fix it."
}
},
{
"alert": "WeaveNetIPAMPendingClaims",
"expr": "sum(weave_ipam_pending_claims) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet IPAM has pending claims. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for IPAM claims to be in pending state and fix it."
}
},
{
"alert": "WeaveNetFastDPFlowsLow",
"expr": "sum(weave_flows) < 15000",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet total FastDP flows is below threshold. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for fast dp flows dropping below the threshold."
}
},
{
"alert": "WeaveNetFastDPFlowsOff",
"expr": "sum(weave_flows == bool 0) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet FastDP flows is not happening in some or all nodes. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for fast dp being off."
}
},
{
"alert": "WeaveNetHighConnectionTerminationRate",
"expr": "rate(weave_connection_terminations_total[5m]) > 0.1",
"for": "5m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are getting terminated. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for high connection termination rate and fix it."
}
},
{
"alert": "WeaveNetConnectionsConnecting",
"expr": "sum(weave_connections{state='connecting'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in connecting state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsRetying",
"expr": "sum(weave_connections{state='retrying'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in retrying state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsPending",
"expr": "sum(weave_connections{state='pending'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in pending state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsFailed",
"expr": "sum(weave_connections{state='failed'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in failed state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
}
]
}
]
}
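Before loading a rule document like the one above, it can help to run a quick structural check over it. The following is a minimal Python sketch of such a check, not part of Weave Net or Prometheus. The `EXPECTED_KEYS` set is a local convention assumed here (Prometheus itself does not require all of those keys), the range-selector regex is only a rough heuristic, and the `SAMPLE` document with its `example_*` metric names is hypothetical. `promtool check rules` remains the authoritative validator.

```python
import json
import re

# Keys every rule in this sketch is expected to carry. This is a local
# convention (an assumption); Prometheus itself does not require all of them.
EXPECTED_KEYS = {"alert", "expr", "for", "labels", "annotations"}

# Heuristic: a range selector like metric[10m] followed directly by a
# comparison operator. PromQL comparisons need instant vectors or scalars.
RANGE_COMPARED = re.compile(r"\[\d+[smhdwy]\]\s*(?:[<>]=?|[=!]=)")


def check_rule_groups(doc):
    """Return a list of human-readable problems found in a rule document."""
    problems = []
    for group in doc.get("groups", []):
        group_name = group.get("name", "<unnamed group>")
        seen = set()
        for rule in group.get("rules", []):
            alert = rule.get("alert", "<unnamed alert>")
            where = f"{group_name}/{alert}"
            missing = EXPECTED_KEYS - rule.keys()
            if missing:
                problems.append(f"{where}: missing keys {sorted(missing)}")
            if alert in seen:
                problems.append(f"{where}: duplicate alert name")
            seen.add(alert)
            if RANGE_COMPARED.search(rule.get("expr", "")):
                problems.append(f"{where}: range selector compared directly")
    return problems


# Hypothetical two-rule document, used only to exercise the checker.
SAMPLE = json.loads("""
{
  "groups": [
    {
      "name": "example",
      "rules": [
        {
          "alert": "ExampleOk",
          "expr": "rate(example_total[5m]) > 0.1",
          "for": "5m",
          "labels": {"severity": "critical"},
          "annotations": {"summary": "ok"}
        },
        {
          "alert": "ExampleBad",
          "expr": "example_gauge[10m] > 25",
          "labels": {"severity": "critical"}
        }
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    for problem in check_rule_groups(SAMPLE):
        print(problem)
```

To check a real file, one could replace `SAMPLE` with `json.load(open(path))`, where `path` points at the saved rule document.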