Occasionally all nodes fail in Elasticsearch

Problem description

I have a four-node ES cluster (64 vCores, 60 GB RAM) with a 28 GB ES heap. I have 21 million documents to index. The documents are very complex and contain many nested documents.

I bulk-index these documents from a Spark application using elasticsearch-hadoop, with 140 threads each sending 2 MB of data per bulk request.
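
For context, a minimal sketch of what the writer side of such a job looks like with elasticsearch-hadoop's Spark SQL support. The es.nodes and es.batch.* keys are standard elasticsearch-hadoop settings; the input path, index name, and the retry values are assumptions for illustration, not the job's actual configuration:

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object BulkIndexJob extends App {
  val spark = SparkSession.builder().appName("bulk-index").getOrCreate()

  // Placeholder source for the 21M nested documents.
  val docs = spark.read.parquet("/data/customer-docs")

  docs.saveToEs("customers_201850/_doc", Map(
    "es.nodes"                   -> "10.132.15.199,10.132.15.200,10.132.15.201,10.132.15.202",
    "es.batch.size.bytes"        -> "2mb", // the 2 MB per-thread bulk size described above
    "es.batch.size.entries"      -> "0",   // flush on bytes only (0 disables the entry limit)
    "es.batch.write.retry.count" -> "6",   // retries before surfacing "all nodes failed"
    "es.batch.write.retry.wait"  -> "30s"  // back off long enough to outlast a GC pause
  ))
}

Raising es.batch.write.retry.count and es.batch.write.retry.wait would not remove the GC pauses, but it should let the client ride them out instead of logging the connection error.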

Occasionally I run into the following exception:

Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.132.15.200:9200, 10.132.15.201:9200, 10.132.15.202:9200, 10.132.15.199:9200]] 

My guess is that during these moments all nodes are busy with stop-the-world garbage collection and therefore cannot respond to the indexing requests.

The exception does not fail the application; indexing continues after a few seconds.

I also started monitoring the cluster logs on one of the nodes to see what was going on:

[2019-03-13T07:19:52,377][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16762] overhead, spent [583ms] collecting in the last [1s]
[2019-03-13T07:19:53,821][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16763] overhead, spent [939ms] collecting in the last [1.4s]
[2019-03-13T07:20:56,995][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16826] overhead, spent [395ms] collecting in the last [1s]
[2019-03-13T07:20:57,995][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16827] overhead, spent [481ms] collecting in the last [1s]
[2019-03-13T07:23:54,591][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17003] overhead, spent [303ms] collecting in the last [1s]
[2019-03-13T07:24:15,864][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17024] overhead, spent [542ms] collecting in the last [1.2s]
[2019-03-13T07:24:25,866][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17034] overhead, spent [266ms] collecting in the last [1s]
[2019-03-13T07:24:34,223][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17042] overhead, spent [454ms] collecting in the last [1.3s]
[2019-03-13T07:25:35,255][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17103] overhead, spent [264ms] collecting in the last [1s]
[2019-03-13T07:26:01,835][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17129] overhead, spent [682ms] collecting in the last [1.5s]
[2019-03-13T07:26:04,915][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17132] overhead, spent [326ms] collecting in the last [1s]
[2019-03-13T07:26:52,089][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17179] overhead, spent [375ms] collecting in the last [1s]
[2019-03-13T07:27:38,249][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17225] overhead, spent [277ms] collecting in the last [1s]
[2019-03-13T07:28:02,429][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17249] overhead, spent [540ms] collecting in the last [1s]
[2019-03-13T07:28:03,430][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17250] overhead, spent [415ms] collecting in the last [1s]
[2019-03-13T07:28:09,508][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17256] overhead, spent [274ms] collecting in the last [1s]
[2019-03-13T07:28:43,642][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17290] overhead, spent [660ms] collecting in the last [1s]
[2019-03-13T07:28:44,659][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17291] overhead, spent [260ms] collecting in the last [1s]
[2019-03-13T07:29:18,766][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17325] overhead, spent [284ms] collecting in the last [1s]
[2019-03-13T07:31:10,090][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17436] overhead, spent [275ms] collecting in the last [1s]
[2019-03-13T07:31:59,359][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17485] overhead, spent [252ms] collecting in the last [1s]
[2019-03-13T07:32:24,453][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17510] overhead, spent [339ms] collecting in the last [1s]
[2019-03-13T07:33:08,570][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17554] overhead, spent [411ms] collecting in the last [1s]
[2019-03-13T07:35:19,122][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][young][17684][9041] duration [881ms], collections [1]/[1.3s], total [881ms]/[14.1m], memory [75.4gb]->[74.7gb]/[117.6gb], all_pools {[young] [1.2gb]->[3.2mb]/[2.7gb]}{[survivor] [306.3mb]->[357.7mb]/[357.7mb]}{[old] [73.8gb]->[74.3gb]/[114.5gb]}
[2019-03-13T07:35:19,122][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17684] overhead, spent [881ms] collecting in the last [1.3s]
[2019-03-13T07:35:26,209][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17691] overhead, spent [346ms] collecting in the last [1s]
[2019-03-13T07:36:02,609][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17727] overhead, spent [361ms] collecting in the last [1.3s]
[2019-03-13T07:36:15,642][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] now throttling indexing: numMergesInFlight=10, maxNumMerges=9
[2019-03-13T07:36:19,649][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] stop throttling indexing: numMergesInFlight=8, maxNumMerges=9

After reading the logs, I have a few questions.

Does the following log line mean that ES spent 339 ms out of the last 1,000 ms on garbage collection?

[2019-03-13T07:32:24,453][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17510] overhead, spent [339ms] collecting in the last [1s]

This is definitely a point where a GC ran and memory was reclaimed. Am I right?

[2019-03-13T07:35:19,122][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][young][17684][9041] duration [881ms], collections [1]/[1.3s], total [881ms]/[14.1m], memory [75.4gb]->[74.7gb]/[117.6gb], all_pools {[young] [1.2gb]->[3.2mb]/[2.7gb]}{[survivor] [306.3mb]->[357.7mb]/[357.7mb]}{[old] [73.8gb]->[74.3gb]/[114.5gb]} 

And this is where ES throttled the indexing process because of segment merges:

[2019-03-13T07:36:15,642][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] now throttling indexing: numMergesInFlight=10, maxNumMerges=9
[2019-03-13T07:36:19,649][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] stop throttling indexing: numMergesInFlight=8, maxNumMerges=9 

How can we minimize stop-the-world GC pauses, and how can we keep these merges from throttling the indexing process?

Tags: elasticsearch, jvm, heap-memory

Solution
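
The merge throttling fires when numMergesInFlight exceeds maxNumMerges, i.e. new segments are being produced faster than they can be merged away. A common mitigation for a one-off bulk load is to produce fewer segments in the first place: suspend the automatic refresh, drop replicas, and relax translog durability while loading, then restore the settings afterwards. Below is a minimal sketch of applying such settings over the _settings API, assuming Java 11+ for the built-in HTTP client and reusing the node address and index name from the logs above; the chosen values are illustrative, not a verified fix for this cluster.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object TuneIndexForBulkLoad extends App {
  // Illustrative values: refresh_interval=-1 suspends the periodic refresh that
  // produces a small segment every second, number_of_replicas=0 avoids doing the
  // indexing work twice (if the index has replicas), and async translog trades
  // durability for throughput. Restore all three once the bulk load finishes.
  val settings =
    """{
      |  "index": {
      |    "refresh_interval": "-1",
      |    "number_of_replicas": 0,
      |    "translog.durability": "async"
      |  }
      |}""".stripMargin

  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://10.132.15.200:9200/customers_201850/_settings"))
    .header("Content-Type", "application/json")
    .PUT(HttpRequest.BodyPublishers.ofString(settings))
    .build()

  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")
}

On the GC side, the 28 GB heap is already below the ~31 GB compressed-oops threshold, so the remaining lever is indexing pressure itself: fewer concurrent client threads, or larger but less frequent bulk requests, leave the heap more room to recover between young-generation collections.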

