Elasticsearch: a particular shard keeps initializing on different data nodes

Problem description

I am receiving an ElasticsearchStatusWarning saying the cluster status is yellow. After running the cluster health API, I see the following:

curl -X GET http://localhost:9200/_cluster/health/

{"cluster_name":"my-elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":8,"number_of_data_nodes":3,"active_primary_shards":220,"active_shards":438,"relocating_shards":0,"initializing_shards":2,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":99.54545454545455}

initializing_shards is 2, so I dug further with the following call:

curl -X GET http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason |grep INIT

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33457  100 33457    0     0  79609      0 --:--:-- --:--:-- --:--:-- 79659

graph_vertex_24_18549 0 r INITIALIZING ALLOCATION_FAILED
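The progress meter in that output comes from curl itself; adding -s suppresses it so only the shard listing comes through the pipe (same query, same endpoint):

curl -s -X GET "http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep INIT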

curl -X GET http://localhost:9200/_cat/shards/graph_vertex_24_18549

graph_vertex_24_18549 0 p STARTED      8373375 8.4gb IP1   elasticsearch-data-1
graph_vertex_24_18549 0 r INITIALIZING               IP2 elasticsearch-data-2
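To see whether the replica is actually copying data or starting over each time, the recovery status for that index can be polled (a sketch, assuming the same index name; active_only limits the listing to recoveries still in progress):

curl -s -X GET "http://localhost:9200/_cat/recovery/graph_vertex_24_18549?v&active_only=true"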

Re-running the same command a few minutes later shows it now initializing on elasticsearch-data-0. See below:

graph_vertex_24_18549 0 p STARTED      8373375 8.4gb IP1   elasticsearch-data-1
graph_vertex_24_18549 0 r INITIALIZING               IP0   elasticsearch-data-0

If I re-run it again a few minutes after that, I can see it being initialized on elasticsearch-data-2 once more. But it never reaches STARTED.
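Rather than re-running the command by hand, the bouncing can be observed continuously (a small convenience sketch, assuming a Linux shell with watch available):

watch -n 30 'curl -s http://localhost:9200/_cat/shards/graph_vertex_24_18549'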

curl -X GET http://localhost:9200/_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
   147      162.2gb   183.8gb    308.1gb      492gb           37 IP1 IP1 elasticsearch-data-2
   146      217.3gb   234.2gb    257.7gb      492gb           47 IP2   IP2   elasticsearch-data-1
   147      216.6gb   231.2gb    260.7gb      492gb           47 IP3  IP3  elasticsearch-data-0
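Disk usage is under 50% on all three data nodes, so disk-based allocation limits look unlikely to be the cause; the effective watermarks can still be confirmed like this (a sketch; include_defaults also prints settings that were never overridden):

curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark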

curl -X GET http://localhost:9200/_cat/nodes?v

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
IP1            7          77  20    4.17    4.57     4.88 mi        -      elasticsearch-master-2
IP2          72          59   7    2.59    2.38     2.19 i         -      elasticsearch-5f4bd5b88f-4lvxz
IP3           57          49   3    0.75    1.13     1.09 di        -      elasticsearch-data-2
IP4           63          57  21    2.69    3.58     4.11 di        -      elasticsearch-data-0
IP5            5          59   7    2.59    2.38     2.19 mi        -      elasticsearch-master-0
IP6            69          53  13    4.67    4.60     4.66 di        -      elasticsearch-data-1
IP7           8          70  14    2.86    3.20     3.09 mi        *      elasticsearch-master-1
IP8           30          77  20    4.17    4.57     4.88 i         -      elasticsearch-5f4bd5b88f-wnrl4
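Since the error below is about heap, the node listing can also be pulled with explicit heap columns to compare the data nodes' current usage against their configured maximum (a sketch using standard _cat/nodes column names):

curl -s "http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.current,heap.percent,heap.max"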

curl -s -XGET http://localhost:9200/_cluster/allocation/explain -d '{ "index": "graph_vertex_24_18549", "shard": 0, "primary": false }' -H 'Content-Type: application/json'

{"index":"graph_vertex_24_18549","shard":0,"primary":false,"current_state":"initializing","unassigned_info":{"reason":"ALLOCATION_FAILED","at":"2020-11-04T08:21:45.756Z","failed_allocation_attempts":1,"details":"failed shard on node [1XEXS92jTK-wwanNgQrxsA]: failed to perform indices:data/write/bulk[s] on replica [graph_vertex_24_18549][0], node[1XEXS92jTK-wwanNgQrxsA], [R], s[STARTED], a[id=RnTOlfQuQkOumVuw_NeuTw], failure RemoteTransportException[[elasticsearch-data-2][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4322682690/4gb], which is larger than the limit of [4005632409/3.7gb], real usage: [3646987112/3.3gb], new bytes reserved: [675695578/644.3mb]]; ","last_allocation_status":"no_attempt"},"current_node":{"id":"o_9jyrmOSca9T12J4bY0Nw","name":"elasticsearch-data-0","transport_address":"IP:9300"},"explanation":"the shard is in the process of initializing on node [elasticsearch-data-0], wait until initialization has completed"}

The thing is, I was already getting alerts about unassigned shards earlier because of the same exception as above: "CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4322682690/4gb], which is larger than the limit of [4005632409/3.7gb]"

But at that time the heap was only 2 GB. I increased it to 4 GB. Now I am seeing the same error again, only this time for an initializing shard instead of an unassigned one.
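To see how close each node sits to the parent breaker limit after the heap increase (the limit defaults to a percentage of the JVM heap, governed by indices.breaker.total.limit), the per-node breaker statistics can be queried (a sketch; the parent entry reports the limit, the estimated usage and how often the breaker has tripped):

curl -s "http://localhost:9200/_nodes/stats/breaker?pretty"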

How can I remedy this?

Tags: elasticsearch, elastic-stack, sharding, elk

Solution

