
Analyzing and Fixing Slow Elasticsearch Indexing and Search

Overview

Elasticsearch is a distributed, free and open search and analytics engine that provides near-real-time search over data. In practice, a variety of causes can make cluster writes or queries slow; this article walks through several common causes and how to resolve them.

Write rejections or slow indexing

Symptoms

When indexing (storing documents and making them searchable) or searching data, you may see errors with a 429 status code similar to the following:

"status": 429, "error": {"type": "es_rejected_execution_exception", "reason": "rejected execution of org.elasticsearch.transport.TransportService$7@77c11b3c on EsThreadPoolExecutor[name = VM-1-1-1-1/write, queue capacity = 800, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4349a9ab[Running, pool size = 32, active threads = 32, queued tasks = 800, completed tasks = 13026004]]"}}

Troubleshooting

  1. Check the Indexing Rate metric to confirm whether the write rate is too high (a minimal sampling sketch follows this list)
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store them and compute the rate:
    http://192.168.1.12:9200/_stats
  2. Check the current ThreadpoolWriteQueue and ThreadpoolWriteRejected metrics to confirm whether writes are too slow or write concurrency is too high, filling up a node's write queue. In 7.x the write queue size defaults to 10000 per node, and the thread pool size defaults to the node's number of CPU cores.
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store and track them:
    http://192.168.1.12:9200/_cat/thread_pool/write?v&h=id,name,active,queue,rejected,completed
  3. If appropriate, enable the indexing slow log to see which writes take the most time, and optimize them accordingly:
    curl -X PUT "192.168.1.12:9200/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.threshold.index.info": "5s",
    "index.indexing.slowlog.threshold.index.debug": "2s",
    "index.indexing.slowlog.threshold.index.trace": "500ms",
    "index.indexing.slowlog.level": "info",
    "index.indexing.slowlog.source": "1000"
    }
    '
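
A minimal sketch for step 1, assuming the same node address as above and that jq is installed: it samples the cluster-wide index_total counter from the _stats API twice and derives an approximate documents-per-second write rate.

    # sample the indexing counter twice, 10 seconds apart, and compute docs indexed per second
    ES=http://192.168.1.12:9200
    T1=$(curl -s "$ES/_stats/indexing" | jq '._all.total.indexing.index_total')
    sleep 10
    T2=$(curl -s "$ES/_stats/indexing" | jq '._all.total.indexing.index_total')
    echo "indexing rate: $(( (T2 - T1) / 10 )) docs/s"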

Resolution

  1. Add retries on the client side, with a randomized backoff interval
  2. Use the bulk API to write in batches
  3. Increase the index refresh interval to reduce overhead (see the sketch after this list)
  4. Reduce the number of replica shards
  5. Disable swap
  6. Leave half of system memory to the filesystem cache, since I/O is heavy
  7. Store the data on fast disks (e.g. SSDs)
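
As a hedged example for items 3 and 4, the settings call below raises the refresh interval and lowers the replica count on one index; the index name and the values shown are illustrative only and should be tuned to your own latency and durability requirements.

    curl -X PUT "192.168.1.12:9200/test-log-web-2021.02.19/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.refresh_interval": "30s",
    "index.number_of_replicas": 1
    }
    '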

Search rejections or slow queries

Troubleshooting

  1. Check the Search Rate metric to confirm whether the query rate is too high
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store them and compute the rate:
    http://192.168.1.12:9200/_stats
  2. Check the current ThreadpoolSearchQueue and ThreadpoolSearchRejected metrics to confirm whether queries are too slow or query concurrency is too high, filling up a node's search queue (a polling sketch follows this list). In 7.x the search queue size defaults to 1000 per node, and the thread pool size defaults to int((# of CPU cores * 3) / 2) + 1.
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store and track them:
    http://192.168.1.12:9200/_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed
  3. If appropriate, enable the search slow log to see which queries take the most time, and optimize them accordingly:
    curl -X PUT "192.168.1.12:9200/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.query.debug": "2s",
    "index.search.slowlog.threshold.query.trace": "500ms",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "800ms",
    "index.search.slowlog.threshold.fetch.debug": "500ms",
    "index.search.slowlog.threshold.fetch.trace": "200ms",
    "index.search.slowlog.level": "info"
    }
    '
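
A minimal sketch for step 2, assuming the same node address: it polls the search thread pool every few seconds, so you can see whether the queue and rejected counters keep growing, which would indicate a saturated search queue.

    # watch the search thread pool; steadily growing queue/rejected values mean the pool is saturated
    while true; do
      curl -s "http://192.168.1.12:9200/_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed"
      sleep 5
    done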

Resolution

  1. Use the slow query log to find and optimize the offending query statements
  2. Disable swap
  3. The filesystem cache may not have enough memory to hold the frequently queried parts of the index. Elasticsearch's query cache uses an LRU eviction policy: when the cache fills up, the least recently used data is evicted to make room for new data. Leave half of system memory to the filesystem cache, since I/O is heavy (one way to check cache pressure is sketched below).
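
A minimal check for item 3, assuming the same node address: the cat nodes API can show heap usage together with per-node query cache size and eviction counts, and a continuously climbing eviction count suggests the caches are under memory pressure.

    curl -s "http://192.168.1.12:9200/_cat/nodes?v&h=name,heap.percent,query_cache.memory_size,query_cache.evictions"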

High cluster CPU/memory usage

Symptoms

  1. There are no indexing or search requests, yet the cluster still consumes significant resources
  2. CPU and memory usage at the operating-system level is high
  3. The cluster is occasionally unresponsive
  4. The elasticsearch process is using a lot of CPU

Troubleshooting

  1. Check the shard count. The official recommendation is to keep the number of non-frozen shards below 20 per GB of configured heap on each node, because shards consume CPU and memory even when no requests are being served.
  2. Check whether processes other than elasticsearch are running on the host, and how much resource they are using
  3. Check the elasticsearch logs for GC activity, and check each node's JVM status to make sure it is healthy. By default elasticsearch starts GC once heap usage exceeds 75%, which indicates the node is under memory pressure; above 90%, performance suffers severely, GC pauses of 10 to 30 seconds can occur, and the node may even hit OOM.
    1) View it in Kibana or another monitoring tool
    2) Via the API:
    http://192.168.1.12:9200/_cat/nodes?v=true
  4. Use the nodes hot threads API and the task management API to determine which threads or operations are consuming the resources.
# curl http://192.168.1.12:9200/_nodes/hot_threads?human=true
::: {iz2zedw788ifnqbcj4wygzz}{-PPLeiJfSp-JMbh-_ONsHA}{gsak_FfmTmK361M7W5wTOw}{192.168.1.12}{192.168.1.12:9300}{ml.machine_memory=50476195840, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2021-02-19T08:29:41.502, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   67.3% (336.6ms out of 500ms) cpu usage by thread 'elasticsearch[iz2zedw788ifnqbcj4wygzz][search][T#16]'
     4/10 snapshots sharing following 59 elements
       org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:279)
       org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:211)
       org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:261)
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:147)

# curl "http://192.168.1.12:9200/_tasks?detailed" | jq
{
  "nodes" : {
    "-PPLeiJfSp-JMbh-_ONsHA" : {
      "name" : "test",
      "transport_address" : "192.168.1.12:9300",
      "host" : "192.168.1.12",
      "ip" : "192.168.1.12:9300",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "ml.machine_memory" : "50476195840",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "tasks": {
        "-PPLeiJfSp-JMbh-_ONsHA:675047607": {          
        "node": "-PPLeiJfSp-JMbh-_ONsHA",
        "id": 675047607,
        "type": "transport",
        "action": "indices:data/read/search",
        "description": "indices[test-log-web-*], types[], search_type[QUERY_THEN_FETCH], source[{\"size\":1000,\"query\":{\"function_score\":{\"query\":{\"bool\":{\"must\":[{\"term\":{\"tags\":{\"value\":\"parse_success\",\"boost\":1.0}}},{\"nested\":{\"query\":{\"bool\":{\"must\":[{\"match\":{\"test.perspective.domain\":{\"query\":\"cs.xunyou.com\",\"operator\":\"OR\",\"prefix_length\":0,\"max_expansions\":50,\"fuzzy_transpositions\":true,\"lenient\":false,\"zero_terms_query\":\"NONE\",\"auto_generate_synonyms_phrase_query\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"path\":\"test.perspective\",\"ignore_unmapped\":false,\"score_mode\":\"avg\",\"boost\":1.0}},{\"range\":{\"@timestamp\":{\"from\":\"2021-02-18T17:16:28+0800\",\"to\":\"2021-02-19T17:16:28+0800\",\"include_lower\":false,\"include_upper\":false,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"functions\":[{\"filter\":{\"match_all\":{\"boost\":1.0}},\"random_score\":{}}],\"score_mode\":\"multiply\",\"max_boost\":3.4028235E38,\"boost\":1.0}},\"_source\":{\"includes\":[\"test.perspective\"],\"excludes\":[]}}]",
          "start_time_in_millis": 1613726202655,
          "running_time_in_nanos": 11466239140,
          "cancellable": true,
          "headers": {}
        }
      }
    }
  }
}  

The description field returned by the task management API identifies the specific query that is running, and the running_time_in_nanos field shows how long it has been running. To reduce CPU usage, you can cancel search queries that are consuming a lot of CPU: the task management API supports a _cancel call for any task whose cancellable field is true, addressed by its task ID, e.g. "-PPLeiJfSp-JMbh-_ONsHA:675047607" in the example above.
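
A minimal sketch built on these fields, assuming jq is installed: it lists the cancellable tasks that have been running for more than 30 seconds, so you can decide which task IDs to pass to _cancel.

curl -s "http://192.168.1.12:9200/_tasks?detailed" | jq -r '
  .nodes[].tasks | to_entries[]
  # keep only cancellable tasks that have been running for more than 30 seconds
  | select(.value.cancellable and .value.running_time_in_nanos > 30000000000)
  # print the task id, running time in seconds, and the action name
  | "\(.key)\t\(.value.running_time_in_nanos / 1e9 | floor)s\t\(.value.action)"'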

Resolution

  1. Plan shard counts sensibly and do capacity planning. For example, use a hot-warm architecture: keep frequently used data on a high-spec hot tier and move rarely used cold data to a low-spec cold tier dedicated to storage.
  2. Avoid running elasticsearch on the same hardware as other resource-intensive applications.
  3. Monitor heap usage and set up alerts, so that problems can be followed up and handled promptly (a simple check is sketched after this list).
  4. Cancel the write or query tasks that are consuming the resources:
    # curl -X POST "http://192.168.1.12:9200/_tasks/-PPLeiJfSp-JMbh-_ONsHA:675047607/_cancel?pretty"
    {
      "nodes" : {
        "-PPLeiJfSp-JMbh-_ONsHA" : {
          "name" : "iz2zedw788ifnqbcj4wygzz",
          "transport_address" : "192.168.1.12:9300",
          "host" : "192.168.1.12",
          "ip" : "192.168.1.12:9300",
          "roles" : [
            "master",
            "data",
            "ingest"
          ],
          "attributes" : {
            "ml.machine_memory" : "50476195840",
            "xpack.installed" : "true",
            "ml.max_open_jobs" : "20",
            "ml.enabled" : "true"
          },
          "tasks" : {
            "-PPLeiJfSp-JMbh-_ONsHA:675047607" : {
              "node" : "-PPLeiJfSp-JMbh-_ONsHA",
              "id" : 675047607,
              "type" : "transport",
              "action" : "indices:data/read/search",
              "start_time_in_millis" : 1613726202655,
              "running_time_in_nanos" : 40438340371,
              "cancellable" : true,
              "headers" : { }
            }
          }
        }
      }
    }
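
As a minimal check for item 3 (assuming the same node address), the cat nodes API exposes per-node heap usage; wiring something like the line below into a cron job or an existing monitoring system, with a threshold of around 85%, is one simple way to get alerted before GC pressure becomes severe.

    # print any node whose heap usage exceeds 85%
    curl -s "http://192.168.1.12:9200/_cat/nodes?h=name,heap.percent" | awk '$2 > 85 {print $1 " heap at " $2 "%"}'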

Summary

Elasticsearch consumes CPU, memory, and I/O, so once the data volume reaches a certain scale all sorts of problems can appear. Some are caused by the query statements themselves, others by resource shortages; locate the root cause first, and the problem can then be worked through and resolved.

References

https://www.elastic.co/guide/en/elasticsearch/reference/7.11/tasks.html
https://www.elastic.co/cn/blog/implementing-hot-warm-cold-in-elasticsearch-with-index-lifecycle-management
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
https://aws.amazon.com/cn/premiumsupport/knowledge-center/resolve-429-error-es/
https://aws.amazon.com/cn/premiumsupport/knowledge-center/es-high-cpu-troubleshoot/
https://www.elastic.co/cn/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries
https://www.elastic.co/guide/cn/elasticsearch/guide/current/_monitoring_individual_nodes.html

Original: https://blog.51cto.com/leejia/2631971
