
Analyzing and Fixing Slow Elasticsearch Indexing and Search

Overview

Elasticsearch is a distributed, free and open search and analytics engine that provides near-real-time search over data. In practice, a variety of causes can make cluster writes or queries slow; this article walks through several common causes and how to resolve them.

Write rejections or slow indexing

Symptoms

When indexing (storing documents and making them searchable) or searching data, you may see errors with a 429 status code similar to the following:

"status": 429, "error": {"type": "es_rejected_execution_exception", "reason": "rejected execution of org.elasticsearch.transport.TransportService$7@77c11b3c on EsThreadPoolExecutor[name = VM-1-1-1-1/write, queue capacity = 800, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4349a9ab[Running, pool size = 32, active threads = 32, queued tasks = 800, completed tasks = 13026004]]"}}

Troubleshooting

  1. Check the Indexing Rate metric to confirm whether the write rate is too high (a minimal sampling sketch follows this list)
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store them and compute the rate:
    http://192.168.1.12:9200/_stats
  2. Check the current ThreadpoolWriteQueue and ThreadpoolWriteRejected metrics to confirm whether writes are too slow or write concurrency is too high, filling up a node's write queue. In 7.x the write queue size defaults to 10000 per node, and the thread pool size defaults to the node's number of CPU cores.
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store and track them:
    http://192.168.1.12:9200/_cat/thread_pool/write?v&h=id,name,active,queue,rejected,completed
  3. If appropriate, enable the indexing slow log to see which writes take the most time, and optimize them accordingly:
    curl -X PUT "192.168.1.12:9200/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.threshold.index.info": "5s",
    "index.indexing.slowlog.threshold.index.debug": "2s",
    "index.indexing.slowlog.threshold.index.trace": "500ms",
    "index.indexing.slowlog.level": "info",
    "index.indexing.slowlog.source": "1000"
    }
    '
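
A minimal sketch for step 1, assuming the same node address as above and that jq is installed: it samples the cluster-wide index_total counter from the _stats API twice and derives an approximate documents-per-second write rate.

    # sample the indexing counter twice, 10 seconds apart, and compute docs indexed per second
    ES=http://192.168.1.12:9200
    T1=$(curl -s "$ES/_stats/indexing" | jq '._all.total.indexing.index_total')
    sleep 10
    T2=$(curl -s "$ES/_stats/indexing" | jq '._all.total.indexing.index_total')
    echo "indexing rate: $(( (T2 - T1) / 10 )) docs/s"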

Resolution

  1. Add retries on the client side, with a randomized backoff interval
  2. Use the bulk API to write in batches
  3. Increase the index refresh interval to reduce overhead (see the sketch after this list)
  4. Reduce the number of replica shards
  5. Disable swap
  6. Leave half of system memory to the filesystem cache, since I/O is heavy
  7. Store the data on fast disks (e.g. SSDs)
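
As a hedged example for items 3 and 4, the settings call below raises the refresh interval and lowers the replica count on one index; the index name and the values shown are illustrative only and should be tuned to your own latency and durability requirements.

    curl -X PUT "192.168.1.12:9200/test-log-web-2021.02.19/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.refresh_interval": "30s",
    "index.number_of_replicas": 1
    }
    '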

Search rejections or slow queries

Troubleshooting

  1. Check the Search Rate metric to confirm whether the query rate is too high
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store them and compute the rate:
    http://192.168.1.12:9200/_stats
  2. Check the current ThreadpoolSearchQueue and ThreadpoolSearchRejected metrics to confirm whether queries are too slow or query concurrency is too high, filling up a node's search queue (a polling sketch follows this list). In 7.x the search queue size defaults to 1000 per node, and the thread pool size defaults to int((# of CPU cores * 3) / 2) + 1.
    1) View it in Kibana or another monitoring tool
    2) Pull the current values from the API yourself, then store and track them:
    http://192.168.1.12:9200/_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed
  3. If appropriate, enable the search slow log to see which queries take the most time, and optimize them accordingly:
    curl -X PUT "192.168.1.12:9200/_settings?pretty" -H 'Content-Type: application/json' -d'
    {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.query.debug": "2s",
    "index.search.slowlog.threshold.query.trace": "500ms",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "800ms",
    "index.search.slowlog.threshold.fetch.debug": "500ms",
    "index.search.slowlog.threshold.fetch.trace": "200ms",
    "index.search.slowlog.level": "info"
    }
    '
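
A minimal sketch for step 2, assuming the same node address: it polls the search thread pool every few seconds, so you can see whether the queue and rejected counters keep growing, which would indicate a saturated search queue.

    # watch the search thread pool; steadily growing queue/rejected values mean the pool is saturated
    while true; do
      curl -s "http://192.168.1.12:9200/_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed"
      sleep 5
    done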

Resolution

  1. Use the slow query log to find and optimize the offending query statements
  2. Disable swap
  3. The filesystem cache may not have enough memory to hold the frequently queried parts of the index. Elasticsearch's query cache uses an LRU eviction policy: when the cache fills up, the least recently used data is evicted to make room for new data. Leave half of system memory to the filesystem cache, since I/O is heavy (one way to check cache pressure is sketched below).
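
A minimal check for item 3, assuming the same node address: the cat nodes API can show heap usage together with per-node query cache size and eviction counts, and a continuously climbing eviction count suggests the caches are under memory pressure.

    curl -s "http://192.168.1.12:9200/_cat/nodes?v&h=name,heap.percent,query_cache.memory_size,query_cache.evictions"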

High cluster CPU/memory usage

Symptoms

  1. There are no indexing or search requests, yet the cluster still consumes significant resources
  2. CPU and memory usage at the operating-system level is high
  3. The cluster is occasionally unresponsive
  4. The elasticsearch process is using a lot of CPU

Troubleshooting

  1. Check the shard count. The official recommendation is to keep the number of non-frozen shards below 20 per GB of configured heap on each node, because shards consume CPU and memory even when no requests are being served.
  2. Check whether processes other than elasticsearch are running on the host, and how much resource they are using
  3. Check the elasticsearch logs for GC activity, and check each node's JVM status to make sure it is healthy. By default elasticsearch starts GC once heap usage exceeds 75%, which indicates the node is under memory pressure; above 90%, performance suffers severely, GC pauses of 10 to 30 seconds can occur, and the node may even hit OOM.
    1) View it in Kibana or another monitoring tool
    2) Via the API:
    http://192.168.1.12:9200/_cat/nodes?v=true
  4. Use the nodes hot threads API and the task management API to determine which threads or operations are consuming the resources.
# curl http://192.168.1.12:9200/_nodes/hot_threads?human=true
::: {iz2zedw788ifnqbcj4wygzz}{-PPLeiJfSp-JMbh-_ONsHA}{gsak_FfmTmK361M7W5wTOw}{192.168.1.12}{192.168.1.12:9300}{ml.machine_memory=50476195840, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2021-02-19T08:29:41.502, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   67.3% (336.6ms out of 500ms) cpu usage by thread 'elasticsearch[iz2zedw788ifnqbcj4wygzz][search][T#16]'
     4/10 snapshots sharing following 59 elements
       org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:279)
       org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:211)
       org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:261)
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:147)

# curl "http://192.168.1.12:9200/_tasks?detailed" | jq
{
  "nodes" : {
    "-PPLeiJfSp-JMbh-_ONsHA" : {
      "name" : "test",
      "transport_address" : "192.168.1.12:9300",
      "host" : "192.168.1.12",
      "ip" : "192.168.1.12:9300",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "ml.machine_memory" : "50476195840",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "tasks": {
        "-PPLeiJfSp-JMbh-_ONsHA:675047607": {          
        "node": "-PPLeiJfSp-JMbh-_ONsHA",
        "id": 675047607,
        "type": "transport",
        "action": "indices:data/read/search",
        "description": "indices[test-log-web-*], types[], search_type[QUERY_THEN_FETCH], source[{\"size\":1000,\"query\":{\"function_score\":{\"query\":{\"bool\":{\"must\":[{\"term\":{\"tags\":{\"value\":\"parse_success\",\"boost\":1.0}}},{\"nested\":{\"query\":{\"bool\":{\"must\":[{\"match\":{\"test.perspective.domain\":{\"query\":\"cs.xunyou.com\",\"operator\":\"OR\",\"prefix_length\":0,\"max_expansions\":50,\"fuzzy_transpositions\":true,\"lenient\":false,\"zero_terms_query\":\"NONE\",\"auto_generate_synonyms_phrase_query\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"path\":\"test.perspective\",\"ignore_unmapped\":false,\"score_mode\":\"avg\",\"boost\":1.0}},{\"range\":{\"@timestamp\":{\"from\":\"2021-02-18T17:16:28+0800\",\"to\":\"2021-02-19T17:16:28+0800\",\"include_lower\":false,\"include_upper\":false,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"functions\":[{\"filter\":{\"match_all\":{\"boost\":1.0}},\"random_score\":{}}],\"score_mode\":\"multiply\",\"max_boost\":3.4028235E38,\"boost\":1.0}},\"_source\":{\"includes\":[\"test.perspective\"],\"excludes\":[]}}]",
          "start_time_in_millis": 1613726202655,
          "running_time_in_nanos": 11466239140,
          "cancellable": true,
          "headers": {}
        }
      }
    }
  }
}  

The description field returned by the task management API identifies the specific query that is running, and the running_time_in_nanos field shows how long it has been running. To reduce CPU usage, you can cancel search queries that are consuming a lot of CPU: the task management API supports a _cancel call for any task whose cancellable field is true, addressed by its task ID, e.g. "-PPLeiJfSp-JMbh-_ONsHA:675047607" in the example above.
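
A minimal sketch built on these fields, assuming jq is installed: it lists the cancellable tasks that have been running for more than 30 seconds, so you can decide which task IDs to pass to _cancel.

curl -s "http://192.168.1.12:9200/_tasks?detailed" | jq -r '
  .nodes[].tasks | to_entries[]
  # keep only cancellable tasks that have been running for more than 30 seconds
  | select(.value.cancellable and .value.running_time_in_nanos > 30000000000)
  # print the task id, running time in seconds, and the action name
  | "\(.key)\t\(.value.running_time_in_nanos / 1e9 | floor)s\t\(.value.action)"'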

Resolution

  1. Plan shard counts sensibly and do capacity planning. For example, use a hot-warm architecture: keep frequently used data on a high-spec hot tier and move rarely used cold data to a low-spec cold tier dedicated to storage.
  2. Avoid running elasticsearch on the same hardware as other resource-intensive applications.
  3. Monitor heap usage and set up alerts, so that problems can be followed up and handled promptly (a simple check is sketched after this list).
  4. Cancel the write or query tasks that are consuming the resources:
    # curl -X POST "http://192.168.1.12:9200/_tasks/-PPLeiJfSp-JMbh-_ONsHA:675047607/_cancel?pretty"
    {
      "nodes" : {
        "-PPLeiJfSp-JMbh-_ONsHA" : {
          "name" : "iz2zedw788ifnqbcj4wygzz",
          "transport_address" : "192.168.1.12:9300",
          "host" : "192.168.1.12",
          "ip" : "192.168.1.12:9300",
          "roles" : [
            "master",
            "data",
            "ingest"
          ],
          "attributes" : {
            "ml.machine_memory" : "50476195840",
            "xpack.installed" : "true",
            "ml.max_open_jobs" : "20",
            "ml.enabled" : "true"
          },
          "tasks" : {
            "-PPLeiJfSp-JMbh-_ONsHA:675047607" : {
              "node" : "-PPLeiJfSp-JMbh-_ONsHA",
              "id" : 675047607,
              "type" : "transport",
              "action" : "indices:data/read/search",
              "start_time_in_millis" : 1613726202655,
              "running_time_in_nanos" : 40438340371,
              "cancellable" : true,
              "headers" : { }
            }
          }
        }
      }
    }
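
As a minimal check for item 3 (assuming the same node address), the cat nodes API exposes per-node heap usage; wiring something like the line below into a cron job or an existing monitoring system, with a threshold of around 85%, is one simple way to get alerted before GC pressure becomes severe.

    # print any node whose heap usage exceeds 85%
    curl -s "http://192.168.1.12:9200/_cat/nodes?h=name,heap.percent" | awk '$2 > 85 {print $1 " heap at " $2 "%"}'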

Summary

Elasticsearch consumes CPU, memory, and I/O, so once the data volume reaches a certain scale all sorts of problems can appear. Some are caused by the query statements themselves, others by resource shortages; locate the root cause first, and the problem can then be worked through and resolved.

References

https://www.elastic.co/guide/en/elasticsearch/reference/7.11/tasks.html
https://www.elastic.co/cn/blog/implementing-hot-warm-cold-in-elasticsearch-with-index-lifecycle-management
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
https://aws.amazon.com/cn/premiumsupport/knowledge-center/resolve-429-error-es/
https://aws.amazon.com/cn/premiumsupport/knowledge-center/es-high-cpu-troubleshoot/
https://www.elastic.co/cn/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries
https://www.elastic.co/guide/cn/elasticsearch/guide/current/_monitoring_individual_nodes.html

Original: https://blog.51cto.com/leejia/2631971
