How should we analyze latency spikes in Elasticsearch queries? This article goes through the possible causes in detail and offers corresponding analysis and solutions, in the hope of giving readers a simple and practical way to approach the problem.
For a business that is sensitive to query latency, spikes in Elasticsearch query latency are a vexing problem. By the time you notice a spike the moment has passed; it cannot be reproduced reliably, which makes root-cause analysis difficult and rules out the usual systematic debugging approach of reasoning step by step from symptom to cause. Usually all we can do is look at the monitoring metrics for that point in time, and in ES there are many factors that can make query latency fluctuate. Below we list the likely factors and, for each one, try to give a way to locate and address it.
A system usually runs many different queries concurrently, and their normal latencies can differ widely, so even in an ideal state the query-latency curve may fluctuate noticeably, especially when the query conditions are not fixed and some queries are inherently slow. Here we only discuss the case where a particular query statement produces a large latency at some point in time, that is, a query that should not normally take that long.
In addition, the query caches at the ES and Lucene levels are only optimizations; caching by itself does not guarantee query latency.
The influence of GC
GC is one of the common factors affecting query latency. A query is forwarded to all of the shards involved, so a long GC pause on any one of those nodes makes the whole query take longer.
Positioning method:
Check the GC metrics of each node at the corresponding point in time, using Kibana monitoring or the GC log.
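For a quick check without Kibana, the node stats API also exposes the GC counters (jvm.gc.collectors.*.collection_count and collection_time_in_millis); a minimal sketch, assuming ES listens on localhost:9200:
curl -sXGET "http://localhost:9200/_nodes/stats/jvm?pretty"
Comparing two samples taken a short interval apart shows how much GC time accumulated in between.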
Solution:
Insufficient heap memory can have many causes: the configured JVM heap is too small, too many indices are open so the FST footprint is excessive (when offheap is not enabled), aggregations use a lot of memory, the netty layer uses a lot of memory, or the caches use a lot of memory. Based on your own workload, find out what is actually occupying the heap and then plan the JVM memory accordingly. You can analyze memory through the REST API or with MAT; for reference:
curl -sXGET "http://localhost:9200/_cat/nodes?h=name,port,segments.memory,segments.index_writer_memory,segments.version_map_memory,segments.fixed_bitset_memory,fielddata.memory_size,query_cache.memory_size,request_cache.memory_size&v"
HBase moves data off heap to shrink the JVM footprint and avoid full GC; similarly, ES significantly reduces its JVM footprint once the FST is moved off heap. However, an off-heap FST may be evicted by the operating system, and reading it back in generates io, which also makes query latency unstable, although the probability of this is small. In ES, operations such as aggregations and scroll can still occupy a lot of JVM memory, which adds to the uncertainty.
System cache eviction
Queries and aggregations need to read various files on disk. ES recommends reserving half of physical memory for the system cache (page cache). When the needed pages are not in the page cache, disk io occurs, which has a significant impact on query latency. When does the page cache get evicted? Many things use it: by default Linux caches most file reads and writes, so queries, log writing, segment files written during indexing, files read and written during merge, other programs deployed on the same node as ES, and scripts performing io all compete for the page cache. Linux evicts pages according to its own policies and thresholds, and the application layer has no control over which files stay cached.
So we need to understand a query statement's io requirements, mainly the following two questions:
Which files need to be read in real time during the query process?
How many times does a query take io? How many bytes are read? How long does it take?
Which files need to be read in real time during the query process
Querying in ES is a complex process, and different query types access different Lucene files. The files that common query types may access are summarized as follows:
In an actual query, not all of these files are read in real time. Some have already been loaded into memory when the index was opened, and some metadata files are parsed only once at open time. To verify whether the files actually accessed during search match expectations, I wrote a SystemTap script that hooks the read and pread system calls and prints each call. The sample data is the geonames index; for ease of demonstration the index was force-merged into a single segment and the store was set to niofs.
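If SystemTap is not available, a rough substitute is strace; this is a sketch of an alternative, not the author's original script, and the process id is a placeholder you must fill in:
strace -f -y -e trace=read,pread64 -p <es_pid> 2>&1 | grep indices
The -y flag resolves file descriptors to paths, so while the query runs you can see which Lucene files under the index directory are actually read.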
Query only, no fetch
A distributed search consists of two phases. When the request sets size=0, only the query phase is executed and no fetch is needed. For term or match queries, the query phase then generally needs only the inverted index. Take the following query as an example:
_search?size=0
{
"query": {
"match": {
"name": {
"query": "Farasi"
}
}
}
}
Only the tim file needs to be read, because the tip file is resident in memory. And since only the hit count needs to be returned when size=0, ES applies an early-termination optimization: the docFreq read from the tim file is used directly as the hit count, so the postings list does not need to be accessed at all.
However, when the query contains a post_filter, a custom terminate_after, and so on, this early termination does not apply. Likewise, when there are multiple query conditions, as in the example below, Lucene needs to intersect or merge the results of each field, so the postings lists must be read:
_search?size=0
{
"query": {
"bool": {
"must": {"match": {"name": "Farasi"}}
"must_not": {"match": {"feature_code": "CMP"}}
}
}
}
So the .doc file is read:
When size > 0, term queries and match queries read different files, so they are discussed separately below.
Term query, with fetch
With the fetch phase added, the files accessed by the query phase stay the same, but the fetch phase reads from the stored fields, because the _source field itself is kept in the stored fields.
_search?size=1
{
"query": {
"term": {
"country_code.raw": {
"value": "CO"
}
}
}
}
Therefore, the fdt and fdx files are accessed in addition to those of the query-only case.
Match query, with fetch
Because a match query needs to compute scores, it uses norms information, so it accesses the norms files on top of what the term query with fetch accesses.
_search?size=10
{
"query": {
"match": {
"name": {
"query": "Farasi"
}
}
}
}
In addition, the nvd file (norms) is read:
Numeric type query
Numeric fields are indexed with a BKD tree rather than an inverted index, so the query phase reads the point values. The fetch phase is the same as for a term query.
_search?size=0
{
"query": {
"range": {
"geonameid": {
"gte": 3682501
"lte": 3682504
}
}
}
}
The query process only needs to read the dim file:
Aggregation
Metric and bucket aggregations access the same files; when size=0, only the dvd file (doc values) needs to be read.
_search?size=0
{
"aggs": {
"name": {
"terms": {"field": "name.raw"}
}
}
}
The trace produced more than 30,000 read records; only a portion is shown here.
GET API
Fetching a single document with the GET API is not the same as the fetch phase of a search.
_doc/IrOMznAB5onF36XmwY4W
The result may be surprising:
The _id field is itself indexed. _id is an ES-level concept, not the docid in Lucene's inverted index, so a GET by _id first has to run a Lucene lookup (termsEnum.seekExact) to translate it into Lucene's internal numeric docid. That lookup naturally goes through the FST and reads the tim file.
The _source is then read from the stored fields using this docid, so the fdx and fdt files are read.
Finally, in addition to _source, the GET API also returns the document's metadata fields, including _version, _seq_no, and _primary_term; these are stored as doc values, so the dvd file is read as well.
In the two-phase search, by contrast, the docid returned by the query phase is already Lucene's internal numeric docid, so the fetch phase can retrieve the document directly.
How much io does a query take?
Having established which files are read at query time, we also need to know how much that io costs. To measure the io of an actual search, I wrote another SystemTap script that prints, for each file read by the query, the number of bytes read and the time spent:
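As a rough substitute for that script, strace can also report bytes and time per call; again a sketch, not the author's script, with the process id as a placeholder:
strace -f -T -e trace=read,pread64 -p <es_pid>
strace -f -c -e trace=read,pread64 -p <es_pid>
The first form appends the duration of every read (the return value is the number of bytes read); the second prints a summary of call counts and total time once you stop it.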
In order to observe the impact of queries on io, we need to eliminate some distractions (the corresponding commands are sketched after this list):
Testing without page cache:
Use vmtouch to evict the index's pages from the page cache
Execute _cache/clear to clear the caches at the ES level
Testing with page cache:
Execute _cache/clear to clear the caches at the ES level
Run the same query a second time and measure the second execution
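The commands for the two setups might look like the following; a minimal sketch, assuming the index data lives under /data/es and ES listens on localhost:9200:
vmtouch -e /data/es/nodes/0/indices/
curl -sXPOST "http://localhost:9200/_cache/clear"
vmtouch -e evicts the index files from the page cache for the cold-cache run; _cache/clear drops the ES-level query and request caches before both runs.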
In addition, the test environment is clean: a single node, no write operations, and no interference from unrelated processes. Statistics were then collected for several common query types, with the following results:
If you do not want to go through the whole table, here is the summary: most queries need only a small number of read calls and read only a small amount of data, but two cases need noticeably more io because the io scales with the amount of data:
When aggregating, the io required depends on the amount of data that participates in the aggregation.
For a range query on a numeric field, the required io depends on the size of the hit result set.
These two types of queries deserve special attention. Possible optimizations include narrowing the data set with query conditions before aggregating, and keeping the range of range queries as small as possible. Beyond that, there are two scenarios that need relatively more io, though an order of magnitude less than the above:
When querying with multiple conditions, the result sets of the individual fields have to be merged; with a large result set the doc file is read more times, dozens of times in this example.
Deep pagination, where the io depends on how much data has to be fetched, since, as shown above, even fetching a single document reads several files.
Conclusion: in query-only scenarios the doc and dim files may be accessed many times, and real business queries are usually complex, mixing a variety of query conditions, so when the disk is busy and the page cache misses, query latency can be significant.
The influence of FST offheap
The FST offheap is implemented by mmap-ing the tip files, moving the memory occupied by the FST from the heap into the page cache. Since it now lives in the page cache, it can be evicted, and reading it back in generates io, producing noticeable query latency. Put simply, the side effect of this offheap approach is that the FST may no longer be in memory at all.
Although the chance of the tip file being evicted from the page cache is small, as the cluster grows, rare events become inevitable.
Solution: implementing your own offheap is not difficult either; the FST lookup just jumps around within an array, so it is much simpler than HBase's offheap. If you do not want to change the code, keeping the tip files resident in the page cache, as discussed earlier, also works; a sketch follows.
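For example, a hedged sketch of pinning the tip files with vmtouch (the data path is an assumption):
find /data/es/nodes/0/indices/ -name "*.tip" | xargs vmtouch -t -l
The -t flag touches the pages into the page cache and -l locks them with mlock; the vmtouch process must stay alive to hold the lock (add -d to run it as a daemon).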
The impact of io on query latency
Since a search may need an uncontrollable amount of io, search latency can never be fully guaranteed. For example (a way to observe disk pressure is sketched after this list):
Index writes take up io; not much on average, but flushes cause momentary bursts.
Update operations consume more io than plain index operations.
Queries against a large shard may drive io utilization high.
There may be uneven load among the multiple disks of a single node.
Merge, recovery, and even cluster state updates all require io.
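A quick way to confirm disk pressure at the moment of a spike is per-device utilization; a minimal sketch, assuming the sysstat package is installed:
iostat -x 1
Watch the %util and await columns of the data disk while the latency spike occurs.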
Problems caused by disk io can be mitigated with SSDs or memory. HDFS divides storage into types such as RAM, SSD, and DISK and uses storage policies to control how replicas are distributed across media; ES has a similar mechanism:
The first is hot-cold separation at the index level, implemented with node.attr plus index-level allocation settings, so that hot indices are stored on SSD nodes while the write and query paths stay unchanged.
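A minimal sketch of how this can be wired up; the attribute name box_type and the index name hot_index are assumptions, not fixed names:
node.attr.box_type: hot
curl -sXPUT "http://localhost:9200/hot_index/_settings" -H 'Content-Type: application/json' -d '{"index.routing.allocation.require.box_type": "hot"}'
The first line goes into elasticsearch.yml on the SSD nodes; the second pins the hot index's shards to nodes carrying that attribute.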
However, data kept in memory is easily lost, so it is best to set wait_for_active_shards to all when writing, and to use preference to prefer reading from the hot nodes. If you simply want low-latency search, load all the Lucene files into memory.
The simplest approach is to let the Lucene files sit in the system cache, for example by warming them with vmtouch, although when they will be evicted is unpredictable. The page cache hit rate can be observed with cachestat, and this works for mmapfs as well.
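For reference, a hedged sketch of both observations (the data path and the bcc-tools install location are assumptions):
vmtouch -vt /data/es/nodes/0/indices/
/usr/share/bcc/tools/cachestat 1
The first command warms the index files and reports how much of them is resident in the page cache; the second prints the system-wide page cache hit/miss rate every second.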
Search queue backlog
If the client sends too many concurrent queries, the search thread pool fills up and requests wait in the queue, which also raises query latency.
Positioning method:
Kibana does not expose thread pool metrics, so you need to monitor them yourself.
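A quick manual check via the cat API; a minimal sketch, assuming ES listens on localhost:9200:
curl -sXGET "http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed"
A consistently non-zero queue or a growing rejected count at the time of the spike points to thread pool saturation.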
Solution:
Control the client's query concurrency. Note that if a single client request touches three shards on a data node, it occupies three search threads on that node. At present there is no metric for how long a request spends waiting in the queue.
Digression:
ES uses the max_concurrent_shard_requests parameter to limit how many shard-level requests a single query sends concurrently per node, preventing one request from exhausting the cluster's query resources. After the coordinating node builds the list of target shards for the request, it applies concurrency control according to max_concurrent_shard_requests, and shard requests beyond that limit are queued. This queue does not occupy the search queue, so even when concurrency is throttled this way, query latency is not affected by the search-queue factor discussed above.
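The parameter can be set per request; a minimal sketch, with the index name logs as an assumption:
curl -sXGET "http://localhost:9200/logs/_search?max_concurrent_shard_requests=3" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'
This lowers the per-node shard-request concurrency for this one query below the default.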
Summary
Lucene is not a system designed for low latency. Query latency spikes are mainly caused by GC and io. At the GC level, plan the JVM memory sensibly to avoid frequent GC and full GC; at the io level, consider SSDs, a RAM disk, or reserving enough page cache.