Shulou (SLTechnology News & Howtos) — updated 2025-04-04
How to Solve GC Problems Caused by Elasticsearch Scroll Queries

This article walks through a production incident in which Elasticsearch scroll queries triggered full GC, explains the two root causes, and describes the fixes.
Problem:
One afternoon, while I was out shopping, I received a full-GC phone alarm from a production Elasticsearch node, followed by a wave of request rejections and traffic jitter in the cluster hosting that node.
Troubleshooting:
Back home, I opened the monitoring page: memory usage had risen sharply. A server does not normally run out of memory for no reason, so the first thing to check is a change in read/write traffic.
The monitoring page showed no obvious jitter in overall inbound traffic. However, since the cluster holds different indices and serves different query types, aggregate traffic can mask problems, so I drilled down into the per-index, per-operation traffic. The scroll traffic of index A fluctuated noticeably at the time of the failure, rising from the normal 10 qps to a peak of 100 qps. That rate would be unremarkable for ordinary queries, which suggested something was wrong with the scroll queries specifically.
Cause 1:
Let's state the conclusion first: a scroll query costs far more memory than an ordinary query. For data-traversal scenarios, a safe rate is around 10 qps.
Compared with an ordinary query, a scroll query requires the backend to retain the context of the traversal. Specifically, when an init scroll request arrives, the index searcher holds handles to all index segments until the scroll ends. If this is not handled carefully (for example, with segment caches), it can easily consume a large amount of server-side memory. In addition, a scroll query must keep request context on the server, such as the paging depth and the scroll context itself, which also consumes memory.
In subsequent tests, a single-threaded client used scroll queries to traverse millions of index documents, and server-side CPU usage reached 70%. Looking at the process's CPU breakdown, most of the time was spent in GC, leaving the server without enough CPU to schedule other tasks, so normal read and write requests could not be served in time.
# Stress-test machine: 1 core / 2 GB x 1 node
# Index: number_of_shards = 5, number_of_replicas = 1, ~1.8 million documents
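The traversal used in the stress test can be sketched as a plain scroll loop. This is a minimal sketch, assuming hypothetical `search` and `scroll` callables in place of a real Elasticsearch client; the response shapes follow the standard scroll API.

```python
def scroll_all(search, scroll, index, query, page_size=200, keep_alive="1m"):
    """Traverse every hit of `query`, yielding one page of hits at a time.

    While the scroll is open the server pins the index segments and keeps
    per-request context alive -- the memory cost described above.
    `search` / `scroll` are stand-ins for an Elasticsearch client's calls.
    """
    page = search(index=index,
                  body={"query": query, "size": page_size},
                  scroll=keep_alive)
    while page["hits"]["hits"]:
        yield page["hits"]["hits"]
        # Each continuation only carries the scroll id; the server holds
        # everything else (segment handles, paging position) in memory.
        page = scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)
```

Every iteration extends the lifetime of the server-side context, which is why a handful of concurrent traversals can dominate heap usage.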
Cause 2:
Continuing the investigation into the query bodies executed via scroll, I found two main types.
One is:
{"query": {"bool": {"must": [{"terms": {"id": [11, 22, ..., 2003]}}]}}, "size": 200}
# the terms clause contains 200 ids
The example omits some other filter conditions; in plain terms, the query means:
Fetch from the index the 200 records whose id field is contained in the array.
Several features that can be seen are:
There is no filter clause; the terms condition sits in the must clause.
The query returns at most 200 records, so all of the data can be fetched in a single query.
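Since size already covers every id in the clause, the same data can come from one ordinary (non-scroll) request. A minimal sketch of building such a body, assuming the field is named `id` as described above:

```python
def terms_query(ids, field="id"):
    """Build an ordinary (non-scroll) query body for a fixed id list.

    With size == len(ids), a single search can return every possible hit,
    so there is no need to open a scroll context at all.
    """
    return {
        "query": {"bool": {"must": [{"terms": {field: ids}}]}},
        "size": len(ids),
    }
```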
Second:
{"query": {"bool": {"must": [{"range": {"create_time": {"gt": 0, "lte": 604800}}}, {"term": {"shop_id": 1}}]}}, "size": 200}
# the combined query matches about 1,000 documents
# the full index contains about 10 million documents
# the create_time bounds are not fixed, but the interval is fixed at one week
Some other interfering conditions are omitted here as well, leaving only the essentials. In plain terms:
From the 10-million-document index, fetch the documents with shop_id = 1 and create_time within the given interval, where the interval shifts every 10 seconds; that is, every 10 seconds the query asks for the data from the week preceding the current time.
Several conclusions can be drawn:
Size is 200, so at least 5 queries are needed to fetch all ~1,000 matching documents.
The create_time bounds change only slightly each time, e.g. (0, 604800] => (10, 604810], so the set of documents matched by the range sub-condition barely changes, and that sub-condition alone matches millions of documents.
There is no filter clause here either.
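The sliding window can be made concrete with a small sketch (epoch-second arithmetic; the 10-second cadence comes from the description above): two requests issued 10 seconds apart produce different range clauses, so a cached entry for one can never be hit by the next.

```python
WEEK = 7 * 24 * 3600  # 604800 seconds

def week_window(now):
    """Range clause for 'the week ending at `now`' (epoch seconds)."""
    return {"range": {"create_time": {"gt": now - WEEK, "lte": now}}}

# Ten seconds later the bounds have shifted, so the clause -- and hence any
# cache entry keyed on it -- is different, guaranteeing a cache miss:
assert week_window(604800) != week_window(604810)
```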
There is no clause that the official documentation describes as filter context, such as filter or must_not. Yet during the scroll incident the filter cache grew steadily from about 500 MB to 6 GB (the configured filter cache ceiling). In theory this should not happen, so the answer has to come from the code.
Tracing the query path shows that after the final rewrite there is no essential difference between must and filter inside a bool clause. Whether a clause can enter the filter cache depends on:
Whether the segment's maximum document count is within the threshold (Elasticsearch's filter cache is per segment)
Whether the query's occurrence frequency exceeds the threshold
For the frequency check, Lucene's caching policy also applies an isCostly test, which aims to cache expensive queries as early as possible to improve query performance. Queries that satisfy isCostly, including terms and range queries, are cached after appearing just twice. Putting this together:
The terms query does not need scroll at all; an ordinary query satisfies the requirement, and using scroll only adds server load.
The range query's repetition count reaches the isCostly threshold, so every traversal throws a cached result covering millions of documents into the filter cache, and the hit rate is close to zero (the start and end bounds of the next scroll query have shifted slightly), increasing the server's GC burden.
Solution:
From the analysis above, two factors caused the server to reject requests:
Heavy scroll concurrency
An ill-formed range request, which breaks down into:
High frequency: one request every 10 seconds
Fast-moving bounds: the start and end of each query shift by 10 seconds each time
A huge hit count: millions of matching documents
Addressing each of the points above in turn gives the solution:
Scroll request:
Correct the misuse of terms + scroll and switch to ordinary queries.
Replace scroll with search_after. Although it is somewhat less efficient, it has two advantages:
It is retryable; a retried scroll may lose data.
It consumes far fewer resources; in the same test environment, CPU usage was only about 10%.
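A minimal sketch of the search_after replacement, again with a hypothetical `search` callable in place of a real client. The `create_time` sort and `_id` tiebreaker are assumptions; any unique, totally ordered sort works. Because the cursor (the last sort values) lives on the client, a failed page can simply be re-requested:

```python
def search_after_pages(search, index, query, page_size=200):
    """Yield pages of hits using search_after instead of scroll.

    Each request is self-contained: the server keeps no scroll context,
    and retrying a page just means resending the same search_after values.
    """
    after = None
    while True:
        body = {"query": query, "size": page_size,
                "sort": [{"create_time": "asc"}, {"_id": "asc"}]}
        if after is not None:
            body["search_after"] = after
        hits = search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield hits
        after = hits[-1]["sort"]  # cursor lives on the client side
```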
Ill-formed range requests:
High frequency: reduce the request rate to at most once a minute. This is not a fundamental fix; it is better to move such data-traversal workloads to a store such as a DB or HBase.
Fast-moving bounds: the crude fix is to truncate the time bounds to the hour; a more elegant one:
Split the time condition into a coarse-grained part and a fine-grained part, with the coarse part in whole hours and the fine part down to minutes or seconds.
Execute the fine-grained condition as a script query. The principle: the filter cache counts query frequency in a LinkedHashMap keyed by the query itself. An ordinary query's hash is computed from its conditions and values, while a script query's hash is based on the identity of the current instance, so its count never accumulates (the hashcode differs on every request) and it is never cached.
Large hit count: the coarse/fine split above also reduces this cost.
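The coarse/fine split can be sketched like this (a sketch of the approach, not the original code; the script source and field name are assumptions). The coarse range is aligned to the hour, so it repeats verbatim across requests and caches well, while the exact bounds move into a script clause that, per the caching behavior described above, never accumulates in the filter cache:

```python
HOUR = 3600

def split_time_range(gt, lte, field="create_time"):
    """Split (gt, lte] into a cache-friendly coarse range + exact script."""
    coarse = {"range": {field: {
        "gt": (gt // HOUR) * HOUR,        # round down to the hour
        "lte": -(-lte // HOUR) * HOUR,    # round up to the hour
    }}}
    fine = {"script": {"script": {
        "source": (f"doc['{field}'].value > params.gt"
                   f" && doc['{field}'].value <= params.lte"),
        "params": {"gt": gt, "lte": lte},
    }}}
    # coarse is a superset of the exact interval; fine trims it precisely
    return {"bool": {"must": [coarse, fine]}}
```

Because the coarse clause is identical for every request within the same hour, the cached entry is actually reused, instead of a new multi-million-document entry being created every 10 seconds.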