
How to analyze the ElasticSearch paging schemes


This article walks through the ElasticSearch paging schemes. The content is concise and easy to follow, and I hope you take something useful away from the detailed introduction below.

1: from + size shallow pagination

"shallow" paging is the simplest paging scheme. Es will extract from+size documents in each DataNode fragment according to the query criteria, then aggregate and sort them in MasterNode, and then intercept the size-from documents and return them to the caller. The lower the number of pages, that is, the larger the from+size, the larger the data that es needs to read, and the greater the amount of data processed during aggregation and sorting, which will increase the server CPU and memory consumption.

GET test_dev/_search
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 10,
  "from": 20,
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}

Here from defines the offset into the result set and size defines how many documents to return. The defaults are from=0 and size=10, so by default a query returns only the first 10 documents.

It is worth understanding how from/size works internally:

Because ES is shard-based, suppose there are five shards and from=100, size=10. Each shard returns its top from+size = 110 documents according to the sort, the coordinating node merges the 550 candidates, and the final 10 documents (positions 101-110) are taken from the merged list.

Testing confirms that the deeper the page, the slower the query: as from grows, so does the elapsed time, and the larger the data set, the more pronounced the effect.
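
To make the cost concrete, here is a hypothetical deep-page request against the same test_dev index; with 5 shards, every shard has to return its top from+size = 9990 + 10 = 10000 documents, so the coordinating node merges and sorts roughly 50,000 candidates just to hand back 10:

GET test_dev/_search
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 10,
  "from": 9990,
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}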

2: scroll deep pagination

from+size queries are fine for up to roughly 10,000-50,000 documents (1,000 to 5,000 pages of 10), but beyond that the deep-paging problem appears.

To solve this problem, Elasticsearch provides the scroll API.

Scroll is similar to a cursor in SQL. With scroll you can only fetch one page at a time; each call returns a scroll_id, and that scroll_id is used to keep fetching the next page, so scroll is not suitable for scenarios that need to jump to an arbitrary page.

GET test_dev/_search?scroll=5m
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 10,
  "from": 0,
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}

scroll=5m keeps the scroll context (and its scroll_id) alive for 5 minutes.

To use scroll, you must set from to 0.

size determines how many documents each subsequent _search/scroll call returns.

We can then read the next page via the _scroll_id returned in the response. Each request reads the next 10 documents, until the data is exhausted or the scroll_id's retention time expires:

GET _search/scroll
{
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAJZ9Fnk1d.",
  "scroll": "5m"
}

Note: this request goes to the _search/scroll endpoint rather than the index name, and both GET and POST can be used.
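
For reference, a heavily abbreviated (hypothetical) response to the queries above looks roughly like this; the _scroll_id field is what gets passed to the next _search/scroll call, while hits.hits carries the current page of documents:

{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAJZ9Fnk1d.",
  "hits": {
    "total": 1000,
    "hits": [ ...the 10 documents for this page... ]
  }
}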

Scroll deletion

According to the official documentation, a scroll's search context is cleared automatically once its retention time expires. But scroll contexts are expensive, so a good practice is to delete a scroll_id explicitly as soon as its data is no longer needed.

Clear the specified scroll_id:

DELETE _search/scroll/DnF1ZXJ5VGhlbkZldGNo.

Clear all scroll:

DELETE _search/scroll/_all

3: search_after deep pagination

The official recommendation is not to use scroll for real-time requests (it is usually used for data export), because every scroll_id not only consumes a lot of resources but also creates a historical snapshot, and changes to the data are not reflected in that snapshot.

search_after paging determines where the next page starts from the last document on the previous page. If index data is added, deleted, or updated while paging, those changes are reflected in the cursor in real time. Note, however, that because each page depends on the last document of the previous page, you cannot skip ahead to an arbitrary page.

To find the last document on each page, every document needs a globally unique value. The official recommendation is to use _uid as that value; in practice a business-level id also works.

GET test_dev/_search
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 20,
  "from": 0,
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}

from must be set to 0 to use search_after.

Here timestamp and _id together serve as the unique sort key.

We take the sort values from the last document returned and pass them to search_after.
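
For example, the last hit of the previous page might look like the abridged document below (values taken from the request that follows); the contents of its sort array are exactly what the next request passes as search_after:

{
  "_id": "d0xH6GYBBtbwbQSP0j1A",
  "_source": { ... },
  "sort": [1541495312521, "d0xH6GYBBtbwbQSP0j1A"]
}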

Use the value returned by sort to search the next page:

GET test_dev/_search
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 10,
  "from": 0,
  "search_after": [1541495312521, "d0xH6GYBBtbwbQSP0j1A"],
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}

4: modify the default paging limit value of 10000

The default deep-paging limit is controlled by the index setting index.max_result_window, and it can be raised as follows:

curl -XPUT http://127.0.0.1:9200/my_index/_settings -H 'Content-Type: application/json' -d '{"index": {"max_result_window": 500000}}'

Here my_index is the index to modify and 500000 is the new window size. After raising the window, from+size queries can page past the first 10,000 documents.
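
To confirm the change took effect, you can read the index settings back (same host and index as above); the response should show max_result_window set to 500000 under the index settings:

curl -XGET http://127.0.0.1:9200/my_index/_settings?pretty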

Caveats

Raising the window solves the immediate problem but introduces another one to watch out for: although more documents can now be requested per paging run, the cost is additional server memory and CPU. Consider whether heavy paging requests in your business scenario could trigger OutOfMemory errors on the cluster.

5: get the total amount of data

Raising the limit does let from+size reach later pages, but if you need to page through more than 10,000 documents you can split the work into two steps: first, use a scroll query to get the total amount of data; second, use from+size to query each page and build the paging on top of that. This addresses both from+size's inability to go past 10,000 documents and scroll's inability to skip pages.
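
A minimal sketch of this two-step approach, reusing the index and field names from the examples above: first issue a scroll query and read the total hit count (hits.total) from its response, then serve each requested page with an ordinary from+size query (deep pages require the max_result_window adjustment from section 4):

Step 1: get the total amount of data (read hits.total from the response)
GET test_dev/_search?scroll=5m
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 1,
  "from": 0
}

Step 2: query an arbitrary page with from+size, so page skips are possible
GET test_dev/_search
{
  "query": {"bool": {"filter": [{"term": {"age": 28}}]}},
  "size": 10,
  "from": 20000,
  "sort": [
    {"timestamp": {"order": "desc"}},
    {"_id": {"order": "desc"}}
  ]
}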

Problems that you may encounter when using scroll:

Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.

This error appears in the ES log file. Roughly, it means the attempt to create another scroll context failed because the total number of open scroll contexts is limited to 500; the search.max_open_scroll_context setting can be changed to raise that threshold.

Reason: as described under scroll deep paging, the ES server keeps a scroll context object in memory for each scroll_id, with an expiration time, and the scroll_id is used to fetch the next page when turning pages. By default an instance can hold at most 500 open scroll contexts, i.e. 500 scroll_ids. The error means a new scroll context could not be created because 500 of them already exist.
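
As a side note, one way to see how close you are to the limit (a sketch assuming the standard nodes stats API, whose search section exposes a scroll_current counter for open scroll contexts per node) is:

GET http://{{es-host}}/_nodes/stats/indices/search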

Solution:

1: Do nothing. After a while the scroll request can be issued again, because the earlier scroll contexts have outlived their lifetime and expired on their own.

2: Raise search.max_open_scroll_context as the error message suggests:

PUT http://{{es-host}}/_cluster/settings
{
  "persistent": {"search.max_open_scroll_context": 5000},
  "transient": {"search.max_open_scroll_context": 5000}
}


3: As soon as a scroll_id has served its purpose, call the delete API to remove the scroll context.

Delete a single scroll

DELETE http://{{es-host}}/_search/scroll
{
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB"
}

Delete all scroll

DELETE http://{{es-host}}/_search/scroll/_all

That covers the ElasticSearch paging schemes. Hopefully you have picked up something useful from them.
