The method of optimizing ElasticSearch Index data

2025-04-22 Update From: SLTechnology News&Howtos (Shulou.com)

This article introduces the ElasticSearch index data optimization method in detail. The steps are laid out clearly, and we hope the article helps answer your questions as you follow along.

1. Index data optimization

A search engine (taking ES as an example) first breaks a query down into its smallest unit conditions, looks up the inverted index for each unit condition to get a unit result set, and then intersects all the unit result sets to produce the final result. In other words, a query that appears to return only 10 records may have intermediate results of (1,000,000) ∩ (1,000,000) ∩ (1,000,000) = 10. So even though few records match, query performance can still be poor.
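The effect can be seen in a toy model. The following is a minimal sketch, not how Lucene actually stores or intersects postings: the terms, the ID ranges, and the helper names are all made up for illustration, and the sets are scaled down from the numbers above.

```python
# Toy inverted index: term -> set of matching document IDs.
# Terms and ID ranges are illustrative assumptions, not data from the article.
def build_postings():
    return {
        "color:red": set(range(0, 100_000)),
        "size:large": set(range(99_990, 200_000)),
        "in_stock:true": set(range(99_980, 100_000)) | set(range(300_000, 400_000)),
    }


def intersect(postings, terms):
    """Intersect posting sets smallest-first and record each intermediate size."""
    ordered = sorted((postings[t] for t in terms), key=len)
    result = ordered[0]
    sizes = [len(result)]
    for ids in ordered[1:]:
        result = result & ids
        sizes.append(len(result))
    return result, sizes


postings = build_postings()
hits, sizes = intersect(postings, ["color:red", "size:large", "in_stock:true"])
# Every condition matches about 100,000 documents, yet only 10 survive.
print(len(hits), sizes)
```

Each set has to be walked even though almost nothing survives the intersection, which is why shrinking the per-condition sets matters more than the size of the final result.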

To optimize performance in this situation, we need to keep the intermediate result sets as small as possible so that the intersection takes as little time as possible:

Hot and cold isolation

Looking up the inverted index is something the search engine must do for every query. The smaller the result set (ID set) returned for a single condition, the less time the loop that computes the intersection takes. Generally speaking, then, query time grows with the amount of index data.

As the index grows, the 80/20 rule applies: roughly 80% of queries fall on the hottest 20% of the data. Putting that 20% in a separate hot index effectively shrinks the result set of each single condition and improves query performance.

ElasticSearch also uses caches, such as fielddata, to improve sorting performance. If a query hits a cold field that is not yet cached, the system automatically loads the field's contents (fielddata) into memory, so for cold queries, queries with sorting are usually much slower than ordinary ones. With hot and cold isolation in place, the fielddata loading time for a cold query that lands on the hot index is greatly reduced, and even cold queries can basically meet low response time (RT) requirements.
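One way to picture the split is a small routing helper in the application layer. This is only a sketch under stated assumptions: the index names, the 30-day window, and the "last active" criterion are hypothetical, and the article does not say how hotness is decided.

```python
from datetime import datetime, timedelta

# Hypothetical index names and hot window; neither comes from the article.
HOT_INDEX = "products_hot"
COLD_INDEX = "products_cold"
HOT_WINDOW = timedelta(days=30)


def index_for_document(last_active, now):
    """Send a document to the hot index if it was active recently."""
    return HOT_INDEX if now - last_active <= HOT_WINDOW else COLD_INDEX


def indexes_for_query(include_cold):
    """Most queries search only the small hot index; the cold index is opt-in."""
    return [HOT_INDEX, COLD_INDEX] if include_cold else [HOT_INDEX]


now = datetime(2025, 4, 22)
print(index_for_document(datetime(2025, 4, 10), now))
print(index_for_document(datetime(2024, 1, 1), now))
print(indexes_for_query(include_cold=False))
```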

Horizontal split

Sometimes hot and cold isolation does not fit the business well. Take in-store search: products are edited in large numbers, items switch between hot and cold frequently, and 80% of stores have only a small number of products.

An obvious feature of this kind of data is that every query carries a store attribute, that is, a query only ever needs the data of a single store. In that case the index can be split horizontally, dividing all product data into n sub-indexes along the store dimension.

A query that originally had to load the field data (fielddata) of the whole index now only needs to load the field data of the one sub-index that holds the store (about 1/n of the total), so the resources consumed can drop sharply, by orders of magnitude when n is large. In addition, the result set from matching a single condition against the inverted index also shrinks to about 1/n of its original size, since the data of many other stores (useless for this query) is filtered out up front.

Of course, the split strategy depends on the specific business; for example, the index can also be split by time range.

Vertical splitting, by contrast, is not used, because search engines have no way to do online join operations. A join would require intersecting data from different indexes, and if the span is large or the query has sorting conditions, a cross-index query is basically unsolvable.
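A split by store could be sketched as a stable hash from store ID to sub-index name. The prefix `goods`, the value n = 16, and the use of CRC32 are assumptions chosen for illustration; any stable hash would serve.

```python
import zlib

N_SUB_INDEXES = 16      # assumed value of n
INDEX_PREFIX = "goods"  # hypothetical index name prefix


def sub_index_for_store(store_id, n=N_SUB_INDEXES):
    """Map a store ID to one of n sub-indexes.

    crc32 is stable across processes, unlike Python's built-in hash() on str,
    so the same store always lands in the same sub-index.
    """
    bucket = zlib.crc32(store_id.encode("utf-8")) % n
    return f"{INDEX_PREFIX}_{bucket:02d}"


# Both writes and in-store queries for this store target a single sub-index.
print(sub_index_for_store("store-42"))
```

Note that changing n later moves stores between sub-indexes, so under this scheme n would have to be chosen up front or the data reindexed.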

Engine configuration

Configuration tuning is generally the first step in search engine performance optimization. It covers two aspects, server configuration and index configuration:

Server configuration

Lucene-based search engines run on the JVM, so appropriate JVM startup parameters have an important effect on performance, especially when the allocated heap is very large. Listed here are some of the JVM parameters we currently use. The idea is to have garbage collected during young GC (YGC) as far as possible, and to keep temporary large objects from entering the old generation (optimizing queries so that fewer such objects are created is also part of this, and is discussed below):

-XX:MaxGCPauseMillis=2000

-XX:+PrintGCDateStamps

-XX:+G1PrintHeapRegions

-XX:+UnlockDiagnosticVMOptions

-XX:+UnlockExperimentalVMOptions

-XX:+PrintAdaptiveSizePolicy

-XX:G1HeapRegionSize=32m

-XX:G1ReservePercent=15

-XX:InitiatingHeapOccupancyPercent=60

Once the cluster has grown, each node also needs to be assigned a single role: master, data, or client.

Cluster-wide tasks such as state synchronization and the election process go to the master nodes. Tasks such as result aggregation (high memory overhead) and client connection handling (HTTP, which may involve creating and destroying a large number of short-lived connections) go to the client nodes. This keeps the load on the data nodes, which execute queries and indexing, as low as possible and improves service performance.

For ElasticSearch, the cache and circuit breaker configuration also needs to be adjusted to the business scenario. For example, with many writes, few reads, and large indexes, you can reduce the filter cache and increase the fielddata cache. The aim is to keep field contents loaded in memory as much as possible: cold loading of fielddata is expensive, and needless fielddata eviction adds to the GC burden.

With many reads, few writes, and small indexes, you can reduce the fielddata cache and increase the share given to the filter cache, which raises the cache reuse rate.

It is best to set the breaker limits as proportions as well, so that the caches crowd each other in heap memory as little as possible and do not add to the GC burden.
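The two scenarios can be written down as settings profiles. This is a rough sketch: the percentages are invented for illustration, and the setting names (`indices.fielddata.cache.size`, `indices.queries.cache.size`, `indices.breaker.fielddata.limit`) follow recent Elasticsearch versions; older releases name the filter cache setting differently, so check the documentation for the version in use. The sketch keeps the fielddata breaker limit above the fielddata cache size, on the assumption that eviction should get a chance to run before the breaker trips.

```python
def cache_settings(workload):
    """Build an illustrative cache/breaker profile for a workload type.

    The percentages are assumptions for this sketch, not recommendations.
    """
    if workload == "write_heavy":    # many writes, few reads, large indexes
        fielddata_cache, query_cache = 40, 5
    elif workload == "read_heavy":   # many reads, few writes, small indexes
        fielddata_cache, query_cache = 20, 20
    else:
        raise ValueError(f"unknown workload: {workload!r}")

    # Keep the breaker limit above the cache size so eviction can happen first.
    fielddata_breaker = fielddata_cache + 10
    return {
        "indices.fielddata.cache.size": f"{fielddata_cache}%",
        "indices.queries.cache.size": f"{query_cache}%",
        "indices.breaker.fielddata.limit": f"{fielddata_breaker}%",
    }


for name, value in cache_settings("write_heavy").items():
    print(f"{name}: {value}")
```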

Index configuration

Index configuration is relatively flexible and fine-grained. A query against an index sees a snapshot of the data at a certain point in time. Operations on the index become visible only after the index searcher reloads the index files (reopens). The time between two reopens of the index searcher is called the refresh interval (refresh_interval).

Note that reloading the index files (reopening the index searcher) is very expensive, which is why most search engines offer near-real-time rather than real-time search: it reduces the number of reloads and the system load. In one case, raising an index's refresh interval from 1s to 5s brought the overall search response time down from 200ms to 20ms.
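The change in that case maps to a single index setting. The sketch below only builds the request for the update index settings API and does not send it; the index name `goods` and the helper name are hypothetical.

```python
import json


def refresh_settings_request(index_name, interval="5s"):
    """Return (method, path, body) for updating an index's refresh interval."""
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


method, path, body = refresh_settings_request("goods", "5s")
print(method, path)
print(body)
```

The trade-off is visibility: with a 5s interval, a newly indexed document can take up to about 5 seconds to appear in search results.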

Field configuration is one aspect of index configuration. In short: if a field does not need to be indexed, do not index it; if it does not need to be stored in the engine, do not store it; and avoid large areas of sparsely populated data. The aim is to cut resource consumption and shrink the index files, which improves memory utilization and shortens merge time (index files need to be merged regularly to clean up fragmented files).
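In mapping terms, this comes down to a few per-field switches. The sketch below assumes the boolean `index` syntax of ES 5.x and later; the field names are invented, and `doc_values` is an extra knob not mentioned in the article (it is skipped for `text` fields, which do not accept it).

```python
import json


def field_mapping(field_type, searchable=True, sortable=True, stored=False):
    """Build one field's mapping, switching off whatever is not needed."""
    mapping = {"type": field_type, "index": searchable, "store": stored}
    if field_type != "text":
        # doc_values back sorting and aggregations on non-text fields.
        mapping["doc_values"] = sortable
    return mapping


# Hypothetical product fields.
properties = {
    "title": field_mapping("text"),
    "price": field_mapping("double"),
    # Only ever returned from _source: never searched, never sorted on.
    "thumbnail_url": field_mapping("keyword", searchable=False, sortable=False),
}
print(json.dumps({"properties": properties}, indent=2))
```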

Routing can also be specified for a query so that it goes directly to a specific shard instead of collecting data from all shards, which reduces waiting time.
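Conceptually, the shard is chosen by hashing the routing value and taking it modulo the number of primary shards. The sketch below uses CRC32 purely as a stand-in, since Elasticsearch uses its own Murmur3-based hash, so the shard numbers printed here will not match a real cluster. The index name, routing value, and helper names are hypothetical; `routing` is the query-string parameter of the search API.

```python
import zlib
from urllib.parse import quote


def shard_for(routing_value, num_primary_shards):
    """Illustrative shard selection: hash(routing) % number of primary shards.

    crc32 is a stand-in for the hash Elasticsearch really uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


def routed_search(index_name, routing_value, query):
    """Return (method, path, body) for a search limited to one routing value."""
    path = f"/{index_name}/_search?routing={quote(routing_value, safe='')}"
    return "POST", path, {"query": query}


print(shard_for("store-42", 5))
print(routed_search("goods", "store-42", {"term": {"store_id": "store-42"}}))
```

Documents must be indexed with the same routing value for this to work; otherwise the routed query looks on a shard that does not hold them.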

Up to version 5.x, ES still allows an index to contain multiple types. In fact, multiple types of the same index are physically stored in the same index file directory, that is, they share the same set of index files and are distinguished only by the hidden _uid/_type fields.

This is where the problem arises: if one type holds far more data than the others, it becomes the performance bottleneck for the other types (merges are affected, and if the fields differ, the resulting sparse data wastes valuable memory).

Therefore, in production we forbid an index from containing multiple types. The announcements for ES 6.x also indicate that a default type will be used in version 7.0, and multiple types in the same index will no longer be allowed.

Incidentally, multiple types also restrict field mapping: fields with the same name must use the same data type.
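The same-name restriction can be checked before mappings are submitted. This is a small sketch with an invented input shape (type name -> field name -> data type) and invented type and field names; it is not an Elasticsearch API.

```python
def conflicting_fields(type_mappings):
    """Find fields that share a name across types but differ in data type.

    type_mappings: {type_name: {field_name: data_type}} (hypothetical shape).
    Returns {field_name: {type_name: data_type}} for each conflicting field.
    """
    seen = {}
    for type_name, fields in type_mappings.items():
        for field, data_type in fields.items():
            seen.setdefault(field, {})[type_name] = data_type
    return {
        field: by_type
        for field, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }


mappings = {
    "product": {"id": "keyword", "price": "double", "title": "text"},
    "review": {"id": "long", "title": "text", "stars": "integer"},
}
# "id" is keyword in one type and long in the other, so it is reported.
print(conflicting_fields(mappings))
```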

