Preface
Over the past year we built a log search platform on the ELK stack and a distributed tracing system, both backed by Elasticsearch. In these systems Elasticsearch is the underlying data store, and the data volumes run to hundreds of millions of documents, in some cases tens of billions.
So I took some spare time to organize my notes on how to tune Elasticsearch performance, in the hope that they will help anyone interested in Elasticsearch.
Background
Elasticsearch is a Lucene-based search server. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is written in Java, released as open source under the Apache license, and is currently a popular enterprise search engine. Designed for the cloud, it offers real-time search and is stable, reliable, fast, and easy to install and use.
Although it works out of the box, that alone does not guarantee performance and stability once it is put into production. In practice there are many techniques for improving the performance of the service based on the actual workload.
Next, I will cover performance optimization of the service from three aspects:
Index efficiency optimization
Query efficiency optimization
JVM configuration optimization
Index efficiency optimization
Index optimization mainly targets Elasticsearch's write path. If the bottleneck is not there but in the part that produces the data, such as the DB or Hadoop, then the optimization effort should be directed elsewhere. At the same time, Elasticsearch's own indexing speed is actually quite fast; for concrete numbers, refer to the official benchmark data.
Batch submission
When a large amount of data is submitted, batch submission is recommended.
For example, in our ELK pipeline, the Logstash indexer submits data to Elasticsearch, and the batch size is a useful tuning point. The optimal batch size, however, depends on document size and server performance.
For example, if the payload submitted by Logstash exceeds 20MB, Logstash splits the batch request into multiple batch requests.
If an EsRejectedExecutionException is thrown during submission, the cluster's indexing capacity has reached its limit. In that case, either add resources to the cluster or reduce the rate of data collection according to business rules, for example by collecting only logs at or above the Warn or Error level.
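As a point of reference, here is a minimal sketch of bulk submission using the official Python client's bulk helper; the cluster address, index name, document shape, and batch size are illustrative assumptions rather than values from the original ELK setup.

# A minimal sketch of bulk indexing with the official elasticsearch Python client.
# Cluster address, index name, document fields, and chunk_size are illustrative only.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

def log_actions(logs):
    # Yield one bulk action per log record instead of indexing documents one at a time.
    for log in logs:
        yield {"_index": "app-logs", "_source": log}

logs = [{"level": "ERROR", "message": "sample log %d" % i} for i in range(10000)]

# helpers.bulk groups the actions into bulk requests; chunk_size is the batch size
# and should be tuned against document size and server capacity.
helpers.bulk(es, log_actions(logs), chunk_size=1000)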
Optimize hardware
Optimizing hardware equipment has always been the most rapid and effective means.
Use solid-state drives (SSD) if the budget allows. Compared with spinning disks, SSDs are a large improvement for both random and sequential writes.
Use RAID 0. Elasticsearch already provides redundancy at its own level through replicas, so there is no need to rely on disk-level redundancy, and using mirrored or parity RAID would significantly slow down writes.
Increase the refresh interval
To improve indexing performance, Elasticsearch uses a delayed write strategy: data is first written to memory, and by default a refresh is performed once every 1 second (index.refresh_interval), flushing the in-memory segment data to the operating system's file cache, at which point the data becomes searchable. This is why Elasticsearch provides near-real-time search rather than real-time search.
Of course, if our internal systems do not require very low data latency, we can effectively reduce segment-merging pressure and improve indexing speed by extending the refresh interval. In our full-link tracing system, we set index.refresh_interval to 30s to reduce the number of refreshes.
Likewise, when doing a full index build, you can temporarily disable refresh by setting index.refresh_interval to -1, and switch back to the normal value, such as 30s, once the data import has completed.
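For illustration, the sketch below turns refresh off before a bulk import and restores a 30s interval afterwards, via the Python client; the index name is a placeholder, and the keyword used for the settings body differs between client versions.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Disable refresh while doing the full import (index name is a placeholder).
es.indices.put_settings(index="app-logs", settings={"index": {"refresh_interval": "-1"}})

# ... run the bulk import here ...

# Restore a relaxed 30s refresh interval once the import has finished.
es.indices.put_settings(index="app-logs", settings={"index": {"refresh_interval": "30s"}})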
Reduce the number of replicas
A higher replica count improves the availability of the cluster and increases search concurrency, but it also hurts indexing efficiency.
During indexing, each updated document has to be sent to the replica nodes, and the request only returns once the replicas have applied it. For business-facing search, we still recommend keeping the replica count at 3, but for internal systems such as our ELK log platform and distributed tracing system, a replica count of 1 is enough.
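A small sketch of lowering the replica count on an internal index is shown below; number_of_replicas is a dynamic setting, so it can be changed on a live index, and the index name here is again a placeholder.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# number_of_replicas is dynamic, so an internal logging index (placeholder name)
# can be dropped to a single replica to speed up writes at the cost of redundancy.
es.indices.put_settings(index="app-logs", settings={"index": {"number_of_replicas": 1}})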
Query efficiency optimization
Routing
When we query a document, how does Elasticsearch know which shard a document should be stored in? It is actually calculated by the following formula
shard = hash(routing) % number_of_primary_shards
The default value of routing is the id of the document, or you can use a custom value, such as user id.
Query without routing
When querying, because you don't know which shard the data to be queried is on, the whole process is divided into two steps.
Distribution: after the request reaches the coordinating node, that node distributes the query to every shard.
Aggregation: the coordinating node collects the results from each shard, sorts them, and returns them to the user.
Query with routing
When routing information is provided, the query can be sent directly to the shard it points to; there is no need to query every shard and have the coordinating node sort across all of them.
For the user example above, if routing is set to the user id, queries for that user go straight to the right shard, which is far more efficient.
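The sketch below illustrates custom routing with a hypothetical user id, using the 8.x-style Python client: the document is indexed with routing set to the user id, and the query passes the same routing value so that only that shard is searched. The index and field names are invented for the example.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

user_id = "42"  # hypothetical user id used as the routing value

# Index the document with routing=user_id so it lands on a deterministic shard.
es.index(index="orders", id="order-1001", routing=user_id,
         document={"user_id": user_id, "amount": 99.5})

# Query with the same routing value; only the shard holding this user's data is searched.
resp = es.search(index="orders", routing=user_id,
                 query={"term": {"user_id": user_id}})
print(resp["hits"]["total"])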
Filter VS Query
eBay once shared its experience with Elasticsearch and advised:
Use filter context instead of query context if possible.
Query: how well does this document match the query clause?
Filter: does this document match the query clause?
For a Filter, Elasticsearch only needs to answer "yes" or "no"; it does not compute a relevance score as it does for a Query, and filter results can be cached.
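As an illustration of this point, here is a sketch of a bool query that keeps the scoring clause in must and moves the pure yes/no conditions into the filter context; the index and field names are invented for the example.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The match clause scores documents; the level and time-range conditions are pure
# yes/no checks, so they go into the filter context, where no relevance score is
# computed and the results can be cached.
resp = es.search(index="app-logs", query={
    "bool": {
        "must": [
            {"match": {"message": "timeout"}}
        ],
        "filter": [
            {"term": {"level": "ERROR"}},
            {"range": {"@timestamp": {"gte": "now-1d"}}}
        ]
    }
})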
Paging
When using Elasticsearch, try to avoid deep paging.
A normal paged query fetches Size results starting at offset From, so each shard has to return its top From + Size results. The coordinating node collects the top From + Size results from every shard; with N shards it receives N * (From + Size) results in total, sorts them, and then returns the slice from From to From + Size.
If From or Size is very large, the number of documents taking part in the sort grows accordingly, which ultimately drives up CPU consumption.
This can be solved by using Elasticsearch's scroll and scroll-scan features for efficient scrolling. For more information, see Elasticsearch: The Definitive Guide, the scroll query chapter.
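For deep traversal, a sketch using the Python client's scan helper, which wraps the scroll API, is shown below; the index name and query are placeholders.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# helpers.scan wraps the scroll API: it keeps a scroll context open and streams
# matching documents in batches instead of paging with from/size.
for hit in helpers.scan(es,
                        index="app-logs",  # placeholder index
                        query={"query": {"term": {"level": "ERROR"}}},
                        size=1000,         # batch size per scroll round
                        scroll="2m"):      # how long to keep the scroll context alive
    doc = hit["_source"]  # handle each document here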
JVM configuration optimization
The 32 GB phenomenon
The heap memory set by default for Elasticsearch after installation is 1 GB. This setting is too small for any business deployment.
For example, if the machine has 64 gigabytes of memory, should we set it as large as possible?
Actually, it's not.
This is mainly because Elasticsearch is built on Lucene, and Lucene is designed to rely on the operating system's underlying mechanisms to cache its data structures. Lucene's segments are each stored in individual files; because segments are immutable, these files never change, which makes them very cache-friendly, and the operating system keeps these segments in its cache for faster access.
If you allocate all the memory to Elasticsearch's heap memory, there will be no remaining memory for Lucene. This will seriously affect the performance of full-text retrieval.
The standard recommendation is to use 50% of the available memory as Elasticsearch heap memory and retain the remaining 50%. Of course, it won't be wasted, and Lucene will be happy to use the rest of the memory.
At the same time, anyone who has studied ES has probably heard the advice "do not exceed 32G".
The main reason is that the JVM uses a pointer-compression technique (compressed ordinary object pointers) when the heap is smaller than 32 GB.
In Java, all objects are allocated on the heap and referenced by pointers. Ordinary object pointers (OOPs) point to these objects and are normally the size of a CPU word: 32 bits or 64 bits, depending on your processor. The pointer references the exact byte location of the value.
For a 32-bit system, this means the heap can be at most 4 GB. On a 64-bit system you can use much more memory, but 64-bit pointers mean more waste, because the pointers themselves are larger. Worse still, larger pointers consume more bandwidth when moving data between main memory and the various levels of cache (LLC, L1, and so on).
So in the end, we all settle on a 31 GB heap.
-Xms31g -Xmx31g
If you have a machine with 128 GB of memory, you can run two nodes on it, each with a heap of no more than 32 GB. In other words, no more than 64 GB in total goes to the Elasticsearch heaps, and the remaining 64 GB or more is left to Lucene.
For reference, see Elasticsearch: The Definitive Guide - Heap: Sizing and Swapping, and Several Questions about Elasticsearch Performance Optimization.
Aside
I have started a column on my 51CTO blog, "Take You to Play with High Availability". I hope it can help practitioners who already have some knowledge of distributed system architecture and are striving toward more advanced architecture, so that they can improve architecture availability and reach the goal of high availability.