What is the Reindex performance optimization method?

2025-04-04 Update From: SLTechnology News&Howtos

This article explains why Elasticsearch Reindex operations can be slow and what you can do to speed them up.

Elasticsearch 5.x and later provide the Reindex API, which rebuilds data directly inside the Elasticsearch cluster. If an index has to be rebuilt because its mapping or its settings changed, Reindex is a convenient way to do it. It can run asynchronously and also supports migrating data between clusters.

Why Reindex is slow

At its core, Reindex migrates data across indexes and clusters. The causes of slowness, and the corresponding ideas for optimization, include:

The batch size may be too small (the default is 1000 documents).

Reindex is implemented on top of scroll, so parallelizing the scroll improves efficiency.

Copying data across indexes or clusters is ultimately a write workload, so write-side optimizations improve efficiency.

Ways to speed up Reindex

1. Increase the bulk write size

By default, _reindex processes documents in batches of 1000. You can change the batch size with the size parameter in source.
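The size parameter counts documents, while bulk tuning is usually reasoned about in bytes, so it can help to derive the document count from an average document size. The following is a minimal Python sketch, not part of the original article: the helper names, the 10 MB default target, and the 10000-document cap are assumptions made for illustration.

```python
import json


def docs_per_batch(avg_doc_bytes, target_mb=10.0, max_docs=10000):
    """Return a document count whose total size is roughly target_mb.

    max_docs is an assumed safety cap (scroll batches are normally
    bounded by index.max_result_window); adjust it for your cluster.
    """
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    target_bytes = int(target_mb * 1024 * 1024)
    return max(1, min(max_docs, target_bytes // avg_doc_bytes))


def reindex_body(source_index, dest_index, batch_size):
    """Build a _reindex request body with an explicit batch size."""
    return {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    # Hypothetical index names and a measured average of 2 KB per document.
    size = docs_per_batch(avg_doc_bytes=2048, target_mb=10)
    print(json.dumps(reindex_body("source", "dest", size)))
```

Treat the result only as a starting value and measure throughput before raising it further.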

POST _reindex
{"source": {"index": "source", "size": 5000}, "dest": {"index": "dest", "routing": "=cat"}}

The bulk size matters because Reindex writes with bulk index requests. The best batch size depends on the data, the analysis, and the cluster configuration, but a good starting point is 5-15 MB per batch. Note that this is the physical size; the number of documents is not a good indicator of batch size. For example, if you index 1000 documents per batch:

1) 1000 documents of 1 KB each total about 1 MB.

2) 1000 documents of 100 KB each total about 100 MB.

These are completely different sizes. Tune by gradually increasing the batch size: start with bulk requests of about 5-15 MB and increase slowly until performance stops improving. Then start increasing the concurrency of bulk writes (multiple threads, and so on).

Use Kibana, Cerebro, or tools such as iostat, top, and ps to monitor the nodes and see when resources become a bottleneck. If you start receiving EsRejectedExecutionException, the cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (for example, switch from spinning disks to SSDs), or add more nodes.

2. Set the number of replicas to 0

If you are doing a bulk import, consider disabling replicas by setting index.number_of_replicas to 0.

The main reason is that when a document is replicated, the entire document is sent to the replica node and the indexing process is repeated there in full. Each replica therefore performs its own analysis, indexing, and potentially merging. If you instead index with zero replicas and enable replicas once the import is complete, the recovery process is essentially a byte-for-byte network transfer, which is more efficient than repeating the indexing process.

PUT /my_logs/_settings
{"number_of_replicas": 0}

For example:

PUT /regroupmembers-20.11.12-151612/_settings
{"number_of_replicas": 0}

In a test environment, reindexing 920000 documents took 85 minutes with the normal settings and 30 minutes after removing the replica shards.

3. Use sliced scroll to improve write efficiency

Reindex supports Sliced Scroll to parallelize the re-indexing process. This parallelization improves efficiency and provides a convenient way to break down requests into smaller parts.

How slicing works (from medcl)

Have you used the Scroll API and found it slow? With a large amount of data, traversing it with a single Scroll can be unacceptably slow. Scroll requests can now be issued concurrently to traverse the data.

Each Scroll request can be split into multiple Slice requests. Each slice runs independently and in parallel, which makes rebuilding or traversing an index many times faster than a single Scroll.

Examples of slicing usage

There are two ways to set up slicing: manual slicing and automatic slicing.

For manual slicing, see the official documentation.
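As a rough illustration of manual slicing, each slice is a separate _reindex request whose source carries a slice object with an id and a max. The Python sketch below only builds those request bodies; the function name and the index names are hypothetical, and sending each body as its own POST _reindex request is left to whatever client you use.

```python
import json


def manual_slice_bodies(source_index, dest_index, num_slices):
    """Build one _reindex body per slice, with ids 0..num_slices-1."""
    if num_slices < 2:
        # A single slice is just an ordinary, unsliced request.
        raise ValueError("num_slices must be at least 2")
    return [
        {
            "source": {
                "index": source_index,
                "slice": {"id": slice_id, "max": num_slices},
            },
            "dest": {"index": dest_index},
        }
        for slice_id in range(num_slices)
    ]


if __name__ == "__main__":
    # Hypothetical index names; each printed body is one request.
    for body in manual_slice_bodies("twitter", "new_twitter", 2):
        print(json.dumps(body))
```

The requests can then run in parallel, for example one per worker thread.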

Automatic slicing looks like this:

POST _reindex?slices=5&refresh
{"source": {"index": "twitter"}, "dest": {"index": "new_twitter"}}

Notes on choosing the number of slices:

1) The number of slices can be set manually or set to auto. With auto, a single index gets slices = number of shards; with multiple indexes, slices = the smallest shard count among them.

2) Query performance is most efficient when the number of slices equals the number of shards in the index. Using more slices than shards does not improve efficiency and adds overhead.

3) If the number of slices is large (for example, 500), choose a lower number, because too many slices hurts performance.

4. Increase the refresh interval or disable it altogether

If your search results do not need near-real-time accuracy, consider refreshing the index less often. The default refresh_interval is 1s; while reindexing, you can set each index's refresh_interval to 30s or disable refresh entirely (-1).

If you are importing a large amount of data, and Reindex is exactly that scenario, set this value to -1 to disable refresh. Remember to set it back to the desired value when you are finished!
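Because forgetting to restore the setting is an easy mistake, one option is to wrap the change in a context manager so the restore always runs. This is a minimal Python sketch under stated assumptions: `send(method, path, body)` is a hypothetical callable standing in for your HTTP client, and the function name is invented for illustration.

```python
from contextlib import contextmanager


@contextmanager
def temporary_index_settings(send, index, during, after):
    """Apply `during` settings to `index`, then apply `after` on exit.

    `send` is a hypothetical transport callable taking
    (method, path, body); plug in your own HTTP client.
    """
    path = f"/{index}/_settings"
    send("PUT", path, during)
    try:
        yield
    finally:
        # Runs even if the reindex fails, so refresh is not left disabled.
        send("PUT", path, after)


if __name__ == "__main__":
    sent = []

    def record(method, path, body):
        # Stand-in transport that only records the requests.
        sent.append((method, path, body))

    with temporary_index_settings(
        record, "index_name",
        during={"refresh_interval": "-1"},
        after={"refresh_interval": "30s"},
    ):
        record("POST", "/_reindex", {"source": {"index": "old"},
                                     "dest": {"index": "index_name"}})
    for request in sent:
        print(request)
```

The same pattern would apply to number_of_replicas, with 0 as the temporary value.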

To set it:

PUT /index_name/_settings
{"refresh_interval": -1}

To restore it:

PUT /index_name/_settings
{"refresh_interval": "30s"}

Reindex optimization in practice

Index data volume: 71460992

Duration: 55 minutes

1. Set the refresh interval:

PUT /regroupmembers-20.11.23-000000/_settings
{"refresh_interval": "30s"}

2. Set the batch size:

POST _reindex
{"source": {"index": "regroupmembers-20.05.28-142940", "size": 4000}, "dest": {"index": "regroupmembers-20.11.23-000000"}}

3. Set the number of replica shards to 0
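For a sense of scale, the figures reported for this run can be turned into a throughput number. The short Python sketch below does that arithmetic, assuming the reported data volume of 71460992 counts documents; the function name is made up for illustration.

```python
def docs_per_second(total_docs, minutes):
    """Average indexing throughput over the whole run."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total_docs / (minutes * 60)


if __name__ == "__main__":
    # Figures from the run described above: 71460992 in 55 minutes.
    print(round(docs_per_second(71460992, 55)))  # prints 21655
```

Comparing this number before and after each change is a simple way to check whether a setting actually helped.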
