
How to improve ElasticSearch indexing speed


This article explains how to improve ElasticSearch indexing speed. The content is meant to be easy to follow and should help clear up common doubts about the topic.

I Googled the question, and the usual answers are as follows:

1. Use the bulk API.

2. When you first index, set the number of replicas to 0.

3. Increase threadpool.index.queue_size.

4. Increase indices.memory.index_buffer_size.

5. Increase index.translog.flush_threshold_ops.

6. Increase index.translog.sync_interval.

7. Increase index.engine.robin.refresh_interval.
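As a concrete point of reference, here is a minimal sketch of applying a few of these tips (1, 2 and 7) with the Python elasticsearch client; the index name, shard count and values are purely illustrative, and the exact setting names differ slightly between ES versions:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed address of a cluster node

# Tips 2 and 7: create the index with no replicas and a long refresh interval for the initial load.
es.indices.create(
    index="logs-bulk-load",                  # illustrative index name
    body={
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 0,         # tip 2: no replicas during the first load
            "refresh_interval": "30s",       # tip 7: refresh far less often than the default 1s
        }
    },
)

# Tip 1: index with the bulk API instead of one request per document.
actions = ({"_index": "logs-bulk-load", "_source": {"msg": "line %d" % i}} for i in range(10000))
helpers.bulk(es, actions)

# After the load, restore replicas and a normal refresh interval.
es.indices.put_settings(
    index="logs-bulk-load",
    body={"number_of_replicas": 1, "refresh_interval": "1s"},
)

Note that items 3 and 4 are node-level settings and are normally configured in elasticsearch.yml rather than through the index settings API.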

This article describes the principles behind these parameters, along with some other ideas. Broadly, these optimizations work in two directions:

Reduce disk writes

Increase the resources devoted to index building

Generally speaking, the second direction needs to be used carefully, because it can have a large impact on the cluster's query performance.

There are also two forms of solutions:

Turn off some features that are not needed in a particular scenario, such as Translog or Version.

Move part of the computation out to another parallel computing framework; for example, the partitioning of data across shards can be computed in advance on Spark.

What are the above parameters related to?

Items 5 and 6 are related to the Translog.

Item 4 is related to Lucene's in-memory indexing buffer.

Item 3: thread pools are used extensively in ES, and index building has its own dedicated thread pool and queue.

Item 7: personally, I don't think it has much impact.

Item 2 can only be used in a limited number of scenarios. Personally, I think replicas could use something like Kafka's ISR mechanism: all data is still written to and read from the primary, while the replica serves as a backup copy as far as possible.

Translog

Why is there a Translog at all? Because appending to the Translog sequentially is more efficient than building the index: committing after every single record would generate a large number of small files and a great deal of disk IO. At the same time we want to avoid losing data when the program or the hardware fails, hence the Translog. This kind of log is usually called a Write Ahead Log (WAL).

To ensure data integrity, ES by default performs a sync operation at the end of each request. That code path calls the IndexShard.sync method to persist the Translog to disk.

You can also set index.translog.durability=async to persist it asynchronously. The name async here can be a little misleading. As described above, by default a sync is performed after each request, and that sync only persists the Translog. Whether or not you set async, ES will still do the following: roughly every sync_interval (5s), if any of flush_threshold_ops (Integer.MAX_VALUE), flush_threshold_size (512mb) or flush_threshold_period (30m) is reached, a flush is carried out; the flush not only commits the Translog but also commits the Lucene index.

So if you are ingesting a huge volume of logs and can tolerate losing some data in the event of a failure, you can set index.translog.durability=async and increase the flush* parameters mentioned above.
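A minimal sketch of loosening these Translog settings on an existing index, again with the Python client; the flush_threshold_* names follow the older ES versions this article discusses (newer versions keep durability and sync_interval but have dropped or renamed some of the flush thresholds):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# Trade durability for throughput: fsync the Translog periodically instead of on every
# request, and let far more data accumulate before a flush (which also commits the index).
es.indices.put_settings(
    index="logs-bulk-load",                       # illustrative index name
    body={
        "index.translog.durability": "async",
        "index.translog.sync_interval": "30s",
        "index.translog.flush_threshold_size": "1gb",
    },
)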

In extreme cases, you have two choices:

Set index.translog.durability=async and index.translog.disable_flush=true to disable the periodic flush, then control flushing manually from the application itself.

Remove the Translog-related functionality entirely by modifying ES itself.

Of course, if you remove the Translog, there are two risks:

Real-time Get by ID will be affected, because fetching the latest version of a document by ID reads it from the Translog.

We know that ES relies on shard replication to keep data safe when a node fails. During relocation, when a replica recovers from the primary, the primary first takes a Lucene snapshot, then copies the data to the replica, and finally replays the Translog to guarantee data consistency.
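For context, the real-time Get that depends on the Translog is the ordinary Get-by-ID API; a minimal sketch (index name and id are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# A real-time Get by ID: in the ES versions discussed here, if the document has not yet
# been refreshed into a Lucene segment, the latest version is served from the Translog.
doc = es.get(index="logs-bulk-load", id="some-id", realtime=True)
print(doc["_source"])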

Version

Versioning is what lets ES handle concurrent modifications, but its performance cost is considerable. There are two main parts to that cost:

The version number has to be looked up in the index, which triggers disk reads and writes

Locking mechanism

At present there seems to be no way to switch the version mechanism off directly. You can, however, use auto-generated IDs and set the operation type to create when indexing, which skips the version check.

This mainly applies to importing immutable logs. ES is used more and more for log analysis, and log records have no natural primary key and are never updated, so auto-generated IDs (and effectively a fixed version number) are a good fit. Immutable log ingestion is usually all about throughput.
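A minimal sketch of the auto-generated-ID plus create trick with the Python bulk helper (the _op_type field is the client's way of choosing the operation type; the index name is illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed address

# No _id is supplied, so ES generates one; _op_type "create" declares the document as new,
# which lets ES skip the version lookup it would otherwise do for an "index" operation.
actions = (
    {"_op_type": "create", "_index": "logs-append-only", "_source": {"msg": line}}
    for line in ["log line 1", "log line 2"]
)
helpers.bulk(es, actions)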

Of course, we can also disable versioning by rewriting ES-related code if necessary.

Distribution agent

ES shards the index, and the data is distributed across different shards. This actually creates a problem for both querying and index building:

When building an index, the data has to be partitioned and distributed to the different nodes according to the shard layout.

When querying, the externally facing node has to collect the results from each shard and merge them.

This puts a lot of pressure on the externally facing nodes and slows down the whole bulk/query path.

One feasible scheme is to run the nodes that expose the index-building and query APIs to clients as client nodes that store no data; this alone achieves a certain amount of optimization.

Another solution, more troublesome but seemingly better, applies if you use a streaming program such as Spark Streaming: when writing out to ES you can do the following (a rough sketch follows the list of advantages below):

Get the information for all primary shards and assign each shard a sequence number, producing a mapping from partition (sequence number) -> shardId.

Repartition the data so that, after partitioning, each partition corresponds to exactly one shard.

Iterate over these partitions and write them to ES, sending the data in batches directly over RPC to the node that holds the corresponding shardId, similar to transportService.sendRequest.

This has three advantages:

All data is distributed directly to the nodes that will process it, instead of being funnelled through one server first.

A second round of distribution is avoided, reducing one hop of network IO.

The "shortest plank in the barrel" effect, where a single overloaded node drags down the whole process, is prevented.
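A rough PySpark-flavoured sketch of the scheme described above. Everything here is an assumption for illustration: es_shard_id would have to reproduce ES's routing hash (murmur3 of the routing value modulo the number of primary shards), and write_partition_to_shard_node would look up, from the cluster state fetched on the driver, which node holds that shard and send the documents to it in bulk:

from pyspark import SparkContext

NUM_PRIMARY_SHARDS = 5  # assumed; in practice read it from the index settings

def es_shard_id(doc_id, num_shards):
    # Placeholder routing function. A real implementation must use the same
    # murmur3 variant as ES, otherwise documents land on the wrong shard.
    return hash(doc_id) % num_shards

def write_partition_to_shard_node(docs):
    # Placeholder writer. In the scheme above this would send the batch directly
    # to the node hosting the shard, e.g. via transportService.sendRequest or a
    # bulk request addressed to that node.
    for doc in docs:
        pass

sc = SparkContext(appName="es-direct-shard-write")
records = sc.parallelize([{"id": str(i), "msg": "line %d" % i} for i in range(1000)])

(records
 .map(lambda d: (es_shard_id(d["id"], NUM_PRIMARY_SHARDS), d))  # step 1: key by target shard
 .partitionBy(NUM_PRIMARY_SHARDS)                               # step 2: one partition per shard
 .values()
 .foreachPartition(write_partition_to_shard_node))              # step 3: write each partition to its shard's node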

Scenario

Since I happen to work on log analysis applications, where high throughput is the goal, all three of the optimizations above can actually be applied. For a typical log workload that only appends and never updates, you can use the following scheme:

Integrate with Spark Streaming, partition the data in Spark, and push it directly to the individual ES nodes.

Disable the automatic flush and flush manually at the end of each batch (see the sketch after this list).

Avoid using Version

By controlling the interval and size of each batch, we can anticipate how many new segment files ES will produce and roughly predict segment sizes and merges, which may reduce some of ES's extra overhead.
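A minimal sketch of the manual-flush-per-batch part with the Python client; index.translog.disable_flush is the setting name from the older ES versions discussed in the Translog section, while flush() itself is the standard API:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed address

# Turn off automatic flushing so ES does not commit in the middle of a batch
# (setting name from the older ES versions this article is based on).
es.indices.put_settings(index="logs-append-only",
                        body={"index.translog.disable_flush": True})

def index_batch(batch):
    # Push one micro-batch (e.g. one Spark Streaming interval) with the bulk API...
    helpers.bulk(es, ({"_op_type": "create", "_index": "logs-append-only", "_source": doc}
                      for doc in batch))
    # ...then flush once, so Translog commit and Lucene commit happen per batch, not per write.
    es.indices.flush(index="logs-append-only")

index_batch([{"msg": "line 1"}, {"msg": "line 2"}])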

That is all of "how to improve ElasticSearch indexing speed". Thank you for reading; I hope it helps.
