

How many shards should be set for Elasticsearch?

2025-04-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many newcomers are unsure how many shards to configure for Elasticsearch. To help answer that question, this article explains the considerations in detail. If you have this need, read on; I hope you get something out of it.

0. Introduction

If cluster sharding is not set up sensibly when an Elasticsearch cluster is first built, performance problems may surface in the middle and later stages of the project.

Elasticsearch is a very general-purpose platform that supports a wide variety of use cases and offers great flexibility in how data is organized and replicated. That flexibility can make it hard for a newcomer to the ELK stack to decide how to organize data into indexes and shards. Although problems do not necessarily appear at first startup, they can emerge as the data volume grows over time. The more data the cluster holds, the harder the problem is to correct, and sometimes large amounts of data even need to be reindexed.

When we encounter users with performance problems, the cause can often be traced back to how the data was indexed and to the number of shards in the cluster. This is especially true for users with multi-tenancy or time-based indexes. When discussing this issue with users (in meetings and forums), the most common questions are:

1) "How many shards should I have?"

2) "How big should my shards be?"

The sections below help answer these questions and provide practical guidance for use cases that rely on time-based indexes, such as logging or security analytics.

1. What is a shard?

Before we begin, let's agree on some of the concepts and terminology used in the article.

The data in Elasticsearch is organized into indexes. Each index consists of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes a portion of the data and handles queries for it within an Elasticsearch cluster.

[Refresh] As data is written to a shard, it is periodically published to new immutable Lucene segments on disk, at which point it becomes available for queries. This is called a refresh. For a more detailed explanation, see: http://t.cn/R05e3YR

[Merge] As the number of segments grows, they are periodically consolidated into larger segments. This process is called merging.

Because all segments are immutable, merging creates new, larger segments and then deletes the old ones, which means disk usage usually fluctuates during indexing. Merging can be quite resource-intensive, especially in terms of disk I/O.

A shard is the unit by which Elasticsearch distributes data around the cluster. The speed at which Elasticsearch moves shards when rebalancing data, for example after a failure, depends on the size and number of shards as well as network and disk performance.

Tip: avoid very large shards, as they can hurt the cluster's ability to recover from failures. There is no fixed limit on shard size, but a shard size of 50GB is commonly cited as a limit that works across a variety of use cases.
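The 50GB guideline above is easy to turn into a quick sanity check. Here is a minimal sketch in Python; the shard names and sizes are hypothetical, and in practice the list could be parsed from the output of `GET _cat/shards?bytes=b`.

```python
# Flag shards above the ~50GB rule-of-thumb limit mentioned in the text.
MAX_SHARD_BYTES = 50 * 1024**3  # 50 GiB guideline, not a hard Elasticsearch limit

def oversized_shards(shards):
    """Return names of shards whose store size exceeds the guideline.

    `shards` is a list of (name, size_in_bytes) tuples, e.g. parsed
    from the _cat/shards API output.
    """
    return [name for name, size_bytes in shards if size_bytes > MAX_SHARD_BYTES]

# Hypothetical example data:
example = [
    ("logs-2025.04.01[0]", 38 * 1024**3),  # within the guideline
    ("logs-2025.04.02[0]", 62 * 1024**3),  # too large; may slow recovery
]
print(oversized_shards(example))  # ['logs-2025.04.02[0]']
```

A check like this can run periodically so oversized shards are caught before they complicate recovery.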

2. Index retention period

Because segments are immutable, updating a document requires Elasticsearch to first find the existing document, mark it as deleted, and index an updated version. Deleting a document likewise requires finding it and marking it as deleted. Deleted documents therefore continue to occupy disk space and some system resources until they are merged away, which can consume significant system resources.

Elasticsearch allows complete indexes to be deleted directly from the file system, without having to delete every record individually. This is by far the most efficient way to delete data from Elasticsearch.

Tip: use time-based indexes to manage data whenever possible, grouping data by its retention period (how long it must remain available). Time-based indexes also make it easy to change the number of primary and replica shards over time (for the next index to be generated), which simplifies adapting to changing data volumes and requirements.

3. Aren't indexes and shards free?

[Cluster state] For each Elasticsearch index, mapping and state information is stored in the cluster state, which is held in memory for fast access. A large number of indexes in the cluster can therefore produce a large cluster state (especially when mappings are large). Since all cluster-state updates must go through a single thread to guarantee cluster consistency, updates become slower as the state grows.

Tip: to reduce the number of indexes and avoid large or even huge mappings, consider storing data with the same index structure in the same index rather than splitting it into separate indexes by data source. It is important to strike a good balance between the number of indexes and the mapping size of each index.

Each shard holds data that must be kept in memory and uses heap space. This includes data structures holding information at the shard level, as well as per-segment data structures that record where the data resides on disk. The size of these data structures is not fixed and varies by use case.

However, an important characteristic of segment-related overhead is that it is not proportional to segment size. A larger segment carries less overhead per unit of data than a smaller one, and the difference can be substantial.

[Importance of heap memory] To store as much data as possible per node, it is important to manage heap usage carefully and reduce overhead wherever possible. The more heap space a node has, the more data and shards it can handle.

From the cluster's perspective, indexes and shards are therefore not free: each one carries some resource overhead.

Tip 1: small shards lead to small segments, which increases overhead. Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, shards between 20GB and 40GB are typical.

Tip 2: since the cost of each shard depends on the number and size of its segments, forcing smaller segments to merge into larger ones can reduce overhead and improve query performance. Ideally this is done once no more data is being written to the index. Note that force merging is an expensive, resource-intensive operation and should be performed during off-peak hours.
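The force merge described above is exposed through Elasticsearch's force merge API (`POST /<index>/_forcemerge`). A minimal sketch of building such a request, with a hypothetical host and index name:

```python
# Sketch: build the force-merge endpoint URL for a read-only, time-based
# index. Endpoint shape follows the Elasticsearch force merge API; the
# host and index name below are hypothetical examples.

def forcemerge_url(host, index, max_num_segments=1):
    """URL for force-merging an index that is no longer being written to.

    max_num_segments=1 collapses the index into a single segment,
    minimizing per-segment overhead at the cost of an expensive merge.
    """
    return f"{host}/{index}/_forcemerge?max_num_segments={max_num_segments}"

url = forcemerge_url("http://localhost:9200", "logs-2025.03")
print(url)
# Would be issued as an HTTP POST, e.g. with requests.post(url),
# ideally during off-peak hours.
```

Because the operation rewrites segments, it should only target indexes that will receive no further writes, as the text notes.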

Tip 3: the number of shards a cluster node can hold is proportional to the available heap memory, but Elasticsearch enforces no fixed limit. A good rule of thumb is to keep the number of shards per node below 20-25 per GB of heap. A node with 30GB of heap can therefore hold at most 600-750 shards, and staying well below that limit is even better. This generally helps keep the cluster healthy.
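The heap-based rule of thumb above is simple arithmetic; a small sketch makes the budget explicit (the figures are guidelines from the text, not hard limits):

```python
# Back-of-the-envelope shard budget per node, using the
# "20-25 shards per GB of heap" rule of thumb from the text.

def max_shards_for_heap(heap_gb, shards_per_gb=20):
    """Rough upper bound on shards a node should hold for a given heap size."""
    return heap_gb * shards_per_gb

print(max_shards_for_heap(30))      # 600 (conservative end of the range)
print(max_shards_for_heap(30, 25))  # 750 (upper end of the range)
```

Treating the lower figure as the working budget, and staying under it, leaves headroom for mappings, caches, and rebalancing.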

4. How does shard size affect performance?

In Elasticsearch, each query runs in a single thread per shard. Multiple shards can be processed in parallel, however, and multiple queries and aggregations can run against the same shard.

[Pros and cons of small shards] This means that, when no caching is involved, the minimum query latency depends on the data, the query type, and the shard size. Querying many small shards makes per-shard processing faster, but since more tasks must be queued and processed in sequence, it is not necessarily faster than querying fewer, larger shards. Many small shards can also reduce query throughput when there are multiple concurrent queries.

Tip: the best way to determine the maximum shard size from a query-performance perspective is to benchmark with realistic data and queries (real data rather than simulated data). Always benchmark with query and index loads that represent what the node will handle in production, because optimizing for a single query can produce misleading results.

5. How to manage the shard size?

When using time-based indexes, each index has traditionally covered a fixed time period. Daily indexes are very common and are often used for data with a short retention period or large daily volumes. They allow retention to be managed at fine granularity, and volumes can easily be adjusted day by day.
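Daily indexes conventionally follow a `<prefix>-YYYY.MM.DD` naming pattern, which makes whole days trivial to delete when they age out. A minimal sketch of generating such names (the `logs` prefix and dates are hypothetical):

```python
# Sketch: generate daily index names in the common <prefix>-YYYY.MM.DD form
# used for time-based indices.
from datetime import date, timedelta

def daily_index_name(prefix, day):
    """Name of the daily index covering the given calendar day."""
    return f"{prefix}-{day:%Y.%m.%d}"

start = date(2025, 4, 1)
names = [daily_index_name("logs", start + timedelta(days=i)) for i in range(3)]
print(names)  # ['logs-2025.04.01', 'logs-2025.04.02', 'logs-2025.04.03']
```

Once the retention period for a day has elapsed, its entire index can be dropped in one operation, which is the efficient deletion path described earlier.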

For data with longer retention periods, especially when daily volumes do not justify a daily index, weekly or monthly indexes are often used instead to keep shard sizes up. This reduces the number of indexes and shards the cluster must hold over time.

Tip: if you use indexes covering fixed time periods, adjust the period each index covers according to the retention period and expected data volume, so as to reach the target shard size.
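That adjustment is straightforward to estimate up front. A rough sizing sketch, where the daily ingest, shard count, and 30GB target are all hypothetical inputs:

```python
# Rough sizing sketch: given expected daily ingest and a target shard size,
# estimate how many days one index should cover. All inputs are examples.
import math

def days_per_index(daily_gb, primary_shards, target_shard_gb=30):
    """Days of data one index can cover while keeping shards near the target size."""
    capacity_gb = primary_shards * target_shard_gb
    return max(1, math.floor(capacity_gb / daily_gb))

# e.g. 10GB/day into an index with 2 primary shards, targeting ~30GB shards:
print(days_per_index(daily_gb=10, primary_shards=2))  # 6 -> roughly weekly indices
```

If the estimate comes out near 1, daily indexes fit; near 7 or 30, weekly or monthly indexes keep shards close to the target size.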

[Steady vs. rapidly changing data volumes] Time-based indexes with fixed time intervals work well when the data volume is reasonably predictable and changes slowly. If the indexing rate can change rapidly, it is difficult to maintain a uniform target shard size.

To better handle this situation, the Rollover and Shrink APIs were introduced. They add flexibility in how indexes and shards are managed, particularly for time-based indexes.
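The Rollover API switches writes to a fresh index once any configured condition is met, so shard size tracks actual volume rather than the calendar. A sketch of a rollover request body, following the API's condition fields; the alias name and thresholds are hypothetical:

```python
# Sketch of a rollover request body per the Elasticsearch rollover API.
# Thresholds below are example values, not recommendations.
import json

rollover_body = {
    "conditions": {
        "max_age": "7d",         # roll over after a week at the latest...
        "max_size": "40gb",      # ...or sooner, once the index reaches ~40GB
        "max_docs": 100_000_000  # ...or once it holds this many documents
    }
}

# Would be sent as: POST /logs-write/_rollover  (logs-write is a write alias)
print(json.dumps(rollover_body, indent=2))
```

Whichever condition trips first triggers the rollover, so a sudden spike in indexing rate produces a new index early instead of an oversized shard.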
