What is the experience of large-scale Elasticsearch cluster management?



At present, Internet companies use Elasticsearch mainly in two scenarios. One is building the search features of a product, mostly vertical search, where the data volume is typically in the tens of millions to billions of documents. The other is real-time OLAP over large-scale data, typified by the ELK Stack, where the data can reach hundreds of billions of documents or more. The two scenarios differ considerably in how data is indexed and accessed, so the emphasis in hardware selection and cluster tuning also differs. Broadly speaking, the latter is a big-data scenario: the data volume and cluster size are larger, and management is more challenging.

At Medcl's invitation, I am kicking off this year's Advent series for the Elasticsearch Chinese community by sharing my experience managing my company's ES clusters for log analysis. I hope it helps, at least in terms of overall direction.

That company is Ctrip.com. Since we first adopted ES in 2013, our team has worked through the versions from 0.9.x to 5.0.0, growing from analyzing only the operations team's internal IIS logs to supporting real-time search and analysis of more than 200 kinds of log data from IT, the call center, security, testing, business R&D, and other departments. Along the way we have hit plenty of pitfalls and accumulated a fair amount of experience.

Currently our largest log cluster has 120 data nodes running on 70 physical servers. Its data volume is as follows:

About 60 billion documents are indexed per day, adding roughly 25 TB of new index files (50 TB including one replica).

During business peaks the indexing rate is sustained at millions of documents per second.

Historical data is retained for 10 to 90 days, depending on business requirements.

The cluster holds 3,441 indices and 17,000 shards, about 930 billion documents in total, consuming 1 PB of disk.

There are more than 600 Kibana users, and Kibana plus third parties make about 630,000 API calls per day.

Query response time percentiles: 75% = 0.160 s, 90% = 1.640 s, 95% = 6.691 s, 99% = 14.0039 s.

What deserves attention when running an ES cluster at this scale?

1. Essential tools

From day one, even with only a handful of nodes, you should deploy the cluster with a distributed configuration management tool. As the application matures and the cluster grows, the efficiency gains become more and more pronounced. Elastic officially provides an ES Puppet module and a Chef cookbook, which those familiar with either tool can use directly. We use Ansible ourselves and wrote a set of playbooks to the same effect. Being fluent with such a tool makes initial cluster deployment, batch configuration changes, cluster version upgrades, and restarting failed nodes much faster and safer.

The second essential tool is the Sense plugin. It lets you call the cluster's RESTful API directly, which is very convenient for checking cluster and index status and for changing index settings. Its syntax hints and auto-completion are particularly practical and cut down on trips to the documentation. In Kibana 5, Sense has become the built-in Console and no longer needs a separate installation.
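
For illustration, here is a minimal sketch of the kind of status checks and settings changes mentioned above, issued against the REST API with Python's requests library rather than from the Sense console; the address and index name are placeholders.

```python
import requests

ES = "http://localhost:9200"  # placeholder address of a client/coordinating node

# Cluster and index status checks (the same calls you would type into Sense/Console).
print(requests.get(f"{ES}/_cat/health?v").text)
print(requests.get(f"{ES}/_cat/indices?v&h=index,pri,rep,docs.count,store.size").text)

# Example index settings change: raise the replica count of a hypothetical index.
resp = requests.put(
    f"{ES}/app-log-2016.12.01/_settings",
    json={"index": {"number_of_replicas": 1}},
)
resp.raise_for_status()
print(resp.json())
```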

2. Hardware configuration

We use servers with 32 vCPU cores and 128 GB of RAM. For storage, most servers run RAID 0 across twelve 4 TB SATA spinning disks, while a small number of recently added machines run RAID 0 across six 800 GB SSDs. The main purpose of the two tiers is to separate hot and cold data; how we exploit these hardware resources is explained further in the cluster management section.

3. Cluster management

First, divide and isolate node roles. As everyone knows, besides holding data, an ES data node can also act as master and as client, and most people simply let data nodes play all three roles. For a large cluster with many users, however, the master and client duties can hit performance bottlenecks or even run out of memory under extreme usage, taking the co-located data node down with them. Recovering a failed data node involves data migration, which consumes cluster resources and can easily cause write delays or slow queries. If master and client are separated out, then when one of them has a problem it recovers almost instantly after a restart, with virtually no impact on users. In addition, once these roles run independently, their compute consumption is separated from that of the data nodes, which makes the relationship between data node resource usage and write/query volume much easier to understand, and capacity management and planning more straightforward.
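
As a rough sketch of what this role split looks like in configuration, the snippet below renders the relevant elasticsearch.yml role settings per node type, assuming ES 5.x setting names (2.x has no node.ingest); the node names are made up.

```python
# Minimal sketch: render elasticsearch.yml role settings (ES 5.x setting names)
# for the three node types discussed above. Node names are hypothetical.
ROLE_SETTINGS = {
    "master": {"node.master": True,  "node.data": False, "node.ingest": False},
    "client": {"node.master": False, "node.data": False, "node.ingest": False},
    "data":   {"node.master": False, "node.data": True,  "node.ingest": True},
}

def render_role_config(node_name: str, role: str) -> str:
    lines = [f"node.name: {node_name}"]
    for key, value in ROLE_SETTINGS[role].items():
        lines.append(f"{key}: {str(value).lower()}")
    return "\n".join(lines)

if __name__ == "__main__":
    for name, role in [("es-master-1", "master"), ("es-client-1", "client"), ("es-data-1", "data")]:
        print(render_role_config(name, role), end="\n\n")
```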

Avoid excessive concurrency; this covers both the shard count and the threadpools. As long as write throughput and query performance are satisfied, give each index as few shards as possible. Too many shards has many negative effects: each query has more partial results to merge and sort; excessive concurrent thread switching burns CPU; index deletion and settings updates become slow (Issue#18776); and more shards also means more small segments, and too many small segments cause very significant heap consumption, especially when the search threadpool is configured too large. Oversizing the threadpools can likewise lead to all sorts of strange performance problems; what Issue#18161 describes is exactly what we ran into. The default threadpool sizes generally work quite well.
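
As a minimal illustration of pinning down the shard count at index-creation time rather than relying on defaults (the index name and numbers here are just placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# Create a hypothetical daily log index with a deliberately small shard count.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 3,    # as few primaries as throughput allows
            "number_of_replicas": 1,  # one replica, matching the sizing above
        }
    }
}
resp = requests.put(f"{ES}/app-log-2016.12.01", json=settings)
resp.raise_for_status()
print(resp.json())
```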

It is best to separate hot and cold data. Log applications typically create a new index every day, and the day's hot index sees the most queries while it is being written. If very old cold data sits on the same nodes, a user running a query spanning a long history can generate enough disk IO and CPU load to slow down writes and delay data. So we dedicate some machines to cold data: we give those nodes a custom attribute, "boxtype": "weak", and every night a maintenance script updates the routing of cold indices by setting index.routing.allocation.{require|include|exclude}, so the data migrates to the cold nodes automatically (a sketch of such a script follows below).

Cold data is no longer written and is queried infrequently, but its volume can be very large. For example, one of our indices grows by 2 TB a day and users want the last 90 days to stay searchable. Keeping that many indices open costs more than just disk space: to access index files on disk quickly, ES keeps part of their data resident in memory (the index of the index files), known as segment memory. Anyone somewhat familiar with ES knows the JVM heap allocation should not exceed 32 GB. On our machines with 128 GB of RAM and 48 TB of disk, running a single ES instance would let us use less than 32 GB of heap, and that heap would be nearly saturated with less than 10 TB of index files on disk, which is clearly uneconomical. So we run three ES instances on each cold node, each with a 31 GB heap, which lets one physical server hold more than 30 TB of index data and keep it open for users to search at any time. In practice, because cold data is searched infrequently and never written, the remaining 35 GB of memory left to the OS for the file system cache still gives query performance that meets demand.
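
Here is a minimal sketch of the kind of nightly maintenance script described above. The boxtype: weak attribute comes from the text; the index naming convention (a trailing YYYY.MM.DD date) and the 3-day cutoff are assumptions for illustration.

```python
import datetime
import requests

ES = "http://localhost:9200"   # placeholder coordinating-node address
COLD_AFTER_DAYS = 3            # example cutoff; real values vary per index

def move_cold_indices_to_weak_boxes():
    cutoff = datetime.date.today() - datetime.timedelta(days=COLD_AFTER_DAYS)
    # List all index names; assume they end in YYYY.MM.DD, e.g. app-log-2016.12.01.
    names = requests.get(f"{ES}/_cat/indices?h=index").text.split()
    for name in names:
        try:
            day = datetime.datetime.strptime(name[-10:], "%Y.%m.%d").date()
        except ValueError:
            continue  # skip indices without a date suffix
        if day <= cutoff:
            # Require allocation on nodes tagged with the custom attribute boxtype=weak;
            # ES then migrates the shards to the cold nodes automatically.
            requests.put(
                f"{ES}/{name}/_settings",
                json={"index.routing.allocation.require.boxtype": "weak"},
            ).raise_for_status()

if __name__ == "__main__":
    move_cold_indices_to_weak_boxes()
```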

Shards of very different sizes are best isolated onto different groups of nodes. ES balances shard placement across the cluster on its own, and its balancing logic considers three main factors: first, shards of the same index should spread across different nodes as much as possible; second, the shard count on each node should be as even as possible; and third, each node's disk should have enough free space. This strategy only evens out the number of shards per node, not the amount of data. In practice we have more than 200 kinds of indices whose volumes vary enormously, from several TB per day down to a few GB per month, and their retention periods also differ widely. The question this raises is how to balance the load so that every node's resources are fully used. Our answer, again, is to tag nodes with attributes to form groups and combine that with index routing controls for finer-grained placement: data of different sizes goes to different node groups, so that within each group the per-node data volume balances out easily on its own (see the template sketch below).
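
For example, an index template can pin a family of large indices onto its own node group through the routing allocation settings. This is only a sketch: the group attribute datagroup: heavy, the index pattern, and the shard count are all made up, and the "template" key is the 2.x/5.x form.

```python
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# Hypothetical template: route all indices matching "big-app-log-*" onto the
# group of nodes tagged with the custom attribute datagroup=heavy, so their
# shards balance among themselves instead of crowding out smaller indices.
template = {
    "template": "big-app-log-*",  # index name pattern ("template" key on ES 2.x/5.x)
    "order": 1,
    "settings": {
        "index.routing.allocation.include.datagroup": "heavy",
        "index.number_of_shards": 6,
    },
}
requests.put(f"{ES}/_template/big-app-log", json=template).raise_for_status()
```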

Force merge indices on a regular schedule, ideally merging each shard down to a single segment. As mentioned earlier, heap consumption is related to the segment count, and force merging reduces it significantly. A further benefit of merging down to one segment is that terms aggregations no longer need to build global ordinals at search time, which speeds up those aggregations.
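
A minimal sketch of such a merge, run against a hypothetical index that is no longer being written; the endpoint is _forcemerge on ES 5.x (on 2.x it was called _optimize).

```python
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# Merge yesterday's (hypothetical) index down to one segment per shard.
# Use /_optimize instead of /_forcemerge when running against ES 2.x.
resp = requests.post(
    f"{ES}/app-log-2016.11.30/_forcemerge",
    params={"max_num_segments": 1},
)
resp.raise_for_status()
print(resp.json())
```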

4. Version selection

We ran version 2.4 stably for a long time. More conservative teams can stick with 2.4; more adventurous and energetic ones can consider the latest 5.0. Our cluster was upgraded from v2.4.0 to v5.0.0 two weeks ago, and apart from one stability problem hit in the first week, we feel the following new features make the upgrade worthwhile:

The node bootstrap process now validates many key system settings, such as max file descriptors, memory lock, and virtual memory, and refuses to start, throwing an exception, if a setting is wrong. Failing at startup and telling the user exactly what is wrong is a much better design than starting with bad system settings and running into performance problems later.

Indexing performance is improved. After the upgrade, at the same indexing rate we see a marked drop in CPU consumption, which not only leaves headroom for a higher indexing rate but also helps search speed to some extent.

New numeric data structures take less storage space and make range and geo calculations faster.

Instant Aggregation can cache aggregations whose range queries use relative times such as now-7d to now. The effect is obvious in practice: when users run an aggregation over the past week's data in Kibana, the first refresh or two are slow, after which results come back almost instantly from the cache (a sketch of this kind of query follows after this feature list).

More safeguards for cluster stability, such as a limit on the number of shards a single search can hit and enhanced circuit breakers, which better protect cluster resources from being exhausted by bad queries.
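
For reference, a query of the "past week" shape that Instant Aggregation can serve from cache after the first run might look like the sketch below; the index pattern, field name, and use of Python's requests are assumptions for illustration.

```python
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# Hypothetical "past week" request: a relative date range plus a daily
# date_histogram aggregation over a made-up daily-log index pattern.
body = {
    "size": 0,
    "query": {
        "range": {"@timestamp": {"gte": "now-7d", "lte": "now"}}
    },
    "aggs": {
        "events_per_day": {
            "date_histogram": {"field": "@timestamp", "interval": "1d"}
        }
    },
}
resp = requests.post(f"{ES}/app-log-*/_search", json=body)
resp.raise_for_status()
print(resp.json()["aggregations"]["events_per_day"]["buckets"])
```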

In the first week after the upgrade our cold data nodes had intermittent periods of unresponsiveness, out of which we dug up three issues and reported them upstream:

Issue#21595 Issue#21612 Issue#21611

The first has been confirmed as a bug and will be fixed in 5.0.2. The root causes of the other two are not yet clear and they seem to occur only in our particular scenario, but both have workarounds; since applying them, the cluster has been back to the same level of stability we had on 2.4 for the past week.

5. Monitoring

If money is no object and you have no time to tinker, the worry-free option is to buy the official X-Pack; if you have the energy, use ES's rich stats APIs, collect the data with whatever monitoring tools you are already familiar with, and visualize it. Among the many metrics, the most important are the following:

Thread pool usage, visualized as active/queue/rejected counts for each pool. Whether the various queues run very high during business peaks and whether rejections occur frequently tells you, by and large, whether the cluster has a performance bottleneck.

JVM heap used% and old-GC frequency. If old GC runs very frequently and heap used% barely comes down even after repeated collections, heap pressure is too high and you should consider scaling out. (It may also be caused by problematic queries or aggregations, which you have to judge against the user access records.)

Segment memory size and segment count. These two metrics are worth watching when a node stores many indices. Keep in mind that segment memory is resident in the heap and is never reclaimed by GC, so when heap pressure is high this metric helps you judge whether the node simply holds too much data and needs to be scaled out. The segment count is also critical: if there are a huge number of small segments, say thousands, then even when segment memory itself is modest, a large number of search threads can still eat a considerable amount of heap, because Lucene records per-segment state in thread-local storage and that memory cost scales with (number of segments × number of threads).
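
As a small sketch of collecting the metrics above from the stats API with Python's requests (the address is a placeholder), something like the following pulls thread pool rejections, heap usage, and segment count/memory per node:

```python
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# Pull the per-node metrics discussed above from the stats API:
# thread pool rejections, JVM heap usage, and segment count / segment memory.
stats = requests.get(f"{ES}/_nodes/stats/thread_pool,jvm,indices").json()

for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    segments = node["indices"]["segments"]["count"]
    seg_mem_gib = node["indices"]["segments"]["memory_in_bytes"] / 2**30
    rejected = {
        pool: data["rejected"]
        for pool, data in node["thread_pool"].items()
        if data.get("rejected")
    }
    print(f"{node['name']}: heap {heap_pct}%, {segments} segments, "
          f"{seg_mem_gib:.1f} GiB segment memory, rejections {rejected}")
```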

It is essential to record users' access. We expose only the HTTP API to users, with an nginx in front as an HTTP proxy, so every call from users and third-party APIs is captured in its access log. Analyzing those records lets us quickly pin down the root cause whenever the cluster has a performance problem, which is a great help for troubleshooting and performance tuning.

