Author: Anton Hägerstrand
Translator: Yang Zhentao
Table of contents:
1. Amount of data
2. Version
3. Node configuration
4. Index structure
5. Performance
Meltwater processes millions of posts every day, so it needs storage and retrieval technology that can handle data at this scale.
We have been loyal Elasticsearch users since version 0.11.X. After some twists and turns, we believe we made the right technology choice.
Elasticsearch supports our main media monitoring application, which lets customers retrieve and analyze media data such as news articles, (public) Facebook posts, Instagram posts, blogs and Weibo posts. We collect this content through a mix of APIs and crawlers and lightly process it so that it can be searched by Elasticsearch.
This article will share what we've learned, how to tune Elasticsearch, and some pitfalls to get around.
If you want to know more about our Elasticsearch setup, please see our earlier blog posts on numad issues and the batch percolator.
1. Amount of data
A sizeable amount of news and social content is generated every day; at peak, about 3 million editorial articles and nearly 100 million social posts need to be indexed. Editorial data is kept searchable for a long time (back to 2009), while social post data is kept for about 15 months. The primary shards currently use about 200 TB of disk space, and the replicas about 600 TB.
We serve about 3,000 requests per minute. All requests go through a service called "search-service", which handles all interaction with the Elasticsearch cluster. Most queries are complex ones, used in dashboards and news feeds. For example, a customer may be interested in Tesla and Elon Musk but want to exclude everything about SpaceX or PayPal. Users can use a flexible query syntax similar to the Lucene query syntax, for example:
Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)
Our longest queries of this kind run to more than 60 pages. The point is: of the 3,000 requests per minute, not one is as simple as searching for "Barack Obama" in Google; they are terrible beasts that the ES nodes have to chew through to find the matching sets of documents.
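As an illustration only (not how our search-service actually talks to the cluster), a Lucene-style expression like the one above can be sent to Elasticsearch through a query_string query. The index name and field below are made up for the sketch:

# Sketch only: "editorial-2018.02.06" and "body" are assumed names, not our real ones.
curl -s 'localhost:9200/editorial-2018.02.06/_search' -d '
{
  "query": {
    "query_string": {
      "query": "Tesla AND \"Elon Musk\" NOT (SpaceX OR PayPal)",
      "default_field": "body"
    }
  }
}'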
2. Version
We run a custom version based on Elasticsearch 1.7.6. The only difference from the stock 1.7.6 release is that we backported roaring bitsets/bitmaps for use as caches. The backport was done from Lucene 5 to Lucene 4, which is what ES 1.x uses. Elasticsearch 1.x uses the default bitset cache, which is very expensive for sparse results; this has been optimized in Elasticsearch 2.x.
Why not use a newer version of Elasticsearch? The main reason is the difficulty of upgrading. Rolling upgrades between major releases are only supported from ES 5 to 6 (rolling upgrades from ES 2 to 5 should also be supported, but we have not tried them). Therefore we can only upgrade by restarting the entire cluster. Downtime is almost unacceptable for us; we could probably cope with the roughly 30-60 minutes a restart would cost, but what really worries us is that there is no real rollback procedure if something goes wrong.
So far, we have chosen not to upgrade the cluster. We would like to, but there are more urgent tasks at the moment. How the upgrade will actually be carried out has not been decided; it is quite likely that we will choose to create a new cluster rather than upgrade the existing one.
3. Node configuration
We have been running the main cluster on AWS since June 2017, using i3.2xlarge instances as data nodes. We used to run the cluster in a COLO (co-located data center), but migrated to AWS to shorten the lead time for new machines, which makes us more flexible when scaling up and down.
We run three master-eligible nodes in different availability zones and set discovery.zen.minimum_master_nodes to 2. This is a very common strategy to avoid split-brain problems.
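For illustration, a minimal sketch of that setting follows; the endpoint is assumed to be the default localhost:9200, and in practice the value can also be set statically in elasticsearch.yml on every master-eligible node:

# Sketch: set minimum_master_nodes dynamically via the cluster settings API.
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'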
In terms of storage, our dataset requires 80% capacity headroom and 3+ replicas, which puts us at around 430 data nodes. At first we planned to tier the data and store older data on slower disks, but since we only have a comparatively small amount of data older than 15 months (only editorial data, because we discard old social data), this did not pay off. The monthly hardware cost is much higher than running in COLO, but the cloud lets us double the size of the cluster in almost no time.
You may ask why we chose to manage and maintain the ES cluster ourselves. We did consider hosted options, but in the end we chose to run it ourselves: the AWS Elasticsearch Service exposes too little control to users, and Elastic Cloud would cost 2-3 times more than running the cluster directly on EC2.
To protect ourselves against the loss of an availability zone, the nodes are spread across all three availability zones of eu-west-1. We use the AWS plugin for this configuration. It provides a node attribute called aws_availability_zone, and we set cluster.routing.allocation.awareness.attributes to aws_availability_zone. This ensures that replicas are stored in different availability zones wherever possible, and that queries are routed to nodes in the same availability zone as far as possible.
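A minimal sketch of the awareness setting is shown below; the per-node aws_availability_zone attribute itself is supplied by the AWS plugin, and the endpoint is assumed to be the default:

# Sketch: tell the allocator to spread replicas across the aws_availability_zone attribute.
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "persistent": { "cluster.routing.allocation.awareness.attributes": "aws_availability_zone" }
}'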
The instances run Amazon Linux with ephemeral storage mounted as ext4 and have about 64 GB of memory. We allocate 26 GB for the heap of the ES node and leave the rest for the disk cache. Why 26 GB? Because the JVM is built on dark magic.
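The dark magic in question is compressed ordinary object pointers (oops): above roughly 30-32 GB of heap the JVM silently stops compressing object references and effectively wastes memory. A quick way to check that a given heap size still gets compressed oops, assuming the same JVM that runs ES is on the PATH:

# Prints true if a 26 GB heap still uses compressed oops on this JVM.
java -Xmx26g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops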
We use Terraform auto-scaling groups to provision the instances, and Puppet for all installation and configuration.
4. Index structure
Because our data and queries are time-based, we use time-based indexing, similar to the ELK (Elasticsearch, Logstash, Kibana) stack. Different types of data are stored in different indices, so editorial documents and social documents end up in separate daily indices. This lets us drop only the social indices when needed and adds some query optimizations. Each daily index runs with two shards.
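A hypothetical index template in the ES 1.x style can express the "two shards per daily index" rule; the template name, index pattern and replica count below are assumptions for the sketch, not our actual settings:

# Sketch only: names and counts are illustrative.
curl -s -XPUT 'localhost:9200/_template/social_daily' -d '
{
  "template": "social-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 3
  }
}'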
This setup produces a large number of shards (close to 40k). With so many shards and nodes, some cluster operations take on a special character. For example, deleting an index appears to be bottlenecked on the cluster master's ability to push the cluster state to all nodes. Our cluster state is about 100 MB, but TCP compression reduces it to about 3 MB (you can inspect your own cluster state via curl localhost:9200/_cluster/state/_all). The master still needs to push 1.3 GB per change (430 nodes x 3 MB state size). Of those 1.3 GB, about 860 MB must be transferred between availability zones (that is, essentially over the public internet). This takes time, especially when hundreds of indices are deleted. We hope newer versions of Elasticsearch improve this, starting with ES 2.0's support for sending only diffs of the cluster state.
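To get a rough feel for the size of your own cluster state as served over HTTP (uncompressed), something like the following works; the endpoint is assumed to be the default:

# Bytes of the uncompressed cluster state as returned over HTTP.
curl -s 'localhost:9200/_cluster/state/_all' | wc -c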
5. Performance
As mentioned earlier, our ES cluster needs to handle some very complex queries in order to meet the retrieval needs of our customers.
To cope with the query load, we have done a great deal of performance work over the past few years. We have had to learn our fair share about performance testing ES clusters, as the quote below suggests.
Unfortunately, less than a third of the queries completed successfully before the cluster went down. We believe the test itself brought the cluster down.
Excerpt from the first performance test with real queries on the new ES cluster platform
To control query execution, we developed a plugin that implements a set of custom query types. These query types provide features and performance optimizations that stock Elasticsearch does not offer. For example, we implemented wildcards inside phrases, executed as SpanNear queries; another optimization substitutes match-all-query for a bare "*"; and there is a series of other features.
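Our plugin is not public, but as a rough idea of the "wildcard inside a phrase" concept, stock Elasticsearch can already combine a wildcard with position-aware matching by wrapping it in span_multi inside a span_near query. The field name below is an assumption:

# Sketch: matches "elon mus*" as an ordered phrase; "body" is an assumed field.
curl -s 'localhost:9200/_search' -d '
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "body": "elon" } },
        { "span_multi": { "match": { "wildcard": { "body": "mus*" } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}'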
The performance of Elasticsearch and Lucene depends heavily on the specific queries and data; there are no silver bullets. Even so, here are some pointers, from basic to advanced:
Limit your search to relevant data only. For example, with daily indices, search only the relevant date range, and avoid range queries/filters on indices that lie entirely inside the searched range (a minimal sketch of this appears after these tips).
Avoid leading wildcards when using wildcard queries, unless you index the terms reversed. Double-ended wildcards are hard to optimize.
Watch for signs of resource exhaustion: is CPU usage on the data nodes consistently spiking? Is IO wait climbing? Look at the GC statistics. You can get these from profilers or through a JMX agent. If ParNewGC consumes more than 15% of the time, check the memory logs. Any SerialGC pauses mean you have a real problem. Not familiar with all this? No matter; this series of blog posts gives a good introduction to JVM performance. Remember that ES and the G1 garbage collector do not play well together.
If you run into garbage collection problems, do not start by tweaking the GC settings; that rarely helps, because the defaults are reasonable. Instead, focus on reducing memory allocations. How exactly? See below.
If you have memory problems but no time to fix them, consider looking into Azul Zing. It is a very expensive product, but simply switching to their JVM can double throughput. In the end we did not use it, because we could not justify the cost.
Consider using caches, both outside Elasticsearch and at the Lucene level. In Elasticsearch 1.x caching can be controlled with filters. Later versions make this a bit harder, but it seems possible to implement your own query type for caching. We will probably do something similar when we upgrade to 2.x.
Check for hotspots (for example, one node receiving all the load). You can try to spread the load using shard allocation filtering, or by migrating shards away from hot nodes with cluster rerouting (see the reroute sketch after these tips). We used linear optimization to reroute automatically, but even a simple automation strategy helps a lot.
Set up a test environment (I prefer my laptop) and load some representative data from production (at least one shard is recommended). Apply load by replaying production queries (this is harder). Use the local setup to test the resource consumption of the requests.
In combination with the above, run a profiler against the Elasticsearch process. This is the most important item on this list.
We use Java Flight Recorder, both through Java Mission Control and with VisualVM. People who try to guess at performance problems (including paid consultants/support) are wasting their time (and yours). Find out which parts of the JVM consume time and memory, then explore the Elasticsearch/Lucene source code to see which code is being executed or allocating memory.
Once you know which part of a request causes the slowness, you can try to optimize it by changing the request (for example, changing the execution hint for term aggregations, or switching query types). Changing the query type or the query order can have a huge effect. If that is not enough, you can also try to optimize the ES/Lucene code. This may sound extreme, but it reduced our CPU consumption by a factor of 3-4 and our memory usage by a factor of 4-8. Some changes were subtle (such as the indices query), while others required us to completely rewrite query execution. The resulting code is heavily tied to our query patterns, so it may or may not be useful to others; we have therefore not open-sourced it so far. It might, however, be good material for a future blog post.
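As promised in the tips above, here is a minimal sketch of limiting a search to the relevant daily indices and date range; the index names, the published field and the dates are assumptions made for the example:

# Sketch only: names and dates are illustrative, not our production values.
# Query only the daily indices inside the range, then filter on the timestamp;
# for indices that lie entirely inside the range, the date filter adds nothing.
curl -s 'localhost:9200/social-2018.02.05,social-2018.02.06/_search' -d '
{
  "query": {
    "filtered": {
      "query": { "query_string": { "query": "Tesla AND \"Elon Musk\"" } },
      "filter": { "range": { "published": { "gte": "2018-02-05", "lt": "2018-02-07" } } }
    }
  }
}'

And for the hotspot tip, shards can be moved by hand with the cluster reroute API; the index and node names here are made up:

# Sketch: move shard 0 of one daily index from a hot node to a less loaded one.
curl -s -XPOST 'localhost:9200/_cluster/reroute' -d '
{
  "commands": [
    { "move": { "index": "social-2018.02.06", "shard": 0, "from_node": "data-node-17", "to_node": "data-node-42" } }
  ]
}'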
Chart: response times with and without the rewritten Lucene query execution. It also shows that nodes no longer run out of memory several times a day.
By the way, since I know the question will come up: from our last performance test we know that upgrading to ES 2.x would improve performance slightly, but it would not change anything fundamentally. That said, if you have migrated from an ES 1.x cluster to an ES 2.x cluster, we would be happy to hear how you did it.
If you have read this far, you really love Elasticsearch (or at least you really need it). We would be happy to learn from your experience and anything you can share. Feel free to leave feedback and questions in the comments section.
Original English link: http://underthehood.meltwater.com/blog/2018/02/06/running-a-400+-node-es-cluster/