How to Carry Out Elasticsearch Cluster Operations and Maintenance



Meltwater processes millions of posts per day, so it needs a storage and retrieval technology that can handle this magnitude of data.

We have been loyal Elasticsearch users since version 0.11.X. After some twists and turns, we still believe it was the right technology choice.

Elasticsearch is used to support our primary media monitoring application, where customers can retrieve and analyze media data such as news articles, (public) Facebook posts, Instagram posts, blogs, and tweets. We collect this content through a mix of APIs and crawling, and lightly process it so that it can be searched with Elasticsearch.

We'll share what we learned, how to tune Elasticsearch, and some pitfalls to get around.

Amount of data

A considerable amount of news and tweets is generated every day; at peak times we need to index about 3 million editorial articles and nearly 100 million social posts per day. Editorial data is kept searchable for a long time (back to 2009), while social post data is kept for about 15 months. The primary shards currently use about 200 TB of disk space, and the replicas about 600 TB.

Our business receives about 3,000 requests per minute. All requests go through a service called search-service, which in turn performs all interactions with the Elasticsearch cluster. Most of the queries are complex ones that power dashboards and news feeds. For example, a customer might be interested in Tesla and Elon Musk but want to exclude everything about SpaceX or PayPal. Users can use a flexible syntax similar to the Lucene query syntax, for example:

Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)

Our longest such query is over 60 pages long. The point is: at 3,000 requests per minute, none of these queries is as simple as searching for "Barack Obama" on Google; they are terrible beasts, and the ES nodes have to churn hard to find the matching document sets.
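For illustration only, a rule like this could be expressed as a query_string query, which understands the Lucene-like syntax shown above. The index name, field name, and direct-to-ES call below are assumptions; in production all traffic goes through search-service.

import requests

# Hypothetical sketch: submit the Lucene-style rule as a query_string query.
# "editorial-2018.01.15" and "body" are made-up names for illustration.
rule = 'Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)'
resp = requests.post(
    "http://localhost:9200/editorial-2018.01.15/_search",
    json={"query": {"query_string": {"query": rule, "default_field": "body"}}},
)
print(resp.json()["hits"]["total"])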

Version

We are running a custom version of Elasticsearch 1.7.6. The only difference from the stock 1.7.6 release is that we backported roaring bitmaps/bitsets (http://suo.im/5bE6od) as a caching mechanism. The feature was backported from Lucene 5 to Lucene 4, matching the Lucene version used by ES 1.X. The default bitset cache in Elasticsearch 1.X is very expensive for sparse results, but this has been optimized in Elasticsearch 2.X.

Why not run a newer version of Elasticsearch? The main reason is the difficulty of upgrading. Rolling upgrades between major releases are only supported from ES 5 to 6 (rolling upgrades from ES 2 to 5 should also work, but we have not tried them). So we can only upgrade with a full cluster restart. Downtime is hard for us to accept, and while we could probably handle the 30-60 minutes a restart would take, what really worries us is that there is no real way to roll back if something fails.

So far we have chosen not to upgrade the cluster. We would like to, but there are more urgent tasks at the moment. How the upgrade will actually be carried out has not been decided; it is likely that we will choose to create a new cluster rather than upgrade the existing one.

Node configuration

We have been running the primary cluster on AWS since June 2017, using i3.2xlarge instances as data nodes. We previously ran the cluster in a COLO (co-located data center), but migrated to AWS to shorten the lead time when machines go down and to make us more flexible when scaling up and down.

We run 3 master-eligible nodes in different availability zones and set discovery.zen.minimum_master_nodes to 2. This is a very common strategy for avoiding the split-brain problem (https://qbox.io/blog/split-brain-problem-elasticsearch).

Keeping storage below 80% capacity and counting more than 3 replicas of our data puts us at 430 data nodes. The original plan was to tier the data, storing older data on slower disks, but because we have relatively little data older than 15 months (only editorial data, since we drop older social data), this did not pay off. The monthly hardware cost is much higher than running in COLO, but the cloud lets us scale the cluster to roughly twice its size in almost no time.

You may ask why we chose to manage and maintain the ES cluster ourselves. We did consider hosted solutions, but decided to run it ourselves: AWS Elasticsearch Service (http://suo.im/4PLuXa) gives users too little control, and Elastic Cloud (https://www.elastic.co/cn/cloud) would cost 2-3 times more than running the cluster directly on EC2.

To protect ourselves against an availability zone outage, the nodes are spread across all three availability zones of eu-west-1. We use the AWS plugin (http://suo.im/5qFQEP) for this configuration. It provides a node attribute called aws_availability_zone, and we set cluster.routing.allocation.awareness.attributes to aws_availability_zone. This ensures that replicas are stored in different availability zones as far as possible, and that queries are routed to nodes within the same availability zone whenever possible.
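As a rough sketch (not our exact setup), the awareness attribute can also be applied dynamically through the cluster settings API; the localhost endpoint below is an assumption.

import requests

# Minimal sketch: tell the allocator to spread replicas across the
# aws_availability_zone attribute provided by the AWS plugin.
requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {
        "cluster.routing.allocation.awareness.attributes": "aws_availability_zone"
    }},
)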

These instances run Amazon Linux with the local ephemeral storage mounted as ext4, and have about 64 GB of memory. We allocate 26 GB of heap to the ES nodes and leave the rest for the disk cache. Why 26 GB? Because of JVM black magic around heap sizes (https://www.elastic.co/blog/a-heap-of-trouble): roughly speaking, keeping the heap well below 32 GB lets the JVM keep using compressed object pointers.

We use Terraform (https://www.terraform.io/) auto-scaling groups to provision the instances and Puppet (https://puppet.com/) for all installation and configuration.

Index structure

Because our data and our queries are time-based, we use time-based indexing (http://suo.im/547GbE), similar to the ELK (Elasticsearch, Logstash, Kibana) stack (https://www.elastic.co/elk-stack). It also lets us store different types of data in different indices, so editorial documents and social documents end up in separate daily indices. This allows us to drop only the social indices when needed and adds some query optimizations. Each daily index runs with two shards.
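As a sketch of how such daily indices could be set up (the template name, index pattern, and replica count below are illustrative assumptions, using the ES 1.X template API):

import requests

# Hypothetical index template: every index matching "social-*" (e.g. a daily
# "social-2018.01.15") gets two primary shards.
requests.put(
    "http://localhost:9200/_template/social-daily",
    json={
        "template": "social-*",  # ES 1.X-style index name pattern
        "settings": {
            "number_of_shards": 2,
            "number_of_replicas": 3,  # placeholder; the real replica count differs
        },
    },
)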

This setup produces a large number of shards (close to 40k). With this many shards and nodes, some cluster operations become awkward. For example, deleting indices appears to be a bottleneck for the cluster master, which has to push the cluster state to all nodes. Our cluster state is about 100 MB, although TCP compression reduces it to about 3 MB (you can inspect your own cluster state with curl localhost:9200/_cluster/state/_all). The master still has to push about 1.3 GB per change (430 nodes x 3 MB of state), and of that 1.3 GB, about 860 MB has to be transferred between availability zones, i.e. essentially over the public internet. This takes time, especially when deleting hundreds of indices. We hope newer versions of Elasticsearch improve this, starting with ES 2.0's support for sending only diffs of the cluster state (http://suo.im/547UyM).
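A quick way to check these numbers on your own cluster, as a rough sketch (gzip here only approximates the wire-level compression):

import gzip
import requests

# Fetch the full cluster state and estimate its raw and compressed size.
state = requests.get("http://localhost:9200/_cluster/state/_all").content
print("uncompressed: %.1f MB" % (len(state) / 1e6))
print("compressed:   %.1f MB" % (len(gzip.compress(state)) / 1e6))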

Performance

As mentioned earlier, our ES cluster needs to handle some very complex queries in order to meet customer retrieval needs.

We've done a lot of performance work over the last few years to cope with the query load. To be fair, we should also share how performance testing this ES cluster has gone, as the following quote shows.

Unfortunately, less than a third of the queries managed to complete successfully before the cluster went down. We believe the test itself is what brought the cluster down.

--Excerpt from the first performance test on the new ES cluster platform using real queries

To keep query execution under control, we developed a plugin that implements a number of custom query types. These query types provide features and performance optimizations that are not available in stock Elasticsearch. For example, we implemented wildcards inside phrases, executed as SpanNear queries; another optimization rewrites "*" to a match-all query; and there are a number of others.
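Our plugin is not open source, but stock Elasticsearch can express a simpler form of the same idea, a wildcard inside a near/phrase constraint, by wrapping a wildcard query in span_multi inside a span_near. The field and index names below are illustrative.

import requests

# Sketch using stock query types (not our plugin): match "elon" followed
# immediately by any term starting with "mus".
query = {"query": {"span_near": {
    "clauses": [
        {"span_term": {"body": "elon"}},
        {"span_multi": {"match": {"wildcard": {"body": "mus*"}}}},
    ],
    "slop": 0,
    "in_order": True,
}}}
requests.post("http://localhost:9200/editorial-*/_search", json=query)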

The performance of Elasticsearch and Lucene depends heavily on your specific queries and data, and there is no silver bullet. Even so, here are some pointers, from the basics to the more advanced:

Limit your search to relevant data only. For example, with daily indices, search only the indices in the relevant date range instead of putting a range query/filter across all of them.
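A tiny sketch of that tip, assuming hypothetical daily index names of the form editorial-YYYY.MM.DD:

from datetime import date, timedelta

# Expand a date range into an explicit list of daily indices so the query
# never touches indices outside the requested window.
def daily_indices(start, end, pattern="editorial-%Y.%m.%d"):
    day, names = start, []
    while day <= end:
        names.append(day.strftime(pattern))
        day += timedelta(days=1)
    return ",".join(names)

# prints the seven comma-separated index names for 2018-01-01 through 2018-01-07
print(daily_indices(date(2018, 1, 1), date(2018, 1, 7)))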

Avoid leading wildcards in wildcard queries, unless you also index the terms reversed. Double-ended wildcards are hard to optimize for.
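The reversed-term trick looks roughly like this (Lucene also ships a reverse token filter for the same purpose); the field layout here is an assumption for illustration.

# Store a reversed copy of each term alongside the original, so a leading
# wildcard such as "*musk" becomes a cheap trailing wildcard "ksum*" on the
# reversed field.
def reversed_terms(text):
    return " ".join(word[::-1] for word in text.lower().split())

print(reversed_terms("Elon Musk"))  # -> "nole ksum"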

Watch for signs of resource saturation. Is CPU usage on the data nodes continuously spiking? Is IO wait climbing? Look at the GC statistics, which you can get from profiling tools or through JMX agents. If ParNew GC takes more than 15% of the time, check the memory logs. Any Serial GC pauses mean you have a real problem. Not too familiar with this? No problem, this blog series gives a good introduction to JVM performance (http://suo.im/4AJgps).
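One easy way to get at the GC numbers without attaching anything to the JVM is the nodes stats API; a minimal sketch:

import requests

# Print per-node, per-collector GC counts and total collection time.
stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    for name, gc in node["jvm"]["gc"]["collectors"].items():
        print(node["name"], name,
              gc["collection_count"], "collections,",
              gc["collection_time_in_millis"], "ms total")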

Remember that ES and the G1 garbage collector are not an ideal combination (http://suo.im/4WBTA5).

If you run into garbage collection problems, do not start by tweaking GC settings; the defaults are usually reasonable. Instead, focus on reducing memory allocations. How exactly? See below.

If you have memory problems but no time to fix them, consider looking into Azul Zing. It is an expensive product, but simply running on its JVM can double throughput. In the end we did not use it, because we could not justify the cost.

Consider using caches, both external to Elasticsearch and at the Lucene level. In Elasticsearch 1.X caching can be controlled with filters. Later versions make this a bit harder, but it seems possible to reimplement the query types you rely on so that they cache. We may do something like that when we upgrade to 2.X.
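In 1.X that control is per filter; a sketch of a filtered query with an explicitly cached terms filter (the field and index names are illustrative):

import requests

# ES 1.X-style filtered query whose terms filter is explicitly cached.
query = {"query": {"filtered": {
    "query": {"query_string": {"query": 'Tesla AND "Elon Musk"',
                               "default_field": "body"}},
    "filter": {"terms": {"language": ["en", "de"], "_cache": True}},
}}}
requests.post("http://localhost:9200/editorial-*/_search", json=query)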

Check whether you have hot spots (e.g. one node carrying most of the load). You can try to even out the load with shard allocation filtering (http://suo.im/4IfruL), or migrate shards yourself with cluster rerouting (http://suo.im/5ja7cU). We use linear optimization for automatic rerouting, but even simple automated strategies can help; a sketch follows below.
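A sketch of manual hot-spot relief with the reroute API; the index, shard number, and node names are placeholders, not our real topology.

import requests

# Move one shard off an overloaded node onto a quieter one.
requests.post(
    "http://localhost:9200/_cluster/reroute",
    json={"commands": [{"move": {
        "index": "social-2018.01.15",
        "shard": 0,
        "from_node": "data-node-17",
        "to_node": "data-node-42",
    }}]},
)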

Set up a test environment (I prefer my laptop) and load some representative data from production into it (at least one full shard is recommended). Put it under pressure by replaying production queries (this is the harder part). Use this local setup to test the resource consumption of individual requests.

In combination with the points above, attach a profiler to the Elasticsearch process. This is the most important item on this list.

We use both the flight recorder via Java Mission Control (http://suo.im/4zYEsP) and VisualVM (http://suo.im/4AJeIM). People who try to guess their way through performance issues (including paid consultants/support) are wasting their time (and yours). Find out which parts of the JVM consume time and memory, then dig into the Elasticsearch/Lucene source code to see which part of the code is executing or allocating that memory.

Once you know which part of a request causes the slow response, you can try to optimize it by changing the request (for example, changing the execution hint for the terms aggregation (http://suo.im/4WBUJx), or switching query types). Changing the query type or the query order can have an even larger effect. If that does not help, you can try optimizing the ES/Lucene code itself. That may sound extreme, but it reduced our CPU consumption by a factor of 3-4 and memory usage by a factor of 4-8. Some of our modifications are minor (such as the indices query (http://suo.im/4WBUR7)), while others required us to rewrite query execution entirely. The resulting code depends heavily on our query patterns, so it may or may not be useful to others. So far we have not open-sourced it, but it might be good material for a future post.
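As a sketch of the request-level tweak mentioned above, here is a terms aggregation with an explicit execution hint; the field and index names are illustrative.

import requests

# Force the terms aggregation to build buckets per document ("map") instead of
# relying on ordinals; which hint wins depends entirely on your data.
agg = {"size": 0, "aggs": {"top_sources": {"terms": {
    "field": "source",
    "execution_hint": "map",
}}}}
requests.post("http://localhost:9200/editorial-*/_search", json=agg)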

Chart description: response times for Lucene query execution with and without our rewrite. The rewrite also means we no longer have nodes running out of memory several times a day.

By the way, since we know the question will come up: from our last round of performance tests we know that upgrading to ES 2.X would improve performance slightly, but it would not change the overall picture. That said, if you have migrated an ES 1.X cluster to ES 2.X, we would love to hear how you did it.

If you have read this far, you must love Elasticsearch (or at least really need it). We would love to hear about your experiences and anything else worth sharing. Feel free to leave your feedback and questions in the comments section.

