This article analyzes the ElasticSearch platform architecture upgrade at Didi: how the Didi ElasticSearch team migrated all of its clusters from ElasticSearch 2.3.3 to 6.6.1 and restructured the platform along the way.
Background: Sisyphus pushing the stone
The Didi ElasticSearch team began building its ElasticSearch platform in 2016 and started serving users in June of that year, choosing what was then the latest release, ElasticSearch 2.3.
Three years on, the ElasticSearch ecosystem has grown rapidly: Elastic has gone public, and ElasticSearch's DB-Engines score has climbed from 88 to 148, moving it from 11th to 7th place.
In that time, ES has shipped three major versions and dozens of minor ones, and has recently released 7.x.
Over those three years, the Didi ElasticSearch platform has launched four major services on top of ElasticSearch: log retrieval, real-time MySQL snapshots, a distributed document database, and a search engine.
Today the platform serves 1,200 applications across the group, including core real-time business scenarios such as orders, customer service, finance, and other core business lines. The team operates more than 30 ES clusters, with peak write throughput of 15 million TPS and query load of 20,000 QPS.
The rapid growth of the business affirms the Didi ElasticSearch team's work, but it also brings significant challenges and pressure. The biggest constraint on the platform's future development is the old ES version, for the following reasons.
① The community no longer maintains the old version. ElasticSearch 2.3.3 is so old that the ES community no longer maintains it: problems we hit on 2.x go unresolved, submitted issues are not processed, and submitted code is not accepted.
On top of 2.3.3 we also had to fix many problems in ES itself, such as a memory leak caused by master metadata-update timeouts and a TCP protocol field overflow.
Unable to interact with the community, the team's work goes unrecognized there; in the long run we would only drift further from the ES ecosystem, and our voice in the ES technology circle would grow ever weaker.
② New-version features are out of reach. The past three years have been three years of huge growth for the ES ecosystem, and ES itself has improved greatly in both functionality and performance.
For example, the BM25 scoring algorithm is now used by default and gives better relevance; Lucene's handling of sparse doc values has improved, saving disk space; and the new frozen indices capability can significantly reduce ES memory overhead.
Many of these features fit the ElasticSearch platform's scenarios well, but the large version gap kept us from enjoying the dividends of that technical progress.
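To make the frozen-indices point concrete, here is a minimal sketch of how that feature, introduced in ES 6.6, is used. The cluster address and index name are assumptions for illustration, not part of the Didi platform.

```python
# A minimal sketch (not Didi's platform code) of the frozen-indices feature
# introduced in ES 6.6: freezing an index drops its in-memory structures and
# reloads them transiently per search, trading latency for heap savings.
import requests

ES = "http://localhost:9200"          # assumed local 6.6.x cluster
INDEX = "logs-2019.05.01"             # hypothetical index name

# Freeze a read-only, rarely-queried index to release its heap footprint.
requests.post(f"{ES}/{INDEX}/_freeze").raise_for_status()

# Frozen indices are excluded from normal searches unless explicitly requested.
resp = requests.get(
    f"{ES}/{INDEX}/_search",
    params={"ignore_throttled": "false"},     # opt in to searching frozen data
    json={"query": {"match_all": {}}, "size": 1},
)
print(resp.json()["hits"]["total"])

# Unfreeze if the index becomes hot again.
requests.post(f"{ES}/{INDEX}/_unfreeze").raise_for_status()
```

Frozen indices keep their data searchable while releasing almost all of their heap footprint, which is exactly the kind of memory saving referred to above.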
On one hand, the rapidly growing business demands richer functionality, stronger performance, lower cost, and more stable service; on the other, we were drifting further and further from the latest technology in the industry, and the team's value kept shrinking, gradually reducing it to a pseudo-engine team that could only serve business requests. The whole team's situation was like Sisyphus pushing his stone.
Either we face the difficulty head-on, overcome it, upgrade the entire fleet to the latest version in one push, get the stone over the top of the mountain and then travel light; or we keep struggling to prop things up, stumbling along under the dual pressure of business and engine work.
In the end, the Didi ElasticSearch team chose to restructure the Didi ElasticSearch platform and upgrade every ES cluster it maintains to the latest version.
Difficulty: I was really at a loss when I drew my sword and looked around
The ideal is rich, the reality is harsh; making up your mind is easy, but putting it into practice is hard:
The 2.3.3 and 6.6.1 protocols are incompatible, and 6.6.1 does not support the TCP interface. What about the users who query over TCP? Ask them to change their code one by one? When would that ever be finished?
Some of the fields returned by 2.3.3 and 6.6.1 differ, and some query syntax is incompatible. How do we make this transparent to users, or do we simply force them to accept the change?
The Lucene file formats of 2.3.3 and 6.6.1 are different, so there is no way to upgrade in place; we would have to build new clusters and double-write all the data.
The mapping formats of 2.3.3 and 6.6.1 are not the same: 6.6.1 no longer supports multiple types per index, so the existing data cannot simply be moved over (see the sketch after this list).
The Didi ElasticSearch platform does not support querying multiple versions of an index at the same time, users have odd query habits, and the many wildcard (*) queries are beyond your control.
There are so many users with such different usage patterns. How do we communicate with them and drive the migration, how do we shield them from the impact, and how do we manage expectations?
Even if we do relocate and upgrade, where do we find that many machines? On top of that, a data center is being decommissioned and its machines have to be returned.
Dozens of clusters and thousands of nodes have to be deployed, built, and restarted, and thousands of machines have to be moved. Go slowly and it will never finish; go fast and what if something breaks?
Even with a double-write upgrade, how do we know whether anything goes wrong along the way: whether data is lost, whether users' query results stay consistent, whether functionality and performance meet expectations? How do we verify all of this?
With this much data, this many users, so few resources, and heavy pressure on business stability, it looked like the work could not be finished within the year.
...
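To illustrate the mapping incompatibility above, the sketch below shows how a 2.x index with two mapping types has to be reshaped for 6.x, which allows only a single type per index. The index and field names are hypothetical; the custom discriminator field is one common approach from the official migration guidance, not necessarily the one Didi used.

```python
# Hypothetical example of the 2.x -> 6.x mapping reshaping problem.
# ES 2.3.3 allowed several mapping types per index; ES 6.x allows only one,
# so multi-type indices must be split or merged behind a discriminator field.

# A 2.3.3-style index with two types (rejected by 6.x):
mapping_2x = {
    "mappings": {
        "order": {
            "properties": {
                "order_id": {"type": "long"},
                "amount": {"type": "double"},
            }
        },
        "shipment": {
            "properties": {
                "order_id": {"type": "long"},
                "carrier": {"type": "string", "index": "not_analyzed"},
            }
        },
    }
}

# One 6.6-compatible reshaping: a single type with a custom "doc_kind"
# field standing in for the old mapping type, so both document kinds can
# share one index and queries filter on doc_kind instead of the type.
mapping_6x = {
    "mappings": {
        "_doc": {
            "properties": {
                "doc_kind": {"type": "keyword"},   # replaces the 2.x type
                "order_id": {"type": "long"},
                "amount": {"type": "double"},
                "carrier": {"type": "keyword"},
            }
        }
    }
}
```

Reshaping like this cannot be done in place; the data has to be reindexed into new indices, which is part of why the upgrade required double-writing.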
When we first decided on the cross-version upgrade, these were the problems in front of us; any one of them, left unsolved, would seriously hinder the upgrade.
Thinking: the talent heaven gave us must have its use
In the initial stage many problems were tangled together, and we had to clarify each one's importance, urgency, scope of impact, and interdependencies. Through analysis and grouping, we summarized them into four major problem domains:
After defining the problem domains, we discussed concrete implementation plans and steps, and distilled them into the following four workstreams that could actually be driven forward:
First, the architecture upgrade: resolve the 2.x/6.x incompatibilities on the engine side, and handle all the incompatibilities in protocol, query syntax, mapping, and so on at the platform layer.
At the same time, we developed an ES Java SDK to work around 6.x's lack of the TCP interface; it is used in the same way as the original ES Java client, so users only need to modify their pom.
This covers multi-version support in the Arius platform, multi-version compatibility in the Gateway, user SDK development, AMS data collection, and so on.
Second, operations: solve the management and control problems of building, deploying, and restarting many clusters during the migration, and improve operational convenience and upgrade efficiency; details follow below.
Third, resources: secure the large amount of machine resources needed for relocation and upgrade, prepare fully for upgrading many clusters at once, and at the same time meet the data-center decommissioning requirement to return machines.
This covers index retention optimization, hot/cold data separation, mapping optimization, FastIndex, and so on.
Finally, execution: with all the preparation done, actually drive the upgrade forward. Details include performance stress testing, resource evaluation, batched double-writing, query playback, and some unexpected pits we fell into and climbed out of; details follow below.
In practice: battle-worn armor on the sandy battlefield
After sorting out the dependencies, resource consumption, and bottlenecks of each step in the upgrade, we designed solutions in three areas: architecture, resources, and execution. The main points are as follows:
Architecture
① Architecture transformation for multi-version ES support on the Didi ElasticSearch platform
First, we completed the architecture upgrade that lets the Didi ElasticSearch platform support multiple ES versions. The key points are:
Arius Gateway compatibility with cross-version query differences and with index access across high- and low-version clusters, so that query results stay transparent to users throughout the upgrade.
Development of the elasticsearch-didi-internal-client SDK, which shields users from the ES TCP/HTTP query differences and works around ES 6.x's lack of the TCP interface; existing 2.x users only need to change one line in their pom to switch to the new version.
Multi-version support in the Didi ElasticSearch platform architecture and in Arius Admin.
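As an illustration of the kind of compatibility work the Gateway does, the sketch below rewrites one well-known 2.x construct, the filtered query removed in ES 5.0, into the bool/filter form accepted by 6.6.1. This is our own minimal sketch, not the actual Arius Gateway code.

```python
# A minimal sketch of gateway-style query rewriting, assuming one concrete
# incompatibility: the 2.x `filtered` query was removed in ES 5.0 and must be
# rewritten as a `bool` query with `must` + `filter` for a 6.6.1 cluster.
# Illustrative only; not the actual Arius Gateway implementation.

def rewrite_filtered(node):
    """Recursively rewrite `filtered` queries into `bool` queries."""
    if isinstance(node, dict):
        if "filtered" in node:
            filtered = node.pop("filtered")
            bool_query = {}
            if "query" in filtered:
                bool_query["must"] = rewrite_filtered(filtered["query"])
            if "filter" in filtered:
                bool_query["filter"] = rewrite_filtered(filtered["filter"])
            node["bool"] = bool_query
        else:
            for key, value in node.items():
                node[key] = rewrite_filtered(value)
    elif isinstance(node, list):
        return [rewrite_filtered(item) for item in node]
    return node

# A 2.3-style request body as a legacy user might still send it:
legacy_body = {
    "query": {
        "filtered": {
            "query": {"match": {"message": "timeout"}},
            "filter": {"term": {"status": "error"}},
        }
    }
}

print(rewrite_filtered(legacy_body))
# -> {'query': {'bool': {'must': {'match': ...}, 'filter': {'term': ...}}}}
```

A real gateway has to cover many such differences (returned fields, aggregations, mapping syntax), but the pattern of rewriting request bodies in flight is the same.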
② ES multi-cluster management and control based on the elastic cloud
The Didi ElasticSearch team currently operates more than 30 ES clusters and 5,000+ ES nodes. The clusters are large, the scenarios complex, and the cost of managing and controlling them high.
For this reason, we designed and built ECM (ElasticSearch Cluster Manager), a system for deploying, restarting, scaling, and managing the configuration of ES clusters.
Furthermore, 80% of our ES nodes run on the elastic cloud, and its flexibility and efficiency made the relocation and upgrade process considerably more efficient.
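The kind of operation ECM automates can be sketched as a generic rolling-restart routine built only on standard ES APIs (cluster settings and cluster health); the endpoint and the restart hook are assumptions, and this is not ECM's actual implementation.

```python
# A generic rolling-restart sketch of the kind of operation ECM automates.
# Not ECM itself; it only uses standard ES APIs. Host names are assumptions.
import requests

ES = "http://es-master.example.com:9200"   # hypothetical cluster endpoint

def set_allocation(mode):
    """Enable/disable shard allocation around a node restart."""
    requests.put(
        f"{ES}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.enable": mode}},
    ).raise_for_status()

def wait_for_status(status="green", timeout="10m"):
    """Block until the cluster reaches the desired health status (raises on timeout)."""
    requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": status, "timeout": timeout},
    ).raise_for_status()

def rolling_restart(nodes, restart_node):
    """restart_node is a callable that actually restarts one node (e.g. via
    the elastic cloud's own API); it is a placeholder here."""
    for node in nodes:
        set_allocation("primaries")   # avoid shard shuffling while the node is down
        restart_node(node)
        set_allocation("all")
        wait_for_status("green")      # only move on once the cluster recovers
```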
③ Construction of the ES service metadata system
We built AMS (Arius MetaData Service) to collect and analyze all kinds of data from every ES cluster, node, and index.
This includes capacity information (cluster, node, template, index, tenant), TPS/QPS information (cluster, node, template, index, tenant), operational records, query statements, query templates, query results and hit-rate analysis, and so on.
On top of this base data we built a comprehensive ES operational metrics system that gives a full view of runtime information at the cluster, node, index, and tenant levels.
The detailed data provides the basis for the subsequent ES cost optimization; see the data-driven ES tiered storage system below, which let us build a systematic approach to saving ES costs.
The detailed data also provides the basis for the query traffic playback comparison used during the upgrade; see the query traffic playback comparison system based on ES service metadata below, which let us quickly validate upgrade results and improve upgrade efficiency and stability.
AMS is also responsible for data reliability, ensuring that the data it produces is timely and accurate. As a result, the services that depend on AMS's data analysis, such as tiered storage, capacity planning, the playback system, cost billing, cluster health checks, and index health scoring, only need to focus on implementing their own logic.
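A collection loop of the kind AMS runs can be sketched with the public _cat and _stats endpoints; this is our own illustration with an assumed cluster address, not AMS's actual collectors.

```python
# A sketch of the kind of per-index metadata collection AMS performs,
# using only public ES endpoints; not AMS's actual collectors.
import requests

ES = "http://es-cluster.example.com:9200"   # hypothetical cluster endpoint

def collect_index_capacity():
    """Per-index capacity snapshot from the _cat API."""
    rows = requests.get(
        f"{ES}/_cat/indices",
        params={"format": "json", "bytes": "b"},
    ).json()
    return {
        row["index"]: {
            "docs": int(row["docs.count"]),
            "store_bytes": int(row["store.size"]),
        }
        for row in rows
        if row.get("docs.count") is not None   # skip closed indices
    }

def collect_index_throughput():
    """Cumulative indexing/search counters from the _stats API; a collector
    would sample these periodically and diff them to derive TPS/QPS."""
    stats = requests.get(f"{ES}/_stats/indexing,search").json()["indices"]
    return {
        name: {
            "index_total": body["total"]["indexing"]["index_total"],
            "query_total": body["total"]["search"]["query_total"],
        }
        for name, body in stats.items()
    }
```

Sampling the cumulative counters periodically and diffing them yields the per-index TPS/QPS figures mentioned above.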
Resources
After resolving the architecture and compatibility issues, we were confident that a live cluster could be upgraded to the new version.
However, because the version gap was too large for a rolling upgrade on the original clusters, we had to relocate and upgrade with double-writing, which made the buffer machines needed for the upgrade the biggest constraint on its progress. So our next focus was saving resources and improving resource utilization.
Through internal and external optimization and technical transformation, we not only covered the machines needed for the version upgrades (at the peak, three clusters were being upgraded at the same time) but also returned nearly 400 machines, saving more than 800,000 yuan per month.
① Data-driven ES tiered storage system
Based on AMS's statistics and analysis of each index's size, data volume, query volume, query conditions, query time ranges, and returned results, we can precisely characterize the scenarios each index serves and the way it is queried.
For example, the time ranges over which an index is queried most frequently, and the fields it is actually searched on. On the basis of this analysis, we optimized each index's mapping, retention period, and hot/cold data placement.
Without affecting user requirements, we saved a cumulative 1 PB of data and moved 700 TB to cold storage, which not only guaranteed enough buffer machines during the upgrade but also allowed us to return nearly 400 physical machines, saving more than 700,000 yuan per month.
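Hot/cold separation of this kind is commonly implemented with ES shard allocation filtering on node attributes; the sketch below shows the standard hot-warm pattern. The node attribute name box_type and the index names are assumptions, not Didi's exact configuration.

```python
# A sketch of the standard hot/warm allocation pattern that underlies
# hot/cold data separation. The node attribute name "box_type" and the index
# names are assumptions; this is not Didi's exact configuration.
import requests

ES = "http://es-cluster.example.com:9200"   # hypothetical cluster endpoint

# New (hot) indices are created pinned to hot nodes, i.e. nodes started with
# `node.attr.box_type: hot` in elasticsearch.yml.
requests.put(
    f"{ES}/logs-2019.06.20",
    json={
        "settings": {
            "index.routing.allocation.require.box_type": "hot",
            "number_of_shards": 3,
            "number_of_replicas": 1,
        }
    },
).raise_for_status()

# When an index ages out of the hot window, flipping the same setting to
# "warm" makes ES migrate its shards to cheaper, larger-disk warm nodes.
requests.put(
    f"{ES}/logs-2019.05.01/_settings",
    json={"index.routing.allocation.require.box_type": "warm"},
).raise_for_status()
```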
② ES FastIndex offline data import system
ES FastIndex was originally built to solve the efficiency and resource problems of offline imports for the group's tagging system: more than 30 TB of data has to be synchronized into ES within a short window every day, or that day's business results are affected. The previous approach consumed a large amount of machine resources to meet the deadline.
With the Hadoop-based ES FastIndex offline import system, the same import now takes 2 hours instead of 8.
Machine cost dropped from 40 physical machines (27 for ES, 3 for Kafka, 10 for Dsink) to 30 elastic-cloud nodes (10 physical machines), saving more than 70,000 yuan per month in the tagging scenario alone.
③ ES cluster capacity planning based on resource quota management and control
Improving ES cluster resource utilization is a problem the Didi ElasticSearch team has long faced and worked to solve.
The ES machines the team maintains hold close to 5 PB in total, so every 10% improvement in resource utilization frees about 500 TB of space, which can either be returned or used to serve new demand.
At present, overall ES cluster disk utilization is about 50%, peaking at 60%, with log-cluster disk utilization reaching 69.5% (2019-05-01). Yet cluster resources are very unevenly distributed, disk alarms are frequent, operational pressure is high, and data is occasionally lost.
To address this, we added resource quota control to the original ES capacity-planning algorithm and went into the engine itself to improve node-level capacity planning and resource balance. We expect this to raise overall ES disk utilization by 10 points, to a daily average of 60% and a peak of 70%, without disk alarms or stability problems.
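The imbalance that quota-based planning is meant to remove can be measured with the _cat/allocation API; the sketch below computes per-node disk utilization and its spread. This is our own illustration with an assumed cluster address, not the planning algorithm itself.

```python
# A sketch of measuring per-node disk utilization and its spread, the kind of
# imbalance that quota-based capacity planning is meant to remove. It uses the
# public _cat/allocation API; it is not the planning algorithm itself.
import requests

ES = "http://es-cluster.example.com:9200"   # hypothetical cluster endpoint

rows = requests.get(
    f"{ES}/_cat/allocation",
    params={"format": "json"},
).json()

usage = {
    row["node"]: float(row["disk.percent"])
    for row in rows
    if row.get("disk.percent") is not None   # skip the UNASSIGNED row
}

avg = sum(usage.values()) / len(usage)
print(f"nodes={len(usage)} avg={avg:.1f}% "
      f"min={min(usage.values()):.1f}% max={max(usage.values()):.1f}%")

# Nodes far above the average are rebalancing / planning candidates.
for node, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{node}: {pct:.1f}%")
```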
Execution
With the preparation complete, the cluster upgrades became a step-by-step process. We still ran into some unexpected situations and stepped into a few pits, but overall the process went smoothly.
① Query traffic playback comparison system based on ES service metadata
Using the AMS (Arius MetaData Service) system built earlier, we record and analyze users' query conditions and query results.
During the double-write relocation and upgrade, we replay users' queries against both the old and new clusters and compare the returned results and performance metrics.
Only when the results match and performance is comparable do we consider the upgrade sound; this is what gives us confidence in it.
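A simplified version of this playback comparison might look like the sketch below: each recorded query is replayed against both clusters, and the hit counts, top document IDs, and latency are compared. Endpoints, index names, and the latency threshold are assumptions; this is not the actual playback system.

```python
# A simplified sketch of query playback comparison between the old (2.3.3)
# and new (6.6.1) clusters: replay each recorded query against both, then
# compare hit counts, top document IDs, and latency. Endpoints and the
# latency threshold are assumptions; this is not the actual playback system.
import requests

OLD = "http://es-2x.example.com:9200"    # hypothetical 2.3.3 cluster
NEW = "http://es-6x.example.com:9200"    # hypothetical 6.6.1 cluster

def run(endpoint, index, body):
    resp = requests.post(f"{endpoint}/{index}/_search", json=body).json()
    hits = resp["hits"]
    total = hits["total"]
    return {
        "total": total["value"] if isinstance(total, dict) else total,  # 7.x vs older shape
        "top_ids": [h["_id"] for h in hits["hits"]],
        "took_ms": resp["took"],
    }

def compare(index, body, slow_factor=2.0):
    old, new = run(OLD, index, body), run(NEW, index, body)
    return {
        "same_total": old["total"] == new["total"],
        "same_top_ids": old["top_ids"] == new["top_ids"],
        "not_slower": new["took_ms"] <= slow_factor * max(old["took_ms"], 1),
        "old": old,
        "new": new,
    }

# Replay one recorded query (hypothetical index and query body):
report = compare("order-index", {"query": {"term": {"status": "paid"}}, "size": 10})
print(report)
```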
② ECM-based ES multi-cluster upgrade process
Because the upgrade required double-write relocation, the actual process involved intensive cluster building, relocation, restarts, and similar operations. Thanks to ECM's cluster management capabilities and the elasticity of the elastic cloud, we worked closely with the OPS team and completed the upgrades of multiple clusters in a short time.
③ Analysis of new-version ES features and upgrade performance
ES 6.6.1 provides many new features, and query and write performance have also improved greatly; the projects we have already upgraded confirm this. We will publish a detailed analysis of these features and performance gains and share it.
④ Pitfalls encountered during the ES version upgrade
We also stepped into some pits during the upgrade, such as an OOM caused by unbounded off-heap memory use in the new-version SDK. We recorded all of these problems in detail and will share them.
Harvest: a time will come to ride the long wind and break the waves
After nearly half a year of development and refactoring, and in the course of upgrading the domestic clusters to the new version, we have also made major gains in architecture, product, cost, performance, features, and our own capabilities.
The architecture is clearer
After refactoring, the service architecture of the entire Didi ElasticSearch platform is much clearer, converging into four major applications:
Gateway handles access for query and write requests; user rate limiting, permission verification, and version compatibility are done here.
ECM manages and controls all clusters; cluster building, upgrades, restarts, and cluster-level monitoring and operational analysis are done here.
AMS collects and analyzes the runtime information of all clusters, nodes, and indexes, guarantees data quality, and feeds other data-analysis applications; tiered storage, index health scoring, cluster health checks, query playback, and so on are built on it.
Arius Admin provides core capabilities such as index management, permissions, and resource control. Relying on Admin's core capabilities and AMS's data collection, it also offers two well-designed, pluggable extension services: capacity planning and intelligent alerting.
The four applications have completed functional abstraction, dependency decoupling, and service-oriented transformation. Compared with the five smaller applications that have since been retired (arius-watch, arius-dsl, arius-tools, arius-monitor, arius-mark), overall development efficiency and maintainability have improved greatly after the refactoring.
The product is easier to use.
Based on ES 6.5.1, we completely rebuilt the Didi ElasticSearch user console, which now lets users perform high-frequency operations themselves, such as setting and changing mappings, cleaning data, scaling indexes up and down, migrating indexes, and viewing cost billing.
Next, we will upgrade the Kibana bundled with the Didi ElasticSearch user console to the latest version and customize it, to provide richer and more powerful features.
Lower cost
Previously, the Didi ElasticSearch platform had a capacity-planning algorithm based on index creation rules. Compared with no planning at all, the old algorithm raised overall cluster resource utilization from 30% to about 50%.
But it also had problems: uneven resource distribution, hot spots that could not be found quickly, weak dynamic adaptation, a planning abstraction too limited to apply to the index clusters, and poor operational convenience.
The figure below shows the disk utilization of a log cluster under the old and new capacity planning. After the new capacity planning went live, cluster resources moved in two directions:
Resources in use are more concentrated, utilization is more even across nodes, and overall utilization is higher.
Idle resources are fully released; because the deployment is on the elastic cloud, they can be quickly removed from the cluster and added to the standby pool or to other resource-strapped clusters.
After this series of storage optimizations and resource-utilization work, while still meeting cluster-upgrade and business needs, the resource cost of domestic ES dropped from 3.39 million yuan in February 2019 to 2.59 million yuan in June 2019, and the machine count dropped from 1,658 to 1,321.
As the domestic cluster upgrades are completed and Ceph cold storage matures, more machines will be returned, the cost of running the Didi ElasticSearch platform will keep falling, and we will consider further reducing our pricing.
More powerful performance
The performance gains from the new version show up mainly in two areas:
① Query performance improvement
The figure below compares the customer-service order-list query before and after the upgrade: the 50th-percentile latency dropped from 300 ms to 50 ms, and the 99th percentile from 600 ms to 300 ms.
For a detailed analysis of the performance improvement, see the forthcoming analysis of new ES features and upgrade performance.
② Cluster write performance improvement
The version upgrade alone raises write capacity by 30%: with the same resources, an ES 6.6.1 cluster delivers 30% more write throughput than an ES 2.3.3 cluster.
The figure below compares the log cluster's write TPS before and after the upgrade; cluster write capacity rose from 2.4 million/s to 3.2 million/s.
Outlook: hoist the sail high and brave the open sea
So far, the Didi ElasticSearch team has upgraded all domestic log clusters and 90% of the VIP clusters, and the architecture of the entire Didi ElasticSearch platform has been refactored and upgraded, leaving much more room for development at the ES engine level.
Going forward, we will focus more on engine work to solve the remaining problems at their root, continuing to push in the following directions:
① Larger clusters
In the log scenario, we are trying to push past the limit on the number of nodes a single ES cluster supports, raising it from 200 nodes per cluster to 1,000 nodes.
We expect that with larger clusters we can reduce the number of clusters we run and improve operational efficiency; at the same time, a larger cluster makes it easier and more flexible to raise resource utilization and to absorb traffic bursts and resource hot spots.
② Lower cost
Reducing the cost of using ES and improving resource utilization have always been our goals. In the first half of the year, while completing the cluster upgrades and serving the business well, we also achieved savings of 800,000 yuan per month, cutting overall ES cost by about 25%; in the second half of the year we aim to cut costs by another 10%.
New features provided by ES 6.6.1, such as the frozen-indices mechanism and index sorting, will further reduce resource consumption.
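Index sorting, one of the features mentioned above, is configured when an index is created; a minimal example (hypothetical index and field names, assumed cluster address) looks like this.

```python
# A minimal example of the index sorting feature mentioned above (available
# since ES 6.0): sorting segments on disk by a field can speed up sorted
# queries and enable early termination. Index and field names are made up.
import requests

ES = "http://es-cluster.example.com:9200"   # hypothetical cluster endpoint

requests.put(
    f"{ES}/logs-sorted",
    json={
        "settings": {
            "index.sort.field": "timestamp",
            "index.sort.order": "desc",
        },
        "mappings": {
            "_doc": {
                "properties": {
                    "timestamp": {"type": "date"},
                    "message": {"type": "text"},
                }
            }
        },
    },
).raise_for_status()
```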
③ Faster iteration
Query interference between tenants in an ES cluster has always been a hard problem for the Didi ElasticSearch team. Previously it was mostly handled at the platform level through physical resource isolation, query auditing, and rate limiting, which left resource utilization low and human operational cost high.
Next, we will build a query optimizer of ES's own, similar to MySQL's EXPLAIN, able to do performance analysis and optimization at the level of individual query statements, and we will guarantee query stability at the engine level through query resource isolation per index template and the separation of ordinary and heavy queries.
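For per-statement analysis of this kind, ES already ships a profile option on the search API; the minimal example below (hypothetical index, query, and cluster address) shows that building block, not the optimizer described above.

```python
# ES's built-in search profiling, a building block for per-statement analysis
# like the optimizer described above (which this sketch does not implement).
# Index, query, and endpoint are hypothetical.
import requests

ES = "http://es-cluster.example.com:9200"   # hypothetical cluster endpoint

resp = requests.post(
    f"{ES}/order-index/_search",
    json={
        "profile": True,                      # ask ES to time each query component
        "query": {"bool": {
            "must":   [{"match": {"city": "beijing"}}],
            "filter": [{"range": {"created_at": {"gte": "now-1d"}}}],
        }},
        "size": 0,
    },
).json()

# Print the per-shard breakdown of where query time went.
for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        for q in search["query"]:
            print(shard["id"], q["type"], q["time_in_nanos"])
```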
④ Closer ties with the community
Building on the new ES version, we will keep in closer contact with the community, actively follow the new features and directions it provides, and bring them into Didi for everyone to use.
We will also take a more active part in community building, feed the problems we encounter and solve inside Didi back to the community, contribute more PRs, and grow more ES contributors.