Case Analysis of one-stop search in ElasticSearch

2025-04-24 Update From: SLTechnology News&Howtos

This article presents a case analysis of one-stop search built on ElasticSearch: how ElasticSearch is used at Didi, the stability and cost challenges the platform faces, and its future plans.

Application scenarios of ElasticSearch at Didi

Didi formed a team in April 2016 to solve the performance problems it encountered when using ElasticSearch. The search platform has evolved along with business volume and has now reached a very large scale: more than 3,500 ElasticSearch instances, 5 PB of stored data, peak write throughput above 20 million per second (2000W/s), and nearly 1 billion queries per day.

ElasticSearch is used in a wide range of scenarios at Didi:

Provide engine support for core online search business

Serve as a slave (replica) store for RDS to meet massive data retrieval requirements

Solve the problem of massive log retrieval across the company

Provide data analysis capabilities for security scenarios

Business parties in different scenarios have different requirements for write timeliness, query response time (RT), and overall stability. The platform abstracts the services it provides as index template services, and users can enable the services they need on a self-service basis.
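
As a rough illustration of what an index template service might produce, the sketch below builds an ElasticSearch index template body from a service level. The service-level names, their values, and the function name are hypothetical and are not taken from Didi's platform; only the template keys and the index settings (number_of_shards, number_of_replicas, refresh_interval) are standard ElasticSearch.

```python
import json

# Hypothetical service levels: names and values are illustrative assumptions.
SERVICE_LEVELS = {
    "online": {"refresh_interval": "1s", "number_of_replicas": 2},
    "log": {"refresh_interval": "30s", "number_of_replicas": 1},
}


def build_index_template(name_pattern, service_level, shards):
    """Return an index template body for a (hypothetical) service level."""
    level = SERVICE_LEVELS[service_level]
    return {
        # ElasticSearch 2.x uses "template" for the index name pattern;
        # 6.x and later use "index_patterns" instead.
        "template": name_pattern,
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": level["number_of_replicas"],
                "refresh_interval": level["refresh_interval"],
            }
        },
    }


if __name__ == "__main__":
    body = build_index_template("order_log-*", "log", shards=8)
    print(json.dumps(body, indent=2))
```

A longer refresh_interval trades write visibility delay for indexing throughput, which is one way a template can encode different write-timeliness requirements.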

After internal stress testing, online tuning, and some engine optimization, we have consolidated the best practices into a standard Docker image. Personalized requirements are configured and controlled at the service level of the index template. Some of the optimizations are as follows:

Risks and challenges to platform stability

The very large cluster scale and the variety of scenarios bring great risks and challenges to Didi's ElasticSearch platform. The main ones are as follows:

Online business scenario

Availability must be at least 99.99%, and the business is sensitive to jitter in the 90th-percentile query latency.

The architecture must support multi-active deployment, with requirements on both data consistency and timeliness: eventual consistency must be guaranteed, and updates must become visible within seconds.

Plug-in requirements and index sharding rules differ from one online business to another.

Many independent clusters must be rolling-upgraded quickly and smoothly without affecting online business.
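
To make the 99.99% target and the 90th-percentile sensitivity concrete, here is a small sketch that converts an availability target into a yearly downtime budget and computes a percentile with the nearest-rank method. The function names and the latency samples are made up for illustration.

```python
import math


def downtime_budget_minutes(availability, days=365):
    """Minutes of allowed unavailability over `days` for an availability target."""
    return (1.0 - availability) * days * 24 * 60


def percentile_nearest_rank(samples, q):
    """q-th percentile (0 < q <= 100) of samples using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # 99.99% over a year leaves roughly 52.56 minutes of downtime.
    print(round(downtime_budget_minutes(0.9999), 2))
    latencies_ms = [12, 15, 11, 14, 13, 90, 12, 16, 14, 13]  # made-up samples
    print(percentile_nearest_rank(latencies_ms, 90))
```

Tracking the 90th percentile rather than the mean keeps a single slow outlier from dominating the metric while still exposing latency shifts that affect a meaningful share of queries.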

Quasi-online business scenario

Offline bulk imports need to finish within minutes, but importing 1 billion documents through the real-time write path takes 5 hours, consumes so many online resources that online services become basically unavailable, and makes the import cost too high.

Queries are highly diverse: there are more than 140,000 (14W+) query templates, and a single index can be queried by 100+ applications at the same time. Query stability must be guaranteed in such multi-tenant scenarios.
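
The gap between the 5-hour import and a minute-level target can be quantified with simple arithmetic. In the sketch below the 10-minute target window is an assumption chosen for illustration; the source only says "minutes".

```python
def required_docs_per_second(total_docs, seconds):
    """Sustained write rate needed to import total_docs within the given time."""
    return total_docs / seconds


TOTAL_DOCS = 1_000_000_000

# Observed: 1 billion documents in 5 hours through the real-time write path.
observed_rate = required_docs_per_second(TOTAL_DOCS, 5 * 3600)

# Assumed target: the same import finished in 10 minutes.
target_rate = required_docs_per_second(TOTAL_DOCS, 10 * 60)

speedup = target_rate / observed_rate

if __name__ == "__main__":
    print(f"observed: {observed_rate:,.0f} docs/s")
    print(f"target:   {target_rate:,.0f} docs/s")
    print(f"speedup:  {speedup:.0f}x")
```

Under that assumption the import path would need to be about 30 times faster, which suggests why pushing the load through the online write path is not viable.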

Security and logging scenarios

Real-time writes on the order of 10 million documents per second and PB-level log storage require very large ElasticSearch clusters, but ElasticSearch has its own metadata bottleneck. For details, see the talk shared by a team colleague: https://www.infoq.cn/article/SbfS6uOcF_gW6FEpQlLK.

Query patterns are not fixed, and a single index can hold tens of billions of documents, so the stability risk that unreasonable queries pose to clusters and indexes must be kept under control.

Storage is at the PB level and query frequency is low, yet queries are still expected to return within seconds. Keeping everything on SSD is too expensive, so the overall storage cost must be reduced without changing the query experience much.
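
The requirement above that unreasonable queries must not endanger a cluster could be approached with a pre-flight check on the request body. The sketch below is only one possible illustration and not the platform's actual mechanism: the two rules (deep pagination and leading wildcards) and the function name are assumptions. The 10,000 limit mirrors ElasticSearch's default index.max_result_window, and 10 is the default page size.

```python
MAX_RESULT_WINDOW = 10_000  # ElasticSearch's default index.max_result_window


def find_risks(search_body):
    """Return a list of risk labels found in a search request body (a dict)."""
    risks = []

    # Rule 1 (assumed): reject pagination deeper than the result window.
    if search_body.get("from", 0) + search_body.get("size", 10) > MAX_RESULT_WINDOW:
        risks.append("deep pagination")

    # Rule 2 (assumed): flag wildcard patterns that start with * or ?.
    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "wildcard" and isinstance(value, dict):
                    for pattern in value.values():
                        if isinstance(pattern, dict):
                            pattern = pattern.get("value", "")
                        if isinstance(pattern, str) and pattern.startswith(("*", "?")):
                            risks.append("leading wildcard")
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(search_body.get("query", {}))
    return risks


if __name__ == "__main__":
    risky = {
        "from": 20000,
        "size": 50,
        "query": {"bool": {"must": [{"wildcard": {"msg": "*timeout"}}]}},
    }
    print(find_risks(risky))
```

A gateway in front of the clusters is a natural place for such checks, since it sees every tenant's queries before they reach the engine.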

So how can these problems be solved? You are welcome to discuss them with me face to face at the QCon Global Software Development Conference (Guangzhou).

How to build a search middle platform with low storage cost

In log and security analysis scenarios, storage cost is currently under great pressure; this is a typical write-heavy, read-light workload. We analyzed in depth where the storage cost is consumed. The overall picture is as follows:

Targeting these cost hotspots, we optimized at the architecture level, reducing overall cost by 30% and saving a cumulative 2 PB of storage. The optimizations cover the following aspects:

Separation of storage and index: log text and indexes are stored separately

Automatic optimization of unreasonable index field mappings

Tiered storage of hot and cold data

Migration of ES onto Docker and Ceph
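
Of the items above, hot and cold tiering is commonly implemented in ElasticSearch with shard allocation filtering, where an index setting of the form index.routing.allocation.require.<attribute> pins shards to nodes tagged with a matching custom attribute. The sketch below assumes daily indexes, a 7-day hot window, and a node attribute named box_type; all three are illustrative assumptions, and this is not necessarily how Didi's platform does it.

```python
from datetime import date


def allocation_settings(index_day, today, hot_days=7, attr="box_type"):
    """Return index settings that place an index on the hot or cold tier.

    `attr` is a custom node attribute name (assumed here to be "box_type")
    whose value on each node is either "hot" or "cold".
    """
    age_days = (today - index_day).days
    tier = "hot" if age_days < hot_days else "cold"
    return {f"index.routing.allocation.require.{attr}": tier}


if __name__ == "__main__":
    today = date(2019, 6, 10)
    print(allocation_settings(date(2019, 6, 8), today))
    print(allocation_settings(date(2019, 5, 1), today))
```

Applying the returned settings to an existing index causes ElasticSearch to relocate its shards to nodes carrying the matching attribute, so older data can move from SSD-backed nodes to cheaper storage.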

Future development plans

Benefits the ElasticSearch-based search middle platform has brought to users

Serves 1,200+ platform business parties, including 20+ online P0 applications and 200+ quasi-real-time applications

Onboarding time for the index service has dropped from two weeks to 5 minutes

Service stability is guaranteed: 99.99% for online scenarios and 99.95% for log scenarios

High-frequency operations and maintenance tasks can be completed through one-click self-service, and 90% of problems are resolved within 5 minutes

The overall storage cost is 1/3 that of cloud vendors in the industry

Shortcomings

At present, 90% of Didi's clusters are still on ElasticSearch 2.3.3, so internally fixed bugs and optimizations cannot be synchronized with the community.

At present, the multi-cluster solution, supported by ES-GateWay, meets the needs of business development well. But as the number of clusters grows, version maintenance and upgrades, overall resource utilization, and capacity planning become very difficult.

Development plans

Resolve the "entropy" of the architecture

Break through the engine's metadata bottleneck, improve operations and maintenance efficiency, and reduce cost -> ES-Federation

Push GateWay capabilities down into the engine as plug-ins to reduce intermediate hops, align with the community, and optimize performance.

Improve engine iteration efficiency

Reduce the rolling restart time of a 100-node cluster from 2 days to 1 hour

Solve the "pain" of cross-version upgrades at the architecture level: 2.3.3 -> 6.6.1 (HTTP RESTful).
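
The rolling restart goal above (a 100-node cluster, from 2 days down to 1 hour) implies a tight per-node time budget. The sketch below works through the arithmetic; the parallelism of 10 nodes per batch is an assumption used only to show how batching changes the budget.

```python
def per_batch_budget_seconds(total_seconds, nodes, parallelism=1):
    """Average time available per restart batch when `parallelism` nodes restart together."""
    batches = -(-nodes // parallelism)  # ceiling division
    return total_seconds / batches


NODES = 100
TWO_DAYS = 2 * 24 * 3600
ONE_HOUR = 3600

before = per_batch_budget_seconds(TWO_DAYS, NODES)             # one node at a time
after_serial = per_batch_budget_seconds(ONE_HOUR, NODES)       # one node at a time
after_batched = per_batch_budget_seconds(ONE_HOUR, NODES, 10)  # assumed 10 nodes per batch

if __name__ == "__main__":
    print(f"before:        {before / 60:.1f} min per node")
    print(f"after, serial: {after_serial:.0f} s per node")
    print(f"after, x10:    {after_batched / 60:.0f} min per batch")
```

Restarting strictly one node at a time leaves only 36 seconds per node within an hour, so reaching the target likely requires faster shard recovery, restarting several nodes together, or both.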

Focus on high-value problems

Build a query optimizer for multi-tenant queries, covering cost-based optimization (CBO) and rule-based optimization (RBO)

Data systematization -> data intelligence

Rework ElasticSearch on top of Ceph and Docker to support cloud-native separation of storage and compute.
