The search platform is a PaaS product serving internal search applications and some NoSQL storage scenarios. It helps applications build reasonable and efficient search and multidimensional filtering functions. The platform currently supports more than 100 search services of various sizes, covering nearly 10 billion records of data.
While providing advanced retrieval and big-data interaction capabilities for traditional search applications, the search platform also needs to support filtering over massive data sets, as in commodity management, order retrieval, and fan screening. From an engineering perspective, extending the platform to support such diverse retrieval needs is a huge challenge.
I was the first employee of the search team, and I have been fortunate to design and develop most of the platform's features so far. Our team is mainly responsible for the performance, scalability, and reliability of the platform, while reducing its operation and maintenance costs and the development costs of the business as much as possible.
Elasticsearch
Elasticsearch is a highly available distributed search engine. The technology is mature and stable, and the community is active, so we chose Elasticsearch as the basic engine when building the search system.
Architecture 1.0
Back in 2015, the search system running in our production environment was an Elasticsearch cluster composed of several high-spec virtual machines, mainly hosting the commodity and fan indexes. Data was synchronized from the database to Elasticsearch through Canal. The general architecture was as follows:
With this approach, when the business volume was small, a synchronization application could be created quickly and cheaply for each business index, which suited a period of rapid business growth. However, each synchronization program was a standalone application: it was coupled to the business database address and had to keep up with rapid changes on the business side, such as database migration and table sharding. Moreover, multiple Canal instances subscribing to the same database at the same time also degraded database performance.
In addition, the Elasticsearch cluster had no physical isolation. During one promotion, the fan data volume grew so large that the Elasticsearch process exhausted its heap memory and hit an OOM, leaving every index in the cluster unable to serve. That failure taught me a deep lesson.
Architecture 2.0
While solving the above problems, we naturally arrived at version 2.0 of the search architecture, roughly structured as follows:
First, a data bus synchronizes data-change messages to MQ, and the synchronization application consumes these MQ messages to replicate the business database. The data bus decouples us from the business database, and its introduction also avoids the waste of multiple Canal instances monitoring and consuming the binlog of the same tables.
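To make the flow concrete, below is a minimal sketch of what such a synchronization application might look like, assuming a Kafka-compatible data bus and the elasticsearch-py client; the topic name, index name, and message format are hypothetical.

```python
# Minimal sketch of a sync application: consume row-change messages from MQ
# and apply them to Elasticsearch. Topic, index, and message schema are
# hypothetical illustrations, not the actual production format.
import json

from kafka import KafkaConsumer          # assumes a Kafka-compatible data bus
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-cluster:9200"])
consumer = KafkaConsumer("goods_binlog", group_id="goods-sync",
                         bootstrap_servers=["mq:9092"])

for msg in consumer:
    change = json.loads(msg.value)       # e.g. {"type": "UPDATE", "id": 1, "doc": {...}}
    if change["type"] in ("INSERT", "UPDATE"):
        es.index(index="goods", id=change["id"], body=change["doc"])
    elif change["type"] == "DELETE":
        es.delete(index="goods", id=change["id"], ignore=[404])
```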
Advanced Search
As the business developed, we gradually gained some centralized traffic entrances, such as distribution and selection. At this point, ordinary bool queries could no longer meet our requirements for fine-grained control over the ranking of search results, and handing the writing and tuning of highly specialized advanced queries, such as complex function_score clauses, over to business developers was clearly undesirable. Instead, we intercept business query requests with an advanced-search middleware: it parses out the necessary conditions, reassembles them into an advanced query, and hands the result to the engine for execution. The general structure is as follows:
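As an illustration of the kind of rewriting this middleware performs, here is a hedged sketch that wraps a parsed bool query in a function_score query; the scoring functions, fields, and weights are made-up examples, not our production ranking.

```python
# Hedged sketch: wrap the business bool query in a function_score so that
# ranking control stays in the middleware. Fields and weights are invented.
def to_advanced_query(bool_query: dict) -> dict:
    return {
        "query": {
            "function_score": {
                "query": bool_query,   # the original relevance query
                "functions": [
                    # boost by sales volume, dampened logarithmically
                    {"field_value_factor": {"field": "sales",
                                            "modifier": "log1p",
                                            "missing": 0}},
                    # prefer recent items, decaying over ~30 days
                    {"gauss": {"created_at": {"origin": "now", "scale": "30d"}},
                     "weight": 2},
                ],
                "score_mode": "sum",       # how function scores combine
                "boost_mode": "multiply",  # how they combine with text relevance
            }
        }
    }
```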
Another optimization was adding a search-result cache. A regular full-text match query is recomputed on every execution, which in practice is unnecessary: it is perfectly acceptable for identical requests within a certain window (for example, 15 or 30 minutes) to return the same results. Caching results in the middleware avoids the waste of repeated query execution and improves its response time. Readers interested in advanced search can refer to another article, "Search Engine Practice (Engineering)" (see the technical blog); it is not covered in detail here.
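A minimal sketch of such a result cache, assuming the elasticsearch-py client; the TTL and key scheme are illustrative.

```python
# Minimal sketch of the middleware result cache: identical requests within
# the TTL window are served from cache instead of re-running the query.
import hashlib
import json
import time

_cache: dict = {}            # {key: (expire_at, result)}
TTL_SECONDS = 15 * 60        # e.g. a 15-minute window, per the text

def cached_search(es, index: str, body: dict):
    # normalize the request into a stable cache key
    key = hashlib.md5(json.dumps([index, body], sort_keys=True).encode()).hexdigest()
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                          # cache hit: skip the engine
    result = es.search(index=index, body=body)
    _cache[key] = (time.time() + TTL_SECONDS, result)
    return result
```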
Big Data Integration
Search applications and big data are inseparable. Besides mining more value from user behavior through log analysis, computing a comprehensive ranking score offline is an indispensable part of optimizing the search experience. In the 2.0 phase, we built an interactive channel between Hive and Elasticsearch through the open-source es-hadoop component. The general architecture is as follows:
Search logs are collected through Flume and stored in HDFS for subsequent analysis; after Hive analysis they can also be exported as search suggestions. Of course, big data provides much more for the search business; these are just a few simple functions.
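As a hedged illustration of the Hive-to-Elasticsearch direction of this channel, the sketch below writes an offline ranking score from a Hive table into an index through the es-hadoop Spark connector; the table name, index name, and option values are assumptions.

```python
# Hedged sketch: export an offline ranking score from Hive to Elasticsearch
# via the es-hadoop Spark connector (the es-hadoop jar must be on the Spark
# classpath). Table and index names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
scores = spark.sql("SELECT goods_id AS id, rank_score FROM dw.goods_rank_score")

(scores.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-cluster:9200")
    .option("es.mapping.id", "id")           # use goods_id as the document _id
    .option("es.write.operation", "upsert")  # merge score into existing docs
    .mode("append")
    .save("goods"))                          # target index
```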
Problems
This architecture supported the search system for more than a year, but it also exposed many problems. The first was the ever-rising maintenance cost. Beyond maintaining the Elasticsearch clusters and handling configuration and field changes of the indexes themselves, the business code coupled into the synchronization programs still placed a heavy maintenance burden on the team, even though the data bus had decoupled us from the business databases. The message queue reduced our coupling with the business to some extent, but it also introduced message-ordering problems that were hard for us, unfamiliar with the state of the business data, to handle.
In addition, the business traffic flowing through the Elasticsearch clusters was a semi-black box to us: we could observe it but not predict it. As a result, there was one failure in which an erroneous high-volume internal call saturated the CPU of an online cluster and took it out of service.
Current Architecture 3.0
In response to the problems of the 2.0 era, we made some targeted adjustments in the 3.0 architecture. The main points:
Receive user calls through an open interface, completely decoupled from business code;
Add a proxy for external services to preprocess user requests and perform necessary flow control, caching, and other operations;
Provide a management platform to simplify index changes and cluster management. Through this evolution, the search system gradually became a platform and began to take shape as a search platform architecture:
Proxy
As the gateway for external services, the proxy not only provides a standardized interface compatible with calls to different Elasticsearch versions through ESLoader, but also embeds functional modules such as request verification, caching, and template queries.
Request verification mainly preprocesses users' write and query requests. If it finds field mismatches, type errors, query syntax errors, suspected slow queries, or similar issues, the request is either rejected fast-fail or executed at a lower flow-control level, preventing invalid or inefficient operations from affecting the whole Elasticsearch cluster.
The cache and ESLoader mainly absorb the general-purpose functions of the original advanced search, letting advanced search focus on its own query analysis, rewriting, and sorting and become more cohesive. We made a small optimization to the cache: since cached query results are usually large (they carry source document content), and to avoid network congestion from frequent access to the codis cluster during traffic peaks, we implemented a simple local cache on the proxy that automatically degrades during peaks.
Template queries deserve a mention here. When the query structure (DSL) is relatively fixed but lengthy, as in product category filtering or order filtering, it can be implemented through a template query. On the one hand this relieves the business side of assembling the DSL; on the other hand, the query template can be edited on the service side, using default values and optional conditions, to optimize online query performance.
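A minimal sketch of this kind of pre-check follows, with purely illustrative rules; the actual production checks are certainly richer.

```python
# Hedged sketch of the proxy's request verification: reject queries that
# reference unknown fields or look expensive before they reach the cluster.
# The rule set here is illustrative, not the real production policy.
DANGEROUS_KEYS = {"script", "wildcard"}      # e.g. patterns treated as slow

def validate(body: dict, known_fields: set) -> None:
    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in DANGEROUS_KEYS:
                    raise ValueError(f"suspected slow query: {key}")
                if key in ("term", "match", "range"):
                    for field in value:
                        if field not in known_fields:
                            raise ValueError(f"unknown field: {field}")
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
    walk(body)   # any raise -> proxy fast-fails with an error response
```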
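A rough sketch of the two-tier lookup, assuming redis-py against the codis proxy; the timeout-based degradation here stands in for whatever the real trigger is.

```python
# Hedged sketch of the two-tier cache: consult the local in-process cache
# first, then the remote (codis) cache with a tight timeout, and degrade to
# a plain cache miss when the remote tier is slow or unreachable at peak.
import redis

local_cache: dict = {}
remote = redis.Redis(host="codis-proxy", port=19000, socket_timeout=0.05)

def get_cached(key: str):
    if key in local_cache:
        return local_cache[key]              # local hit: no network at all
    try:
        value = remote.get(key)              # tight timeout keeps peaks cheap
    except redis.RedisError:
        return None                          # degraded: behave as a miss
    if value is not None:
        local_cache[key] = value             # warm the local tier
    return value
```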
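Elasticsearch supports this natively through stored search templates (mustache). Below is a hedged sketch with a hypothetical template id and parameters.

```python
# Hedged sketch: the fixed, lengthy DSL is stored once as a mustache search
# template; callers pass only parameters. Template id, fields, and defaults
# are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-cluster:9200"])

# "{{size}}{{^size}}20{{/size}}" renders the caller's size, or 20 if omitted:
# the "default value / optional condition" mechanism mentioned in the text.
es.put_script(id="goods_filter", body={
    "script": {
        "lang": "mustache",
        "source": """{
            "query": {"bool": {"filter": [
                {"term": {"category_id": "{{category_id}}"}}
            ]}},
            "size": "{{size}}{{^size}}20{{/size}}"
        }"""
    }
})

# Callers now send only the template id and parameters.
resp = es.search_template(index="goods", body={
    "id": "goods_filter",
    "params": {"category_id": 42, "size": 10},
})
```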
Management Platform
To reduce the day-to-day cost of adding and deleting indexes, modifying fields, and synchronizing configuration, we implemented the first version of the search management platform on Django. It mainly provides an approval flow for index changes and the ability to synchronize index configuration to different clusters. Managing index metadata visually reduced the time we spent on routine platform maintenance.
Another motivation was the open-source head plugin, which is unfriendly in its presentation and exposes some risky features:
As shown in the figure, clicking a field makes head sort the index by that field when displaying results. In earlier versions of Elasticsearch, the contents of the sorted field are loaded through fielddata; if the field's data volume is large, this can easily fill the heap and trigger full GCs or even OOM. To avoid such problems, we also provide a customized visual query component to support users' need to browse data.
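For reference, later Elasticsearch versions default to on-disk doc_values for sortable field types, which avoids this heap cost; a small sketch of the distinction, with hypothetical field names:

```python
# Hedged sketch: sorting on a field backed by doc_values reads columnar data
# from disk instead of loading fielddata onto the heap, avoiding the OOM
# scenario described above. Field names are hypothetical.
mapping = {
    "properties": {
        "title":      {"type": "text"},     # analyzed, for full-text search
        "created_at": {"type": "date"},     # doc_values enabled by default
        "seller_id":  {"type": "keyword"},  # doc_values enabled by default
    }
}

sorted_query = {
    "query": {"match_all": {}},
    "sort": [{"created_at": {"order": "desc"}}],  # safe: served by doc_values
}
# Sorting on "title" (an analyzed text field) would instead require enabling
# fielddata and loading terms onto the heap -- the risky case in the text.
```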
ESWriter
es-hadoop can only adjust read/write traffic by controlling the number of map-reduce tasks; in practice it adapts its behavior based on whether Elasticsearch rejects requests, which is quite unfriendly to clusters serving online traffic. To bring this uncontrolled offline read/write traffic under control, we developed an ESWriter plugin on top of the existing DataX, which achieves second-level control of record count or traffic volume.
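The core of that control is ordinary rate limiting. The sketch below shows a token-bucket limiter capping records per second; it illustrates the technique only and is not the actual DataX plugin code.

```python
# Hedged sketch of the ESWriter idea: a token bucket caps records per second
# so that offline bulk writes cannot saturate an online cluster.
import time

class RecordRateLimiter:
    def __init__(self, records_per_second: int):
        self.capacity = records_per_second          # bucket size
        self.tokens = float(records_per_second)     # start full
        self.last = time.monotonic()

    def acquire(self, n: int = 1) -> None:
        """Block until n record-tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # refill at `capacity` tokens per second, capped at the bucket size
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.capacity)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.capacity)  # wait for refill

# Usage: call limiter.acquire(len(batch)) before each bulk write.
```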
Challenges
Platformization and a complete documentation system lowered the barrier to entry for users. As the business grew rapidly, the operation and maintenance cost of the Elasticsearch clusters themselves gradually became unbearable. Although there are multiple physically isolated clusters, multiple business indexes inevitably share the same physical cluster, so the differing production standards of different businesses are not well supported, and packing too many indexes into one cluster is a hidden danger to the stability of the production environment.
In addition, auto-scaling of cluster service capacity is relatively difficult: horizontally adding a node requires machine application, environment initialization, software installation, and other steps, and physical machines involve a lengthy procurement process, so we cannot respond in time to a shortfall in service capacity.
Future Architecture 4.0
The current architecture accepts users' data synchronization requirements through an open interface. Although this decouples us from the business and reduces our own team's development cost, the development cost on the user side has risen: from database to index, data passes through three steps: fetching data from the data bus, processing it in the synchronization application, and writing it through the search platform's open interface. The first and last of these steps are re-implemented in every business's synchronization program, which wastes resources. We are therefore preparing to integrate with DTS (Data Transporter, the data synchronization service) developed by the PaaS team, so that data can be synchronized automatically between Elasticsearch and multiple data sources through configuration alone.
To solve the problem of shared clusters serving applications with different production standards, we hope to upgrade the platform-based search service further into a cloud-style application mechanism. Combined with a classification of services, core applications would be deployed into independent, isolated physical clusters, while non-core applications request Elasticsearch cloud services on Kubernetes through different application templates. Application templates define service configurations for different scenarios, addressing the differences in production standards across applications, and the cloud service can scale in time based on how each application is running.