2025-02-23 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/03 Report
Background
With the rapid development of the company's business and the explosive growth of data, every production line now has search requirements, but the previous search service could not meet each business line's expectations because of its architecture and business design. This was mainly reflected in the following three problems:
- Sentence-level search was not supported, and a large number of business-specific attributes could not be implemented
- There was no evaluation system for any search-quality metrics
- Scalability and maintainability were particularly poor
Based on this situation, we made a full survey of the search services in the industry, confirmed the use of Elasticsearch as the underlying index store, and redesigned the existing search service to meet the business needs for maintainability and customized search ranking.
Overall technical architecture
The Hujiang search service is built on the distributed search engine Elasticsearch. Elasticsearch is an open-source, distributed, RESTful search engine based on Lucene that meets the requirements of near-real-time search, stability, reliability, and fast response.
The search service is divided into five subsystems.
- Search service (Search Server): provides the search and query functions
- Update service (Index Server): provides the incremental-update and full-update functions
- Admin console: provides a UI to facilitate index-related maintenance operations
- Elasticsearch storage system: underlying index data storage
- Service monitoring platform: monitoring based on ELK logs and Zabbix

External system interface design
The query API is exposed over HTTP; when access crosses data centers, the HTTP API is used, while the rest of the calls go through dubbo RPC. The incremental-update data API is provided through MQ: when a data update occurs on the business side, the business only needs to push the data to its corresponding MQ channel. The update service listens on each business side's channel and applies the data to Elasticsearch in a timely manner. The full-index update service calls a full-pull HTTP interface provided by the business side (which needs to support paged queries and related functions).
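The incremental-update flow above can be sketched in miniature. This is a pure-Python simulation under stated assumptions: `Channel` stands in for one business side's MQ channel, and the dict `index` stands in for an Elasticsearch index keyed by document id; none of these names come from the actual system.

```python
# Minimal sketch of the incremental-update flow: each business side
# pushes change events to its own channel, and the update service
# consumes every channel and applies the changes to the index.
from collections import deque

class Channel:
    """Stands in for one business side's MQ channel."""
    def __init__(self):
        self.messages = deque()

    def push(self, doc):
        self.messages.append(doc)

def consume_channels(channels, index):
    """The update service: drain every channel into the index store.
    `index` stands in for an Elasticsearch index (doc id -> body)."""
    for channel in channels.values():
        while channel.messages:
            doc = channel.messages.popleft()
            index[doc["id"]] = doc          # upsert by document id

# Usage: two business sides push updates, the service indexes both.
channels = {"course": Channel(), "forum": Channel()}
channels["course"].push({"id": "c1", "title": "English 101"})
channels["forum"].push({"id": "f1", "title": "Study tips"})

index = {}
consume_channels(channels, index)
print(sorted(index))  # ['c1', 'f1']
```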
As we all know, the full-update function is an essential part of a search service. It mainly solves the following three problems:

- A failure in the business side's own system may cause a large amount of data to be lost
- Rapid business development brings requirements such as adding or removing fields or changing the word-segmentation (analyzer) algorithm
- A cold-started business needs to import large quantities of data at once
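The full-pull loop behind these cases (walking the business side's paged HTTP interface and bulk-inserting each page) can be sketched as follows. `fetch_page` is a hypothetical stand-in for the business side's paging API, and the dict stands in for bulk-inserting into ES; neither is the actual interface.

```python
# Sketch of the full-update pull loop: walk the business side's paged
# query interface page by page and insert each page into the new index.
def fetch_page(all_docs, page, page_size):
    """Simulates GET /docs?page=N&size=M on the business side."""
    start = page * page_size
    return all_docs[start:start + page_size]

def full_reindex(all_docs, page_size=2):
    index = {}
    page = 0
    while True:
        batch = fetch_page(all_docs, page, page_size)
        if not batch:
            break                      # empty page: full pull is done
        for doc in batch:              # stands in for a bulk insert
            index[doc["id"]] = doc
        page += 1
    return index

docs = [{"id": str(i)} for i in range(5)]
index = full_reindex(docs)
print(len(index))  # 5
```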
Based on the problems above, we worked with the business sides to implement full indexing. In the process, however, we found a general problem: during a full update, incremental updates continue at the same time, and when the two run concurrently, a small amount of incrementally updated data is lost. Consider the following scenario.
The business side finds that a large amount of data has been lost under its search alias alias_A, so it rebuilds the index. Here alias_A is an alias in the usual ES sense, while the underlying real index is index_201701011200 (suggestion: embed a timestamp in the index name so you can tell when it was created). The rebuild first creates a new index index_201706011200, then pulls the data from the source and inserts it into ES, recording the start timestamp T1. The indexing completes at timestamp T2, and the search alias alias_A is switched to point to index_201706011200. The new index only contains data as of T1, so the data written between T1 and T2 never reached it. Meanwhile, the old index_201701011200 continues to consume data from MQ, including the T1-to-T2 data that the new index lacks. So every time the index is rebuilt, the data from T1 to T2 is lost.
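The T1-to-T2 loss can be reproduced with a toy timeline. Everything below is a simulation (integer timestamps, plain dicts for the two indexes), not real ES calls; it only demonstrates why the rebuilt index misses the events that arrive while it is being built.

```python
# Toy timeline reproducing the data-loss window: the full rebuild
# snapshots the source at T1, but the alias only switches at T2, so
# incremental events in (T1, T2] reach only the old index.
source = {1: "a", 2: "b"}          # state of the business data at T1

T1 = 10                            # snapshot taken: rebuild starts
new_index = dict(source)           # full pull sees data as of T1

# Incremental events that arrive while the rebuild is running:
events = [(11, 3, "c"), (12, 4, "d")]   # (timestamp, id, body)
old_index = dict(source)
for ts, doc_id, body in events:
    old_index[doc_id] = body       # old index keeps consuming MQ

T2 = 13                            # rebuild done, alias switched
missing = set(old_index) - set(new_index)
print(sorted(missing))  # [3, 4] -- the (T1, T2] events are lost
```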
Finally, in view of the above scenario, we pause the index consumer's consumption through a ZooKeeper distributed lock. The specific steps are as follows:
1. Create new_index and obtain the alias corresponding to the old index
2. Set the distributed lock status to stop
3. The index consumer observes the stop status and pauses index data updates
4. Once the new_index data has been fully created, set the distributed lock status to start
5. The index consumer observes the start status and resumes index data updates
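The steps above can be sketched as a pause/resume state machine. In the real system the status flag lives in a ZooKeeper node and the held-back events simply remain unconsumed in MQ; here a plain object and a list stand in for both, so the sketch is illustrative only.

```python
# Sketch of the fix: the rebuild job flips a shared status flag to
# "stop" before pulling data, and back to "start" once the new index
# is ready, so incremental events are not written into the index that
# is about to be replaced.
class LockState:
    """Stands in for the ZooKeeper lock node's status."""
    def __init__(self):
        self.status = "start"

def consume(lock, events, index, buffer):
    for doc_id, body in events:
        if lock.status == "stop":
            buffer.append((doc_id, body))  # stays queued, as in MQ
        else:
            index[doc_id] = body

lock, index, buffer = LockState(), {}, []

lock.status = "stop"                       # rebuild begins
consume(lock, [(1, "a"), (2, "b")], index, buffer)

lock.status = "start"                      # new index ready, resume
replay = buffer + [(3, "c")]               # queued events plus new one
buffer = []
consume(lock, replay, index, buffer)
print(sorted(index))  # [1, 2, 3] -- nothing lost
```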
In this way, we no longer need to worry about data missing during index creation. We hope this approach offers some insight into solving the problem of concurrent full and incremental updates.
Seamless expansion of cluster
With the explosive increase in data volume, our ES cluster finally ran into insufficient capacity. Against this background, and using the seamless expansion capability provided by ES itself, we decided to expand the online ES cluster seamlessly from three machines to five. The specific steps are as follows.
Preparation before expansion: currently three machines run online, of which node1 is the master node and node2 and node3 are data nodes; node discovery uses unicast rather than multicast. Prepare two machines (node4 and node5) whose hardware and ES configuration parameters are consistent with the existing nodes.

Adding nodes: start node4 and node5 during the expansion (note: start them one at a time). After startup, check the status of node1 through node5; normally node1, node2, and node3 will have discovered node4 and node5, and the cluster state seen by all nodes should be consistent.

Restarting the master node: modify the node1, node2, and node3 configuration to be consistent with node4 and node5, then restart node2 and node3 sequentially. Be sure to restart the data nodes first, and restart node1 (the master node) last.

At this point, our online ES cluster has been seamlessly expanded online.
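For reference, unicast discovery for a five-node cluster of that era might be configured roughly as follows in elasticsearch.yml. The setting names are from the 2.x/5.x line that the article's timeframe suggests (newer versions use discovery.seed_hosts instead), and the cluster name and host names are illustrative.

```yaml
# elasticsearch.yml -- the same discovery list on all five nodes
cluster.name: search-cluster                # illustrative name
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3", "node4", "node5"]

# On node1 (dedicated master) only:
# node.master: true
# node.data: false
```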
Deployment optimizations

- Deploy the query and update services separately. Physically isolating the query service from the update service prevents instability in the update service from affecting queries.
- Reserve half of the memory. The ES underlying storage engine is based on Lucene. Lucene's inverted index is first generated in memory and then periodically flushed to disk asynchronously in the form of segments, and the operating system caches these segment files for faster access. Lucene's performance therefore depends on this interaction with the OS: if you allocate all the memory to the Elasticsearch heap and leave nothing for Lucene, full-text search performance will be very poor. So the official recommendation is to set aside at least half of the memory for Lucene.
- Do not use more than 32 GB of heap. Above 32 GB, memory utilization actually becomes lower than at 32 GB. For the specific reasons, refer to the official article "Don't Cross 32 GB!".
- Try to avoid wildcard queries. Wildcard queries (such as *foo*) are similar to database LIKE queries with wildcards on both sides and are very expensive.
- Set a reasonable refresh time. The default index.refresh_interval in ES is 1s. For most search scenarios the data does not need to become visible that quickly, so you can adjust the interval according to your business's tolerance.

Summary
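As an illustration of the last point, the refresh interval can be relaxed per index through the index settings API. The index name and the 30s value below are examples, not taken from the article:

```
PUT /index_201706011200/_settings
{
  "index": { "refresh_interval": "30s" }
}
```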
This article introduced the overall structure of the company's search service, focusing on the data-consistency problem during full updates and on the online expansion of ES, and listed some optimizations made when deploying ES. Its main purpose is to give readers some suggestions for building a general search service through the practice of Hujiang search.