Example Analysis of Elasticsearch


2025-07-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shares an example analysis of Elasticsearch. Many readers may not know it well, so it is offered here for reference.

I. Preface

Among distributed search engines, Elasticsearch has gradually become the standard. It makes full-text search simple and hides the complexity of Lucene behind a simple, coherent RESTful API, but the underlying layer still uses Lucene to implement search.

II. Core concepts

Index: an index is an abstraction of a class of data.

Type: a more concrete abstraction of a class of data. In most cases an index corresponds to only one type. A type is similar to a table in a database, and the logical relationship between index and type is often 1-to-1. For example, a type in Elasticsearch stores only the fields of the order data that need to be searched, one of which is the order number. After finding the order number through a search, we usually query the database again and return the details. (Mapping types were deprecated in Elasticsearch 7 and removed in 8.)

Document: like a Document in Lucene, it represents a single piece of searchable data.

Field: like a field in Lucene, it represents one field of a document.

Shard: the basic unit of data storage in an Elasticsearch cluster. An index has multiple shards, and a shard cannot be split further within the cluster.

Coordinating node: any node in the cluster can accept client requests, and the node that accepts a request is called the coordinating node.

Segment file: a disk file used for data persistence within a shard. One shard corresponds to multiple segment files.

Fsync: a Unix system call that forces data held in memory buffers out to the storage device. Here it refers specifically to flushing all segments in the file system cache to disk.

III. Basic principles

1. Distributed strategy

(1) Data distribution

The number of shards and the number of replicas can be specified when an index is created. The number of shards cannot be changed after creation, while the number of replicas can be changed later. As nodes are added to or removed from the cluster, shards and replicas are redistributed across the nodes. A shard and its replica are never allocated to the same node. Shards are distributed evenly among nodes, and documents are distributed among shards by a hash algorithm. You can also customize the shard distribution rules (so that shards are created only on certain nodes in the cluster), for example to separate hot and cold data and improve performance. Because of this sharding mechanism, adding nodes to the cluster keeps any single machine from holding too many shards, which improves search performance.
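As a rough illustration of the two settings described above, the sketch below builds the JSON bodies that could be sent to Elasticsearch. `number_of_shards` and `number_of_replicas` are real index settings. The index name `orders` and the helper function names are made up for this example.

```python
import json


def create_index_body(shards: int, replicas: int) -> dict:
    """Body for `PUT /<index>`: the shard count is fixed at creation time."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


def update_replicas_body(replicas: int) -> dict:
    """Body for `PUT /<index>/_settings`: replicas can change on a live index."""
    return {"index": {"number_of_replicas": replicas}}


if __name__ == "__main__":
    # "orders" is a hypothetical index name used only for illustration.
    print("PUT /orders")
    print(json.dumps(create_index_body(shards=3, replicas=1), indent=2))
    print("PUT /orders/_settings")
    print(json.dumps(update_replicas_body(2), indent=2))
```

Only the replica count appears in the second body, which mirrors the rule that the shard count is set once.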

(2) High availability

A master node is automatically elected in the cluster. Its main role is to manage the cluster, maintain index metadata, and so on. If the master fails, the cluster elects a new master, and the elected node then takes over the master role.

(3) Writing and reading

Write requests are routed only to the primary shard and are then automatically synchronized to the replica shards. Both primary and replica shards can serve reads.

2. Basic principles

(1) Writing process

The coordinating node receives the write request and uses a hash algorithm to route the data to the primary shard of the corresponding shard. When the node holding the primary shard receives the data, it first writes the document to its in-memory buffer and records the operation in the translog (transaction log). By default, every 1s a refresh moves the data from the buffer into a new segment file in the osCache (file system cache). From this point the client can query the data. This step is very fast because it does not involve data persistence, which is why search is near-real-time. When the translog file grows too large or a time limit is reached (30 minutes by default), a flush is triggered. The flush writes the segment files to disk, generates a commit point that records the segment files, and then empties the translog.
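The buffer, refresh, and flush steps can be sketched with a toy in-memory model. This is not Elasticsearch code. The class and attribute names are invented for illustration, and real refresh and flush scheduling is replaced by explicit method calls.

```python
class ToyShard:
    """Toy model of one primary shard's write path (not real Elasticsearch code)."""

    def __init__(self):
        self.buffer = []           # in-memory buffer: not yet searchable
        self.translog = []         # operations recorded since the last flush
        self.cached_segments = []  # segments in the OS cache: searchable, not durable
        self.disk_segments = []    # segments persisted to disk
        self.commit_point = []     # indexes of the segments recorded by the last flush

    def index(self, doc):
        """A write goes to the buffer and to the translog."""
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def refresh(self):
        """Every 1 s by default: the buffer becomes a new segment in the OS cache."""
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Persist segments, record a commit point, then empty the translog."""
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments.clear()
        self.commit_point = list(range(len(self.disk_segments)))
        self.translog.clear()

    def search(self, predicate):
        """Only data that has reached a segment is visible to search."""
        segments = self.disk_segments + self.cached_segments
        return [doc for seg in segments for doc in seg if predicate(doc)]


if __name__ == "__main__":
    shard = ToyShard()
    shard.index({"order_no": "A1"})
    print(shard.search(lambda d: True))  # [] because the doc is only in the buffer
    shard.refresh()
    print(shard.search(lambda d: True))  # [{'order_no': 'A1'}]
    shard.flush()
    print(len(shard.translog))           # 0
```

The first search returns nothing because the document has not been refreshed yet, which is the near-real-time gap described above.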

Note:

When recovering from a failure, Elasticsearch loads the segment files recorded in the current commit point (restoring the search function) and then replays the operations in the translog to recover the remaining data.

Data may be lost while it is still in the buffer or the osCache, or while the translog itself is still in the osCache. You can set parameters to ensure that data is not lost, but at the expense of throughput and performance. Since Elasticsearch 2.0, each time a write request (index, delete, update, bulk, and so on) completes, the translog is fsynced to disk before a 200 OK response is returned.
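The article does not name the parameters it refers to. One plausible candidate is translog durability: `index.translog.durability` and `index.translog.sync_interval` are real Elasticsearch index settings, but treating them as the ones the author meant is an assumption. The helper name below is hypothetical.

```python
def translog_settings(safe: bool) -> dict:
    """Return index settings that trade durability against write throughput.

    safe=True  -> fsync the translog before acknowledging each request.
    safe=False -> fsync in the background at a fixed interval; writes made
                  since the last sync can be lost if the node crashes.
    """
    if safe:
        return {"index.translog.durability": "request"}
    return {
        "index.translog.durability": "async",
        "index.translog.sync_interval": "5s",
    }


if __name__ == "__main__":
    print(translog_settings(safe=True))
    print(translog_settings(safe=False))
```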

(2) The process of deleting data

Deletion works like a soft delete. A delete record is first written to the .del file on disk, marking which documents have been deleted (a search at this point may still match these documents, but they are not returned). When the number of segment files reaches a certain level, ES merges them and physically deletes these documents.

(3) The process of modifying data

Modifying data is a delete followed by an add: the original document is marked as deleted, and a new document is written.
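The delete and modify steps above can be sketched as a toy model in which documents are append-only and deletions are only marks until a merge. The class and its names are illustrative, and the `deleted` set merely stands in for the .del file.

```python
class ToySegments:
    """Toy model of soft deletes and merges (not real Elasticsearch code)."""

    def __init__(self):
        self.docs = []        # append-only list of (doc_id, source)
        self.deleted = set()  # positions marked deleted (stand-in for the .del file)

    def _live_position(self, doc_id):
        for pos, (did, _) in enumerate(self.docs):
            if did == doc_id and pos not in self.deleted:
                return pos
        return None

    def add(self, doc_id, source):
        self.docs.append((doc_id, source))

    def delete(self, doc_id):
        """Mark the document as deleted; nothing is physically removed yet."""
        pos = self._live_position(doc_id)
        if pos is not None:
            self.deleted.add(pos)

    def update(self, doc_id, source):
        """An update is a delete of the old version plus an add of the new one."""
        self.delete(doc_id)
        self.add(doc_id, source)

    def search(self):
        """Deleted documents are filtered out of the results."""
        return [src for pos, (_, src) in enumerate(self.docs) if pos not in self.deleted]

    def merge(self):
        """Physically drop the documents that were marked as deleted."""
        self.docs = [d for pos, d in enumerate(self.docs) if pos not in self.deleted]
        self.deleted.clear()


if __name__ == "__main__":
    seg = ToySegments()
    seg.add(1, "v1")
    seg.update(1, "v2")
    print(seg.search())   # ['v2']
    print(len(seg.docs))  # 2: the old version still occupies space
    seg.merge()
    print(len(seg.docs))  # 1
```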

(4) The process of reading data (by document ID)

The document ID is hashed to determine the shard, and the request is then routed to one of the nodes holding a copy of that shard, chosen by a load-balancing algorithm (round-robin by default), to read the data.
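A minimal sketch of the two steps, hashing the ID to a shard and then rotating over that shard's copies. Elasticsearch uses its own hash function internally, so `zlib.crc32` here is only a stand-in, and the node names and class name are hypothetical.

```python
import itertools
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number.

    crc32 is a stand-in hash chosen for this sketch, so the shard numbers
    will not match those of a real cluster.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


class CopySelector:
    """Round-robin over the nodes holding a copy (primary or replica) of a shard."""

    def __init__(self, copies_by_shard):
        self._cycles = {
            shard: itertools.cycle(nodes) for shard, nodes in copies_by_shard.items()
        }

    def pick(self, shard: int) -> str:
        return next(self._cycles[shard])


if __name__ == "__main__":
    # Hypothetical layout: 2 shards, each with a primary and one replica.
    selector = CopySelector({0: ["node-a", "node-b"], 1: ["node-b", "node-c"]})
    shard = shard_for("order-42", num_primary_shards=2)
    for _ in range(3):
        print(shard, selector.pick(shard))
```

Because the shard number depends on the shard count, this also shows why the number of shards cannot change after the index is created.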

(5) The process of searching data

The coordinating node sends the request to all nodes that hold shards of the index, but for each shard only one copy, either the primary shard or a replica shard, is queried. Each shard returns the doc IDs of its query results to the coordinating node, which merges, sorts, and pages them. The coordinating node then pulls the documents by doc ID from the nodes where the data is actually stored and returns them to the client.
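The search flow can be sketched as a small scatter-gather routine. Shards are modeled as plain dictionaries and the scoring function is supplied by the caller. All names are illustrative and the real scoring and transport layers are left out.

```python
import heapq


def query_phase(shard_docs, score, size):
    """Each shard returns only (score, doc_id) pairs for its local top hits."""
    hits = [(score(doc), doc_id) for doc_id, doc in shard_docs.items()]
    return heapq.nlargest(size, hits)


def search(shards, score, size):
    """Coordinating-node logic: gather IDs, merge and sort, then fetch documents."""
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for s, doc_id in query_phase(shard_docs, score, size):
            candidates.append((s, doc_id, shard_no))
    top = heapq.nlargest(size, candidates)  # global merge and sort
    # Fetch phase: pull the full documents from the shards that own them.
    return [shards[shard_no][doc_id] for _, doc_id, shard_no in top]


if __name__ == "__main__":
    # Two hypothetical shards holding order documents keyed by doc ID.
    shards = [
        {"a": {"order_no": "A", "price": 5}, "b": {"order_no": "B", "price": 9}},
        {"c": {"order_no": "C", "price": 7}},
    ]
    print(search(shards, score=lambda d: d["price"], size=2))
```

Only IDs and scores travel in the first phase, so full documents are fetched for the final page only.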

IV. How to optimize performance

1. Increase osCache coverage

The high performance of Elasticsearch depends largely on the size of the osCache. Memory is far faster than disk, so enlarging the file system cache until it covers as many segment files as possible improves performance.

2. Data preheating

Build a subsystem that searches for the hot data at regular intervals. This keeps the hot data in memory, because the osCache is effectively an LRU cache.
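Such a preheating subsystem could be as small as the loop below. The `search` callable, the list of hot queries, and the interval are all assumptions supplied by the caller, and nothing here talks to a real cluster.

```python
import time


def warm(search, hot_queries, rounds, interval_s, sleep=time.sleep):
    """Re-issue the hot queries every `interval_s` seconds for `rounds` rounds.

    `search` is any callable that runs one query against the cluster.
    Returns the number of queries issued.
    """
    issued = 0
    for round_no in range(rounds):
        for query in hot_queries:
            search(query)
            issued += 1
        if round_no < rounds - 1:
            sleep(interval_s)
    return issued


if __name__ == "__main__":
    # Stand-in for a real search call: just print the query.
    count = warm(print, ["hot product", "today's orders"], rounds=2, interval_s=0.1)
    print(count)  # 4
```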

3. Hot and cold separation

Write hot data to a dedicated index and cold data to a separate index, and place them on different machines by controlling the shard allocation rules. Because the amount of hot data is small, keeping cold data off those machines ensures that as much hot data as possible stays in the osCache. Because cold data never reaches the hot-data nodes, this also avoids the overhead of the osCache frequently swapping data in and out.
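One way to express such a rule is shard allocation filtering. `index.routing.allocation.require.<attribute>` is a real index setting, but the attribute name `box_type`, its values, and the index names below are assumptions. The attribute would also have to be declared on each node, for example as `node.attr.box_type: hot` in elasticsearch.yml.

```python
def allocation_settings(tier: str) -> dict:
    """Index settings that pin an index's shards to nodes tagged with `tier`.

    `box_type` is a custom node attribute name chosen for this sketch.
    """
    if tier not in ("hot", "cold"):
        raise ValueError("tier must be 'hot' or 'cold'")
    return {"index.routing.allocation.require.box_type": tier}


if __name__ == "__main__":
    # Hypothetical index names: one index per temperature.
    for index, tier in [("orders-hot", "hot"), ("orders-cold", "cold")]:
        print(f"PUT /{index}/_settings", allocation_settings(tier))
```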

4. Model design

Complete the associations between types in the application before writing to ES, and store the related data as redundant fields (do not join in ES), because associations between indexes are very inefficient when used in a search.
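As a small illustration of resolving the association before writing, the sketch below copies the user's name into the order document. The field names and sample data are invented for this example.

```python
def build_order_doc(order: dict, users: dict) -> dict:
    """Join order and user in the application and store the result as one document."""
    user = users[order["user_id"]]
    return {
        "order_no": order["order_no"],
        "user_id": order["user_id"],
        "user_name": user["name"],  # redundant copy, so no join is needed at search time
    }


if __name__ == "__main__":
    users = {7: {"name": "Alice"}}
    order = {"order_no": "A1", "user_id": 7}
    print(build_order_doc(order, users))
    # {'order_no': 'A1', 'user_id': 7, 'user_name': 'Alice'}
```

The trade-off is that a change to the user's name means rewriting the affected order documents.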

5. Avoid deep paging

Suppose a query asks for page 100. The data for pages 1 to 100 from every shard is sent to the coordinating node, which then sorts, filters, and pages it. This is deep paging. There are two ways to deal with it. The first is to design the system so that users cannot page that deep, since the deeper the page, the worse the performance. The second is to use the Elasticsearch Scroll API, which lets us run an initial search and keep pulling results from Elasticsearch until none are left. The disadvantage is that we can only move forward one page at a time and cannot jump to an arbitrary page.
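The cost of deep paging and the shape of a scroll loop can both be sketched without a cluster. The fake backend below stands in for the initial search and the follow-up scroll requests. In real Elasticsearch the scroll ID is an opaque string, whereas this sketch uses a plain offset, and all function names are illustrative.

```python
def deep_paging_cost(page: int, size: int, shards: int):
    """Entries each shard must return, and entries the coordinating node must sort."""
    per_shard = page * size  # pages 1..page from every shard
    return per_shard, per_shard * shards


def scroll_all(start_scroll, next_page):
    """Yield every hit by pulling pages until an empty page comes back."""
    scroll_id, hits = start_scroll()
    while hits:
        yield from hits
        scroll_id, hits = next_page(scroll_id)


def make_fake_backend(docs, size):
    """In-memory stand-in for the search and scroll calls (assumption for this sketch)."""

    def start_scroll():
        return 0, docs[:size]

    def next_page(scroll_id):
        offset = scroll_id + size
        return offset, docs[offset:offset + size]

    return start_scroll, next_page


if __name__ == "__main__":
    print(deep_paging_cost(page=100, size=10, shards=5))  # (1000, 5000)
    start, nxt = make_fake_backend(list(range(7)), size=3)
    print(list(scroll_all(start, nxt)))  # [0, 1, 2, 3, 4, 5, 6]
```

The loop only moves forward, which matches the limitation described above.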
