Introduction to Elasticsearch
Elasticsearch is a near real-time, distributed search and analytics engine, commonly used for full-text search, structured search, and analytics. It is written in Java and open source. Internally it uses Lucene for indexing and searching, but its goal is to make full-text retrieval simple by hiding Lucene's complexity behind a simple, consistent set of RESTful APIs.
Lucene is a Java-based full-text information retrieval library. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is an open source project that originated in the Apache Jakarta family and is currently the most popular Java-based open source full-text retrieval library.
However, Elasticsearch is not just Lucene, and it is more than a full-text search engine. It can be accurately described as follows:
A distributed, real-time document store in which every field is indexed and searchable. A distributed, near real-time search engine with analytics capabilities, able to scale out to hundreds of nodes and handle structured or unstructured data at the PB level.
Basic concepts
Cluster: a collection of one or more nodes, uniquely identified by a name specified at startup. The default cluster name is elasticsearch.
Node: a single running ES instance, which stores data and can index and search it; it is uniquely identified within the cluster by its name (default: node-n).
Index: a collection of documents with similar characteristics, uniquely identified within the cluster by its name. It corresponds to a database in MySQL.
Type (document category): a logical classification within an index. In ES 6.x an index may contain only one type (multiple types are no longer supported), and in 7.x types are deprecated. It corresponds to a table in a MySQL database.
Document: the smallest unit that makes up an index; it belongs to one type of an index and is uniquely identified within that type by its id. It corresponds to a row of a table in a MySQL database.
Field: the unit that makes up a document. It corresponds to a column of a table in a MySQL database.
Mapping: constrains the types of a document's fields and can be understood as the internal structure of the index. It corresponds to the column types of a table in a MySQL database.
Shard: the index is divided into multiple blocks, each called a shard. The number of shards must be specified when the index is defined and cannot be changed afterwards. By default an index has 5 shards, and each shard is a fully functional index in its own right. Sharding brings gains in scale (horizontal partitioning of data) and performance (parallel execution); a shard is the smallest unit of data storage in ES.
Replica: a backup of a shard. Each primary shard has one replica by default. Replicas improve node availability and search concurrency (a search can run on all shards in parallel).
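To make these concepts concrete, here is a minimal sketch that creates an index with explicit shard and replica counts and a simple mapping through Elasticsearch's REST API. It assumes Python with the requests package and a node running locally on port 9200; the index name my_index and its fields are invented for the example, and the mapping syntax shown is the typeless 7.x form.

import requests

# Create an index with explicit shard/replica settings and a simple mapping.
index_body = {
    "settings": {
        "number_of_shards": 5,    # fixed when the index is created, cannot be changed later
        "number_of_replicas": 1   # one replica per primary shard
    },
    "mappings": {
        "properties": {
            "email": {"type": "keyword"},
            "first_name": {"type": "text"},
            "join_date": {"type": "date"}
        }
    }
}
resp = requests.put("http://localhost:9200/my_index", json=index_body)
print(resp.json())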
Elasticsearch characteristics
Inverted index
The index is the core of a modern search engine; building an index means processing the source data into index files that are convenient to query.
Why is indexing so important? Imagine that you want to find the documents containing certain keywords among a huge number of documents. Without an index you would have to read every document into memory sequentially and check whether it contains the keywords, which would take a very long time. A search engine can return results in milliseconds precisely because an index has been built. You can think of the index as a data structure that lets you quickly and randomly access the keywords stored in it and find the documents associated with each keyword.
Lucene uses a mechanism called an inverted index. An inverted index maintains a list of words/phrases, and for each word/phrase in this list it keeps a list describing which documents contain it. When the user enters a query, the search results can therefore be obtained very quickly. We will describe the indexing mechanism of Lucene in detail in the second part of this series. Because Lucene provides an easy-to-use API, you can readily use it to index your documents even without a deep understanding of how full-text indexing works.
Once the documents are indexed, you can search on those indexes. The search engine first parses the search keywords, then searches the established index, and finally returns the documents associated with the keywords the user entered.
Let's take a look at how the following two documents are inverted:
Document 1 (Doc 1): Insight Data Engineering Fellows Program
Document 2 (Doc 2): Insight Data Science Fellows Program
Term          Documents
data          Doc 1, Doc 2
engineering   Doc 1
fellows       Doc 1, Doc 2
insight       Doc 1, Doc 2
program       Doc 1, Doc 2
science       Doc 2
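As a minimal sketch (not Lucene's actual implementation), the following Python snippet builds an inverted index like the table above from the two example documents.

from collections import defaultdict

docs = {
    "Doc 1": "Insight Data Engineering Fellows Program",
    "Doc 2": "Insight Data Science Fellows Program",
}

# Map every lowercased term to the set of documents that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

for term in sorted(inverted_index):
    print(term, "->", sorted(inverted_index[term]))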
Full-text search
Full-text retrieval first extracts the terms from the documents to be queried and builds an index from them, so that the target documents can then be found by querying the index. This process of building an index first and then searching it is called full-text search (Full-text Search).
The two most important aspects of full-text search are:
Relevance
Relevance is the ability to evaluate how well a query matches its results and to rank the results accordingly. The calculation can use TF/IDF (term frequency / inverse document frequency), geographical proximity, fuzzy similarity, or some other algorithm.
Analysis
Analysis is the process of converting blocks of text into distinct, normalized tokens in order to create the inverted index and to query it.
Analysis mainly involves the following steps:
1. Split a block of text into individual terms suitable for the inverted index (for example, splitting on whitespace and punctuation).
2. Normalize these terms into a standard form (for example, lowercasing Quick to quick, removing stopwords such as a and the, and adding synonyms so that jump and leap match each other) to improve their searchability.
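Elasticsearch exposes this analysis chain through its _analyze endpoint. A hedged sketch, assuming Python with requests and a local node; the analyzer and sample text are purely illustrative:

import requests

# Ask the node how the standard analyzer tokenizes a sentence.
body = {"analyzer": "standard", "text": "The QUICK brown fox jumped over the lazy dog"}
resp = requests.post("http://localhost:9200/_analyze", json=body)

# Each token comes back lowercased, together with its position and character offsets.
for token in resp.json()["tokens"]:
    print(token["position"], token["token"])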
Structured search
Structured search (Structured search) refers to querying data that has an inherent structure. Dates, times, and numbers, for example, are all structured: they have precise formats, and we can operate on them logically, for instance by comparing their values.
In a structured query the answer is always yes or no: a value is either in the set or it is not. A structured query does not care about document relevance or scoring; it simply includes or excludes documents.
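A hedged sketch of such a structured query, assuming Python with requests, a local node, and a hypothetical my_index with a numeric price field: the range filter either matches a document or it does not, and no relevance score is computed for it.

import requests

# A structured query: documents whose price is between 20 and 40, expressed as a filter.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"price": {"gte": 20, "lte": 40}}}
            ]
        }
    }
}
resp = requests.post("http://localhost:9200/my_index/_search", json=query)
print(resp.json()["hits"]["total"])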
Document oriented
Elasticsearch is document-oriented, which means it stores entire objects or documents, where a document might be an HTML page, an email, or a text file. A Document object consists of multiple Field objects; think of a Document as a record in a database and each Field as a field of that record. Elasticsearch not only stores documents but also indexes the contents of each document so that it can be retrieved. In Elasticsearch we index, search, sort, and filter documents rather than rows of column data. This is a fundamentally different way of thinking about data, and it is why Elasticsearch can support complex full-text retrieval.
Elasticsearch uses JavaScript Object Notation (JSON) as the serialization format for documents. JSON serialization is supported by most programming languages and has become the standard format in the NoSQL world. It is simple, concise, and easy to read.
The following JSON document represents a simple user object:
{
  "email": "john@smith.com",
  "first_name": "John",
  "last_name": "Smith",
  "join_date": "2014-05-01"
}
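For illustration, this document could be stored with a single REST call. A hedged sketch, assuming Python with requests, a local node, and a hypothetical users index:

import requests

user = {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "join_date": "2014-05-01",
}

# Index (create or overwrite) the document with id 1 in the users index.
resp = requests.put("http://localhost:9200/users/_doc/1", json=user)
print(resp.json()["result"])  # "created" the first time, "updated" afterwards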
Near real-time retrieval of massive data
Although changes in Elasticsearch are not visible immediately, it provides near real-time search. Committing Lucene changes to disk is an expensive operation. To make documents searchable without committing every change to disk, Elasticsearch keeps a filesystem cache between the in-memory buffer and the disk. The in-memory buffer is refreshed (by default every 1 second), and a new segment with its inverted index is created in the filesystem cache. This segment is then open and available for search.
A file in the filesystem cache has a file handle and can be opened, read, and closed, yet it lives in memory. Because the refresh interval defaults to 1 second, changes are not visible instantly, which is why the search is described as near real-time. The translog persists every change that has not yet been committed, and it also contributes to near real-time behavior for CRUD operations: for each request, recent changes can be looked up in the translog before the relevant segments are searched, so the client can see all near real-time changes.
You can explicitly refresh the index after a create / update / delete operation so that the change becomes visible immediately, but I do not recommend doing so, because it creates many small segments and hurts search performance.
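A hedged sketch of both knobs mentioned above, assuming Python with requests, a local node, and a hypothetical my_index: forcing an explicit refresh, and relaxing the refresh interval for indexes with loose real-time requirements.

import requests

# Force a refresh so that recently indexed documents become searchable immediately
# (use sparingly: every refresh creates another small segment).
requests.post("http://localhost:9200/my_index/_refresh")

# Relax the refresh interval from the default 1s to 30s to reduce segment churn.
settings = {"index": {"refresh_interval": "30s"}}
requests.put("http://localhost:9200/my_index/_settings", json=settings)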
Elasticsearch node types
An Elasticsearch instance is a node, and a group of nodes forms a cluster. Nodes in an Elasticsearch cluster can be configured in three different roles:
Master node: controls the Elasticsearch cluster and is responsible for cluster-level operations such as creating or deleting indexes, tracking the nodes in the cluster, and assigning shards to nodes. The master node maintains the cluster state, broadcasts it to the other nodes, and collects their acknowledgements. Any node can become master-eligible by setting the node.master attribute in elasticsearch.yml to true (the default). For large production clusters it is recommended to use dedicated master nodes that do not handle any user requests.
Data node: holds the data and the inverted indexes. By default every node is a data node (the node.data attribute in elasticsearch.yml is true). To run a dedicated master node, set its node.data attribute to false.
Client node: if both node.master and node.data are set to false, the node is a client node; it acts as a load balancer, routing incoming requests to the nodes of the cluster.
Elasticsearch cluster status
Green: all primary shards and all replica shards are working properly.
Yellow: all primary shards are working, but not all replica shards are.
Red: at least one primary shard is not working properly.
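The status can be read from the cluster health API. A hedged sketch, assuming Python with requests and a local node:

import requests

health = requests.get("http://localhost:9200/_cluster/health").json()
# "status" is "green", "yellow", or "red"; the response also reports
# node counts and the number of unassigned shards.
print(health["status"], health["number_of_nodes"], health["unassigned_shards"])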
Elasticsearch cluster election
Elasticsearch elects the cluster master with its own zen discovery algorithm. The general idea is as follows:
1. All master-eligible nodes are sorted by nodeId; each node sorts the nodes it knows about and votes for the first one (position 0), temporarily treating it as the master.
2. If a node receives enough votes (at least the required number of master-eligible nodes, the quorum) and it also votes for itself, it becomes the master; otherwise a new election round starts.
3. To prevent split-brain, the minimum number of master-eligible candidate nodes (the quorum) must be configured.
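The quorum is usually computed as half the number of master-eligible nodes plus one; in versions before 7.x it was set through the discovery.zen.minimum_master_nodes setting, while 7.x manages this automatically. A minimal sketch of the arithmetic:

# Quorum of master-eligible nodes needed to avoid split-brain.
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    return master_eligible_nodes // 2 + 1

print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3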
Elasticsearch write operation
When we send a request to index a new document to the coordinating node, the following sequence of actions occurs:
Every node in the Elasticsearch cluster holds metadata about which shards live on which nodes. The coordinating node (by default) uses the document ID to choose the target shard for routing: Elasticsearch hashes the document ID with the Murmur3 function and takes the result modulo the number of primary shards, which yields the shard that will hold the indexed document.
Formula: shard = hash(document_id) % num_of_primary_shards
When the node holding that shard receives the request from the coordinating node, the request is written to the translog (which we will discuss in the next article in this series) and the document is added to the in-memory buffer. If the request succeeds on the primary shard, it is sent in parallel to the replica shards. The client does not receive an acknowledgement until the translog has been synchronized (fsync) on the primary shard and all of its replicas.
The in-memory buffer is refreshed periodically (every 1 second by default) and its contents are written to a new segment in the filesystem cache. Although this segment has not yet been fsynced, it is open and its contents are searchable.
Every 30 minutes, or when the translog grows too large, the translog is cleared and the filesystem cache is synced to disk. This process is called flush in Elasticsearch. During a flush the in-memory buffer is cleared and its contents are written to a new segment; the segments are fsynced, a new commit point is created, and the contents are written to disk. The old translog is deleted and a new one is started.
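A hedged sketch of the routing formula, using the third-party mmh3 package as a stand-in for Elasticsearch's internal Murmur3 routing hash (the real implementation differs in details such as seeding and custom routing values):

import mmh3  # third-party Murmur3 implementation: pip install mmh3

def route_to_shard(document_id: str, num_of_primary_shards: int) -> int:
    # shard = hash(document_id) % num_of_primary_shards
    return mmh3.hash(document_id) % num_of_primary_shards

print(route_to_shard("user-42", 5))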
Elasticsearch update and delete operations
Deletes and updates are also write operations, but documents in Elasticsearch are immutable: they cannot be modified or removed in place. So how are documents deleted and updated?
Each segment on disk has a corresponding .del file. When a delete request is issued, the document is not actually deleted; it is only marked as deleted in the .del file. The document still matches queries but is filtered out of the results. When segments are merged (which we will talk about in the next article in this series), documents marked as deleted in the .del file are not written to the new segment.
Now let's look at how updates work. When a new document is created, Elasticsearch assigns it a version number. When the document is updated, the old version is marked as deleted in the .del file and the new version is indexed into a new segment. The old version still matches queries but is filtered out of the results.
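A hedged sketch of the corresponding REST calls, assuming Python with requests, a local node, and the hypothetical users index from earlier:

import requests

# Partial update: internally the old version is marked as deleted in the .del file
# and a new version of the document is indexed into a new segment.
update = {"doc": {"last_name": "Doe"}}
requests.post("http://localhost:9200/users/_update/1", json=update)

# Delete: the document is only marked as deleted until segments are merged.
requests.delete("http://localhost:9200/users/_doc/1")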
Elasticsearch read operation
The read operation consists of two phases:
1. Query phase
2. Fetch phase
Query phase
In this phase the coordinating node routes the query to every shard of the index (a primary shard or one of its replicas). Each shard executes the query independently and builds a priority queue of results sorted by relevance score (which we will discuss in subsequent articles in this series). Every shard returns the IDs of its matching documents together with their relevance scores to the coordinating node. The coordinating node then builds its own priority queue and sorts the results globally. Many documents may match, but by default each shard sends only its top 10 results to the coordinating node, which merges them across all shards in a priority queue and returns the overall top 10 as the hits.
Fetch phase
Once the coordinating node has sorted all the results into a globally ordered list of documents, it requests the original documents from the shards that contain them. The shards fill in the document contents and return them to the coordinating node.
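A hedged sketch of a search request that drives these two phases, assuming Python with requests, a local node, and a hypothetical my_index with a title field; from and size control what the coordinating node ultimately returns (deep pagination with large from values is expensive, as the optimization list below notes):

import requests

query = {
    "from": 0,   # offset of the first hit to return
    "size": 10,  # default page size; each shard contributes its own top hits
    "query": {"match": {"title": "elasticsearch"}},
}
resp = requests.post("http://localhost:9200/my_index/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])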
How Elasticsearch failover works
Fault detection
The master node pings all the other nodes to check that they are still alive, and all nodes ping the master back to confirm that it is alive.
Fail-over
Suppose we shut down the master node, Node 1. A cluster must have a master node to function properly, so the first thing that happens is the election of a new master: Node 2.
When we shut down Node 1 we also lost primary shards 1 and 2, and an index cannot work properly while primary shards are missing. If we checked the cluster status at this moment, it would be red: not all primary shards are working.
Fortunately, complete copies of those two primary shards exist on other nodes, so the new master immediately promotes the corresponding replica shards on Node 2 and Node 3 to primaries, and the cluster status becomes yellow. Promoting a replica to primary happens instantly, like flipping a switch. Why is the cluster yellow rather than green? Although all three primary shards are present, each primary shard is configured to have two replica shards, and at the moment each has only one. That is why the cluster cannot be green. (If we had configured only one replica, the status would be green.)
If we restart Node 1, the cluster can re-allocate the missing replica shards and will return to a fully normal, green state. If Node 1 still has its old shards, it will try to reuse them and will copy from the primary shards only the data that changed in the meantime.
Elasticsearch optimization points
1. Keep the heap on each node under 32 GB so that the JVM can use compressed object pointers.
2. Set up index templates sensibly, including settings and mappings.
3. Use SSDs to improve disk I/O.
4. When the data volume is large, deep paging with from/size is expensive and should be avoided.
5. Close unused indexes.
6. Leave half of the system memory to the filesystem cache.
7. Use bulk requests for writes (see the sketch after this list).
8. Increase the index refresh interval when real-time requirements are low.
9. Disable system swap.
10. When the whole cluster is restarted, adjust the delayed shard allocation policy as needed.
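A hedged sketch of a bulk write (optimization point 7), assuming Python with requests and a local node; the _bulk endpoint expects newline-delimited JSON with an action line followed by a source line for each document:

import json
import requests

docs = [
    {"email": "john@smith.com", "first_name": "John"},
    {"email": "jane@smith.com", "first_name": "Jane"},
]

# Build the newline-delimited bulk body: one action line, then one source line per document.
lines = []
for i, doc in enumerate(docs, start=1):
    lines.append(json.dumps({"index": {"_index": "users", "_id": i}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"  # the bulk body must end with a newline

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False if every action succeeded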
Slow recovery after a full cluster restart
Reason:
1. The translog of every index in the active state must be replayed to guarantee the data integrity of the primary shards; once a primary shard is recovered, its replicas copy the missing data from it.
2. There are too many shards.
Mitigation:
1. Close unwanted indexes
2. Flush the cluster before the restart so that the translog is committed (see the sketch after this list).
3. While the cluster is recovering, disable writes to it to reduce the pressure on the cluster.
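A hedged sketch of mitigation step 2, assuming Python with requests and a local node: flushing before the restart commits in-memory segments to disk and truncates the translog, so there is little left to replay on startup (older versions also offer a synced-flush variant of this endpoint).

import requests

# Flush all indexes: commit segments and truncate the translog before restarting.
resp = requests.post("http://localhost:9200/_flush")
print(resp.json()["_shards"])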
Shards that cannot be allocated
Reason:
1. File corruption or file system problem
2. Shard allocation timed out because the machine was under heavy load.
Tools:
1. Use the allocation explain API to find out why allocation failed (see the sketch after this block).
Resolve:
1. Close and reopen the index to trigger reallocation.
2. Set the number of replicas to 0.
3. Use the reroute API; this may carry a risk of data loss.
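A hedged sketch of the allocation explain call mentioned in the Tools list, assuming Python with requests and a local node; called without a body, it explains an arbitrary unassigned shard (and returns an error if none is unassigned):

import requests

resp = requests.get("http://localhost:9200/_cluster/allocation/explain")
explanation = resp.json()
# For an unassigned shard the response names the index, the shard number,
# and the reason it could not be allocated.
print(explanation.get("index"), explanation.get("unassigned_info", {}).get("reason"))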
Elasticsearch main features
Advanced features
1. Supports SQL-like queries
2. Machine learning
New features in the latest stable 7.x release
1. The default number of shards is reduced to 1
2. Typeless index structure
3. Kibana supports dark mode
4. Term query optimizations improve query performance by as much as 3,700%
5. More robust memory management, reducing the occurrence of OOM errors
6. Timestamps support nanosecond precision