What is the core principle of Elasticsearch in Java


This article explains the core principles of Elasticsearch in Java. It is meant to be easy to understand, and I hope it helps clear up your doubts as we study "what is the core principle of Elasticsearch in Java?" together.

Introduction to Elasticsearch: what is it and what can it do?

Elasticsearch (hereafter ES) is a distributed full-text search engine built on Lucene. It excels at storing massive amounts of data, data analysis, and full-text search, and it is an excellent storage and analytics middleware widely used in fields such as log analysis and full-text retrieval. Many large companies have built their own storage middleware and data analysis platforms on top of Elasticsearch.

Start with the core concept: Lucene

Lucene is an Apache project and an open-source full-text search toolkit. It is not a complete full-text search engine in itself, but rather a full-text search engine architecture that provides a complete query engine and indexing engine. It is the core foundation on which ES implements full-text retrieval: the core work of indexing documents and searching the index is done inside Lucene.

Core data structure: Document

We often say that ES is document-oriented. What does that mean? It means that ES performs its data operations, mainly search and indexing (indexing here means writing data), at the granularity of documents. The document is therefore the underlying data structure of ES; it is serialized and stored in ES. So what exactly is a document? Since most of us are familiar with MySQL, we can compare MySQL's databases and tables with ES's index. The analogy is not perfectly accurate, but it helps in understanding the concepts. Note also that type has been phased out since ES 6.x.

Index

Earlier versions of ES had the concept of type, analogous to a table in a database, and the documents described above were placed inside a type. In later versions, however, type was gradually removed to improve storage efficiency, so in today's ES an index plays the role of both the database and the table. Put simply, an index is a container for documents, a collection of documents. Note, though, that an index is a logical grouping; the actual data lives in shards, which occupy physical space.

Also note that in ES the word "index" has different meanings depending on context: it can be a noun or a verb. As a noun, an index is the collection of documents described above; as a verb, to index means to write document data into ES.

In ES, to hide the differences between client languages, all direct external interaction with ES goes through a REST API.
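As a concrete illustration, here is a minimal sketch that indexes one document over the REST API using only the JDK's built-in HTTP client. The host localhost:9200, the index name article, the document ID 1, and the field names are made-up assumptions for the example, not values taken from this article.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexDocumentExample {
    public static void main(String[] args) throws Exception {
        // Assumption: a local single-node ES instance listening on localhost:9200.
        String json = "{\"title\": \"hello es\", \"content\": \"a document is the unit ES indexes and searches\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/article/_doc/1")) // index "article", document id 1 (hypothetical)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))          // "index" used as a verb: write the document
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

Because type has been removed, recent ES versions use the generic _doc endpoint as shown above; the same request could equally be sent with curl or any other HTTP client.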

Inverted index

We all know that the purpose of an index is to speed up data queries. Without an index, a relational database has to compare rows one by one to find the data; in the worst case it has to scan the whole table. Take MySQL as an example: it uses a B+ tree as its index to accelerate queries. Now imagine this scenario: while out walking at the weekend you suddenly hear a great song, you remember two lines of the lyrics, and you want to pull out your phone and find the song on QQ Music. If you were a QQ Music engineer, how would you look up songs by their lyrics? Could you use a B+ tree as the index? Full-text indexing has to support indexing large bodies of text, and in terms of space a B+ tree is not well suited to that. In addition, because every B+ tree search starts from the root node, it follows the leftmost-prefix matching principle, while full-text searches usually do not, so the index would often fail to be used. This is where the inverted index comes in. A forward index is like a book's table of contents: you look up content by page number. An inverted index is the opposite: it tokenizes the content and builds a mapping from each token back to the IDs of the documents that contain it. With this structure, full-text retrieval can perform both exact and fuzzy queries over the terms in the dictionary, which is exactly what full-text search requires.

The structure of the inverted index consists of two main parts: the Term Dictionary and the Posting List (inverted list). The Term Dictionary records the terms that appear in documents and maps each term to its posting list. The Posting List records where each term occurs, including the document ID, the term frequency (how many times the term appears in the document, used to compute the relevance score), and the position and offset (used to implement search highlighting).
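To make the term dictionary / posting list structure concrete, here is a deliberately simplified sketch, not Lucene's actual implementation: it tokenizes documents by whitespace and records, for each term, the document IDs, term frequencies, and positions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {
    // One posting: where a term occurs inside a single document.
    static class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>(); // positions enable phrase queries and highlighting
        Posting(int docId) { this.docId = docId; }
        int frequency() { return positions.size(); }        // term frequency, used for relevance scoring
    }

    // "Term dictionary": term -> "posting list"
    private final Map<String, List<Posting>> index = new HashMap<>();

    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+"); // naive whitespace tokenizer stands in for a real analyzer
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> postings = index.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            if (postings.isEmpty() || postings.get(postings.size() - 1).docId != docId) {
                postings.add(new Posting(docId));
            }
            postings.get(postings.size() - 1).positions.add(pos);
        }
    }

    List<Posting> search(String term) {
        return index.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "cool song lyrics");
        idx.add(2, "copy of the song lyrics");
        for (Posting p : idx.search("lyrics")) {
            System.out.println("doc " + p.docId + " freq " + p.frequency() + " positions " + p.positions);
        }
    }
}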

FST

As mentioned above, full-text retrieval reaches the original data through the term-to-docId mapping in the inverted index. But there is a problem: ES relies on Lucene to implement the inverted index, and when data is written, Lucene generates inverted-index entries for every term in the original data, so the inverted index becomes very large. The posting files backing the inverted index are stored on disk. If every query read the inverted index straight from disk and then fetched the original data by docId, it would cause many rounds of disk IO and seriously hurt retrieval efficiency. So we need a way to locate a term in the inverted index quickly. What would be a good approach? Data structures such as a HashMap, a trie, a binary search tree, or a ternary search tree all come to mind. In practice, Lucene uses an FST (Finite State Transducer), a kind of finite state machine, as a secondary index over the terms.

Let's first look at the trie structure. In Lucene, terms in the inverted index that share a common prefix are grouped into a block; for example, cool and copy share the prefix co. A trie is built along the lines of a prefix tree, and the node for the shared prefix stores the starting address of the corresponding block. Compared with a HashMap, which only supports exact lookup, a trie supports both exact lookup and prefix (fuzzy) lookup thanks to its shared prefixes. So where can the trie still be optimized?

Terms such as school and cool end in the same characters, so we can compress the structure further by also merging shared suffixes in the trie. This suffix-sharing, optimized trie is essentially the FST.

By building an FST as a secondary index, we can locate entries in the inverted index quickly without many rounds of disk IO, which greatly improves search efficiency. Note, however, that the FST lives in heap memory and stays resident there, typically occupying about 50% to 70% of the heap, so this is also a place where heap memory can be tuned in production.
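The sketch below shows only the prefix-sharing half of the idea: a plain trie whose nodes are reused by terms with a common prefix (co in cool and copy), with each complete term pointing at a hypothetical block address. A real FST goes further by also sharing identical suffixes and attaching outputs to its arcs, which this toy version does not attempt.

import java.util.HashMap;
import java.util.Map;

public class PrefixTrie {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isTerm;                 // marks the end of a complete term
        long blockAddress = -1;         // hypothetical pointer into the on-disk posting block
    }

    private final Node root = new Node();

    void insert(String term, long blockAddress) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node()); // shared prefixes reuse the same nodes
        }
        node.isTerm = true;
        node.blockAddress = blockAddress;
    }

    // Exact lookup: returns the block address, or -1 if the term is absent.
    long lookup(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return -1;
        }
        return node.isTerm ? node.blockAddress : -1;
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        trie.insert("cool", 0L);   // "cool" and "copy" share the nodes for "co"
        trie.insert("copy", 128L);
        trie.insert("school", 256L);
        System.out.println(trie.lookup("copy")); // 128
        System.out.println(trie.lookup("cop"));  // -1: the prefix exists but is not a complete term
    }
}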

Cluster-related concepts

To improve storage reliability and high availability, ES supports cluster deployment: even if some nodes in the cluster fail, the cluster as a whole does not become unavailable. At the same time, horizontal scaling increases ES's data storage capacity.

Node

A node is simply an instance of ES. We usually deploy one ES instance per server, and each instance is a Java process. Although all nodes run the same software, in a real ES cluster different nodes take on different roles. Data nodes are mainly responsible for storing shard data and are key to scaling storage horizontally. Coordinating nodes are responsible for routing user requests and merging query results before returning them. And of course there are master nodes, which manage and maintain the state of the whole cluster.

Shard

A single ES node can only store so much data, so it cannot meet the storage needs of massive data sets on its own. How do we meet those needs? The core idea is to split the data. Suppose there are one billion records in total: if they all live on one node, queries and writes will be slow, and there is also a single point of failure. Traditional relational databases handle this by splitting databases and tables, using more database instances to hold the large volume of data. ES adopts a similar design: since one instance is not enough for the data, multiple instances share the storage, and the subset of the data held by each instance is a shard. For example, an index might be divided into three shards stored on three ES instances; to improve availability, each primary shard also has two replica shards, which are copies of the primary.

PUT /article
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

Note that the number of shards should not be set arbitrarily; storage capacity needs to be planned in advance based on the actual production environment, because a shard count that is too low or too high will hurt the overall performance of the ES cluster. If the shard count is too low, each shard may hold a very large amount of data, which hurts retrieval efficiency and limits horizontal scaling. If the shard count is too high, it can distort the relevance scores of search results and reduce retrieval accuracy.
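As a rough illustration of this up-front planning, the sketch below estimates a shard count from an expected data volume and a target size per shard. The 30 GB target, the 600 GB projection, and the replica count are made-up example numbers for illustration, not recommendations from this article; real sizing should be based on your own benchmarks.

public class ShardPlanning {
    public static void main(String[] args) {
        double expectedDataGb = 600.0;   // assumption: projected index size, including future growth
        double targetShardSizeGb = 30.0; // assumption: desired size per primary shard
        int replicasPerPrimary = 1;      // assumption: one replica copy of each primary shard

        int primaryShards = (int) Math.ceil(expectedDataGb / targetShardSizeGb);
        int totalShards = primaryShards * (1 + replicasPerPrimary);

        System.out.println("primary shards: " + primaryShards);             // 20
        System.out.println("total shards (with replicas): " + totalShards); // 40
    }
}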


That covers "what is the core principle of Elasticsearch in Java". If this article helped you and you think it is well written, please share it with your friends so they can learn something new too.
