
What are the common terms of ElasticSearch

2025-01-17 Update From: SLTechnology News&Howtos


This article covers the commonly used terms of ElasticSearch, which many people find hard to keep straight. The summary below is meant to make them easier to understand.

It introduces the basic concepts of ElasticSearch: documents, indexes, clusters, nodes, and shards. It also draws a simple analogy between ElasticSearch and relational databases, and briefly introduces the REST API.

ElasticSearch terminology

Indexes and documents are logical concepts, while nodes and shards are physical concepts.

First, let's talk about what a document is:

Document (Document)

ElasticSearch (ES for short) is document-oriented and the document is the smallest unit of all searchable data.

Here are a few examples of what a document can be:

A log entry in a log file

The details of a movie, a record, etc.

A song in an MP3 player, or a specific piece of content in a PDF document

A piece of customer data, a piece of commodity classification data, a piece of order data

You can think of a document as a record in a relational database.

In ES, documents are serialized as JSON and saved in ES. A JSON object consists of fields, and each field has a corresponding field type (string / array / Boolean / date / binary / range).

In ES, each document has a unique ID, which you can specify yourself or let ES generate automatically.
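As a minimal sketch of the two points above (a document is a JSON object made of fields, and its ID is either supplied or generated), the following Python snippet prepares a document for storage. The helper name `prepare_document` is hypothetical, and `uuid4` is only a stand-in for ID generation; ES uses its own scheme.

```python
import json
import uuid


def prepare_document(fields, doc_id=None):
    """Return (doc_id, json_body) for a document made of the given fields.

    If no ID is supplied, a random one is generated. uuid4 is only a
    stand-in here; ES uses its own ID generation scheme.
    """
    if doc_id is None:
        doc_id = uuid.uuid4().hex
    return str(doc_id), json.dumps(fields, sort_keys=True)


# A document with a caller-supplied ID.
doc_id, body = prepare_document(
    {"title": "Blackbeard's Ghost", "year": 1968, "genre": ["Children", "Comedy"]},
    doc_id="2035",
)
print(doc_id, body)

# A document whose ID is generated automatically.
auto_id, _ = prepare_document({"title": "Untitled"})
print(auto_id)
```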

The previous article showed how to build a real-time log analysis platform with ELK and how to import data into ES through Logstash. Part of the test dataset looks like this before conversion:

MovieId,title,genres
193585 Drama
193587 Stray Dogs: Dead Apple (2018),Action|Animation
193585 Dice Clay: Dice Rules (1991),Comedy

Logstash reads the movie data row by row from the test dataset's CSV file, converts each row to JSON, and writes it into ES.
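The actual Logstash filter is not shown in this article, so the following Python sketch only imitates the kind of row-to-JSON conversion described above. It assumes the CSV header shown earlier and assumes the year is split out of the title and the genres are split on "|", which matches the shape of the document shown later in this article. The function name `row_to_document` is hypothetical.

```python
import csv
import io
import json
import re

# Matches a trailing "(YYYY)" in a title such as "Some Movie (2018)".
YEAR_SUFFIX = re.compile(r"^(?P<title>.*)\s+\((?P<year>\d{4})\)\s*$")


def row_to_document(row):
    """Convert one CSV row (a dict) into a JSON-ready document dict."""
    doc = {"id": row["MovieId"], "genre": row["genres"].split("|")}
    match = YEAR_SUFFIX.match(row["title"])
    if match:
        doc["title"] = match.group("title")
        doc["year"] = int(match.group("year"))
    else:
        # No year suffix found: keep the title unchanged.
        doc["title"] = row["title"]
    return doc


sample = (
    "MovieId,title,genres\n"
    "193587,Stray Dogs: Dead Apple (2018),Action|Animation\n"
)
docs = [row_to_document(r) for r in csv.DictReader(io.StringIO(sample))]
print(json.dumps(docs[0]))
```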

Each JSON field has its own data type. ES can infer the data type automatically, and data in ES also supports arrays and nesting.
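To make the idea of automatic type inference concrete, here is a simplified Python imitation. It is a sketch, not the actual ES logic: real dynamic mapping also applies date detection to strings and maps strings to text with a keyword sub-field. The function name `infer_field_type` is hypothetical.

```python
def infer_field_type(value):
    """Guess an ES-style field type from a parsed JSON value.

    A simplified imitation of dynamic mapping, for illustration only.
    """
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_field_type(item)
    return None


document = {
    "title": "Blackbeard's Ghost",
    "genre": ["Children", "Comedy"],
    "year": 1968,
}
inferred = {field: infer_field_type(value) for field, value in document.items()}
print(inferred)
```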

Each document has corresponding metadata, which is used to mark the relevant information of the document. Let's find out what the following metadata contains:

{
  "_index": "movies",
  "_type": "_doc",
  "_id": "2035",
  "_score": 2035,
  "_source": {
    "title": "Blackbeard's Ghost",
    "genre": ["Children", "Comedy"],
    "id": "2035",
    "@version": "1",
    "year": 1968
  }
}

Here, _index is the name of the index the document belongs to; _type is the name of the type the document belongs to; _id is the document's unique ID; _source is the original JSON data of the document, and it is returned by default when searching; _score is the relevance score of the document for this query. Version information, which helps resolve version conflicts, is carried in the document's _version metadata; the @version field inside _source in this example is a field added by Logstash.

After introducing the document, let's take a look at the index:

Index (Index)

Put simply, an index is a collection of documents with similar structures. For example, there can be a customer index, a commodity classification index, and an order index. An index has a name and can contain many documents; it represents one class of similar or identical documents. A commodity index, for instance, might store all commodity data, that is, all commodity documents. Each index has its own Mapping definition, which describes the fields its documents contain and their types. Shards embody the physical side: the data in an index is spread across shards.

In an index, you can set Mapping and Setting. Mapping defines the structure and types of all document fields in the index. Setting mainly specifies how many shards are used and how the data is distributed.

Index has different meanings in different contexts. In ES, an index is a collection of documents, and here the word is a noun. The process of saving a document to ES is also called indexing, where it is a verb. Outside of ES, an index may refer to a B-tree index or an inverted index. The inverted index is an important data structure in ES and will be explained in future articles.

Next, we will explain the types:

Type (Type)

Before 6.0, an index could define multiple Types, and documents of the same Type shared the same structure. Starting with 6.0, Types were deprecated and a newly created index can contain only one Type; from 7.0, that single Type is _doc.

In the versions that allowed multiple Types, a Type was a logical data category within the index, and all documents of one Type had the same fields (Field). In a blog system with a single index, for example, you could define a user data Type, a blog data Type, a comment data Type, and so on.

So far, we have covered documents, indexes, and types. So what is a cluster? What is a node? What is a shard?

First of all, let's take a look at the concept of cluster.

Cluster (Cluster)

An ES cluster is a distributed system designed for high availability. High availability has two aspects: service availability, meaning that the service as a whole keeps working when one node in the cluster stops responding; and data availability, meaning that no data is lost when some nodes in the cluster are lost.

As the number of user requests and the volume of data grow, the system needs to distribute data to additional nodes, which is how it scales horizontally. When a node in the cluster has a problem, the service of the whole cluster is not affected.

In ES's distributed architecture, different clusters are distinguished by name. The default name is elasticsearch, which can be changed in the configuration file or set with -E cluster.name=wupx on the command line. A cluster can contain one or more nodes.

An ES cluster has three colors to indicate how healthy it is:

Green: all primary shards and replica shards are allocated normally.

Yellow: all primary shards are allocated normally, but some replica shards are not.

Red: some primary shards could not be allocated (for example, when a new index is created while the server's disk usage exceeds 85%).
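The three colors can be expressed as a small decision rule. The Python sketch below is only an illustration of that rule; the shard representation (a dict with `primary` and `assigned` flags) and the function name `cluster_health` are assumptions made for the example, not an ES data structure.

```python
def cluster_health(shards):
    """Return 'green', 'yellow' or 'red' for a list of shard copies.

    Each shard copy is a dict with two boolean keys:
    'primary' (is it a primary shard?) and 'assigned' (is it allocated?).
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # at least one primary shard is unallocated
    if any(not s["primary"] and not s["assigned"] for s in shards):
        return "yellow"   # primaries are fine, some replica is unallocated
    return "green"        # every primary and replica is allocated


all_assigned = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": True},
]
replica_missing = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
primary_missing = [
    {"primary": True, "assigned": False},
    {"primary": False, "assigned": False},
]

print(cluster_health(all_assigned))
print(cluster_health(replica_missing))
print(cluster_health(primary_missing))
```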

After learning about clusters, let's take a look at what nodes are.

Node (Node)

A node is actually an ES instance, which is essentially a Java process. Multiple ES processes can be run on a machine, but it is generally recommended that only one ES instance run on a machine in a production environment.

Each node has its own name, which matters when performing operation and maintenance tasks. The name can be set in the configuration file or specified with -E node.name=node1 at startup. After a node starts, it is assigned a UID, which is saved in the data directory.

By default, a node joins a cluster named elasticsearch. If you start several nodes without further configuration, they automatically form one elasticsearch cluster. A single node can also form a cluster on its own.

Candidate Master Node (Master-eligible Node) & Master Node (Master Node)

After a node starts, it is a Master-eligible node by default; this can be disabled by setting node.master: false in the configuration file. A Master-eligible node can take part in the master election and become the Master node. When the first node starts, it elects itself as the Master node.

The cluster state is saved on every node, but only the Master node can modify it. If any node could modify it, the data would become inconsistent.

The cluster state (Cluster State) maintains the necessary information about a cluster, including:

All node information

All indexes and their related Mapping and Setting information

Routing information of shards

Next, let's look at what Data Nodes and Coordinating Nodes are.

Data Node (Data Node) & Coordination Node (Coordinating Node)

As the name implies, the node that can save the data is called Data Node, which is responsible for saving all the data stored on the shard. When the cluster cannot save the existing data, it can solve the storage problem by adding data nodes, which plays a vital role in data expansion.

A Coordinating Node receives requests from the client, distributes them to the appropriate nodes, and finally aggregates the results and returns them to the client. Every node acts as a Coordinating Node by default.
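The aggregation step can be pictured as merging per-shard result lists into one ranked list. The Python sketch below illustrates only that merge, under the assumption that each shard returns hits carrying a `_score`; the function name `merge_shard_results` and the sample hits are made up for the example, and the real search phases in ES are more involved.

```python
import heapq


def merge_shard_results(per_shard_hits, size=10):
    """Merge per-shard hit lists into one global top-`size` list by score."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit["_score"])


# Hypothetical hits returned by three shards for the same query.
shard_0 = [{"_id": "a", "_score": 2.1}, {"_id": "b", "_score": 0.4}]
shard_1 = [{"_id": "c", "_score": 1.7}]
shard_2 = [{"_id": "d", "_score": 3.0}, {"_id": "e", "_score": 0.9}]

top = merge_shard_results([shard_0, shard_1, shard_2], size=3)
print([hit["_id"] for hit in top])  # ['d', 'a', 'c']
```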

There are other node types, which you can learn about:

Other node types

Hot and warm nodes (Hot & Warm Node): hot nodes (Hot Node) are higher-spec machines with better disk throughput and better CPUs. Warm nodes (Warm Node) store older data, and their machine configuration is lower. Using Data Nodes with different hardware configurations to implement a Hot & Warm architecture reduces the cost of cluster deployment.

Machine learning node (Machine Learning Node): runs machine learning jobs, which are used for anomaly detection.

Tribe node (Tribe Node): connects to different ES clusters and allows them to be treated as a single cluster.

Preprocessing node (Ingest Node): before a document is indexed, that is, before the data is written, a pipeline made up of predefined processors can transform and enrich the data.

Each node determines its roles by reading the elasticsearch.yml configuration file at startup, so let's look at how node types are configured.

Configure Node Types

A node in a development environment can assume multiple roles.

In a production environment, you should set up a single role node (dedicated node).
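As a rough illustration of a dedicated node, the Python sketch below renders the elasticsearch.yml lines that leave only one role enabled. It assumes the boolean role settings node.master, node.data, and node.ingest used by 7.x releases (later releases use a different role setting, so check the documentation for your version); the helper name `dedicated_node_config` is hypothetical.

```python
# Role settings assumed for this sketch (7.x-style boolean flags).
ROLE_SETTINGS = ("node.master", "node.data", "node.ingest")


def dedicated_node_config(role):
    """Return elasticsearch.yml lines enabling only the given role.

    role is one of 'master', 'data', 'ingest', or 'coordinating'
    (coordinating-only means every role setting is false).
    """
    valid = {"master", "data", "ingest", "coordinating"}
    if role not in valid:
        raise ValueError(f"unknown role: {role}")
    lines = []
    for setting in ROLE_SETTINGS:
        enabled = setting == f"node.{role}"
        lines.append(f"{setting}: {str(enabled).lower()}")
    return "\n".join(lines)


print(dedicated_node_config("data"))
```

Calling `dedicated_node_config("coordinating")` sets all three flags to false, which leaves the node with only the coordinating responsibility that every node has by default.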

Now that we have covered nodes, let's look at what a shard is.

Shard (Shard)

Because a single machine cannot store a large amount of data, ES can split the data in an index into multiple shards, which can be stored on multiple servers. With sharding, you can scale out, store more data, distribute search and analysis operations across multiple servers, and improve throughput and performance.

The relationship between an index and its shards is shown in the figure above. An ES index contains many shards, and each shard is a Lucene index: a complete search engine that can perform indexing and search tasks independently. A Lucene index consists of many segments, each of which is an inverted index. Each ES refresh generates a new segment containing the data of several documents. Within each segment, the different fields of a document are indexed separately. The value of each field consists of terms (Term); a term is the final result of running the original text through tokenization and language processing (for example, removing punctuation and reducing words to their roots).
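A toy version of the term-to-document mapping described here can be written in a few lines of Python. This is a sketch for intuition only: the analyzer below just lowercases and splits on non-alphanumeric characters (no stemming), and the function names are hypothetical. It does not reflect how Lucene stores segments.

```python
import re
from collections import defaultdict


def analyze(text):
    """Very rough analyzer: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "Blackbeard's Ghost", 2: "Ghost Dog", 3: "Stray Dogs: Dead Apple"}
index = build_inverted_index(docs)
print(index["ghost"])  # [1, 2]
print(index["dog"])    # [2]
```

Because this analyzer does no stemming, "dog" and "dogs" remain separate terms; a real analyzer with language processing could map both to the same root.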

Shards fall into two categories: primary shards (Primary Shard) and replica shards (Replica Shard).

Primary shards solve the problem of horizontal scaling: through primary shards, data can be distributed across all nodes in the cluster. A primary shard is a running Lucene instance. The number of primary shards is specified when the index is created and cannot be changed afterwards, except by using Reindex to rebuild the index.
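One way to see why the primary shard count is fixed is to look at how a document is routed to a shard: roughly, a hash of the routing value (the document ID by default) modulo the number of primary shards. The Python sketch below illustrates that idea. `zlib.crc32` is only a stand-in hash chosen for the example, and `route_to_shard` is a hypothetical name; ES uses its own hash function and a somewhat more elaborate formula.

```python
import zlib


def route_to_shard(routing_value, number_of_primary_shards):
    """Pick a primary shard for a document from its routing value.

    crc32 is a stand-in hash for illustration; ES uses its own hash function.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % number_of_primary_shards


ids = ["2035", "193585", "193587"]
with_3 = {doc_id: route_to_shard(doc_id, 3) for doc_id in ids}
with_4 = {doc_id: route_to_shard(doc_id, 4) for doc_id in ids}
print(with_3)
print(with_4)
```

If the divisor changes, the same ID can map to a different shard, so documents written earlier would no longer be found where the formula points. That is why changing the count requires rewriting the data, for example with Reindex.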

Replica shards solve the problem of data high availability: when a node in the cluster suffers a hardware failure, the data is not actually lost, because a replica shard is a copy of a primary shard. The number of replica shards of an index can be adjusted dynamically. Increasing the number of replicas can also improve query performance (read throughput) to a certain extent.

Let's use an example to understand how primary shards and replica shards distribute data among the nodes in a cluster:

PUT /blogs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

The above is the definition of the blogs index. In settings, number_of_shards sets the number of primary shards to 3, and number_of_replicas set to 1 means that each primary shard has one replica.

The picture above shows a cluster named wupx with three nodes in total. With the blogs index configured as above, when data comes in, ES distributes the primary shards across the three nodes and places the replica of each shard on a different node from its primary. When a node in the cluster fails, ES has a failover mechanism, which will be explained in future articles. In the figure above, you can see the three primary shards spread over three nodes. If you add one more node to the cluster now, does the availability of the system increase?

With this question in mind, let's first take a look at shard settings:

Shard settings

Shard settings are very important in a production environment. Capacity planning often has to be done in advance, because the number of primary shards must be set when the index is created and cannot be modified afterwards. In the previous example, the index is divided into three primary shards, so even if more nodes are added to the cluster, its primary shards can be spread over at most three nodes.
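To visualize how three primary shards and their replicas might be spread over nodes, here is a toy round-robin allocator in Python. It is a sketch for intuition only and not the allocation algorithm ES uses; the only rule it enforces is that copies of the same shard land on different nodes. The function name `allocate` and the node names are made up for the example.

```python
def allocate(nodes, number_of_shards, number_of_replicas):
    """Toy round-robin allocation: returns {node: [shard labels]}.

    P<n> is primary shard n, R<n> is a replica of shard n. Copies of the
    same shard are always placed on different nodes.
    """
    if number_of_replicas >= len(nodes):
        raise ValueError("not enough nodes to separate all copies")
    layout = {node: [] for node in nodes}
    for shard in range(number_of_shards):
        for copy in range(number_of_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout


three_nodes = allocate(["node1", "node2", "node3"], 3, 1)
four_nodes = allocate(["node1", "node2", "node3", "node4"], 3, 1)
print(three_nodes)
print(four_nodes)
```

In this toy layout, the fourth node receives only a replica: the three primary shards still occupy three nodes, which is the limitation described above.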

Setting too many shards also has side effects. It can affect the scoring of search results and the accuracy of statistical results, and too many shards on a single node wastes resources and hurts performance. Since version 7.0, the default number of primary shards in ES has changed from 5 to 1, which also helps avoid over-sharding.

After understanding the terminology of ES, let's make an analogy with the relational database we are familiar with so that we can understand it.

RDBMS & ES

You are probably familiar with relational databases (RDBMS for short), so an analogy between relational databases and ES should make the terms easier to understand:

From the table, it is not difficult to see that relational databases correspond to ES as follows:

A table (Table) in a relational database corresponds to an index (Index) in ES.

A record (Row) in a relational database corresponds to a document (Document) in ES.

A column (Column) in a relational database corresponds to a field (Field) in ES.

The table definition (Schema) in a relational database corresponds to the mapping (Mapping) in ES.

Relational databases support queries and other operations through SQL; ES provides DSL for the same purpose.

ES is the better fit for full-text search or for scoring search results. When transactional requirements are high, a relational database and ES are used together.
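To give a feel for the SQL versus DSL comparison, the Python sketch below builds a search body using the Query DSL match query. The field name, search text, and helper name `match_query` are chosen for illustration, and the SQL in the comment is only a loose comparison of intent, since a full-text match is scored and analyzed rather than a plain string comparison.

```python
import json


def match_query(field, text, size=10):
    """Build a search body using the Query DSL `match` query."""
    return {"query": {"match": {field: text}}, "size": size}


# Loosely comparable in intent to:
#   SELECT * FROM movies WHERE title LIKE '%ghost%' LIMIT 10
body = match_query("title", "ghost")
print(json.dumps(body, indent=2))
```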

To make integration from other languages easy, ES provides a REST API for other programs to call. When a program needs to integrate with ES, it only needs to issue an HTTP request to get the corresponding result. Let's introduce the basic API:

REST API

Open Kibana and go to the Management menu (Management), which provides index management. There you can see the movies index imported in the previous article. Click the index to view its Setting and Mapping information; how to configure them will be described in later articles.

Back to the point, let's show you REST API:

Next, open the Kibana development tools (Dev Tools). With movies as the index, type GET movies and click execute to view information about the movies index, mainly its Mapping and Setting.

Enter GET movies/_count and click execute to see the total number of documents indexed. The running result is as follows:

{
  "count": 9743,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  }
}

Enter the following code

POST movies/_search {}

Click execute to view the first 10 documents and get a feel for the document format.
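The same search can be issued from a program instead of Kibana, since it is just an HTTP request. The Python sketch below uses only the standard library and assumes an ES node listening at http://localhost:9200 without authentication, which is an assumption for the example and not something stated in this article. Building the request is separated from sending it, so the request can be inspected without a running cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of a local ES node


def build_search_request(index, body=None):
    """Build (but do not send) the HTTP request for POST <index>/_search."""
    data = json.dumps(body or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(index, body=None):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_search_request(index, body)) as response:
        return json.load(response)


request = build_search_request("movies")
print(request.get_method(), request.full_url)
```

With a reachable cluster, `run_search("movies")` would return the parsed response, in which the matching documents are listed under `hits`.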

You can also run a wildcard query on index names: GET /_cat/indices/mov*?v&s=index lists the matching indexes.

With GET /_cat/indices?v&s=docs.count:desc, you can sort the indexes by number of documents.

With GET /_cat/indices?v&health=green, you can view the indexes whose status is green.

With GET /_cat/indices?v&h=i,tm&s=tm:desc, you can view the memory used by each index.

ES also provides an API for checking the health of the cluster. Use GET _cluster/health to check it. The returned result is as follows:

{
  "cluster_name": "wupx",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 10,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

You can see that the cluster is named wupx and its status is green. There are 2 nodes in total, both acting as Data Nodes, and there are 10 primary shards.
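Reading the response by eye works in Dev Tools; in a script, the same fields can be pulled out of the JSON. The Python sketch below parses a response containing a subset of the fields shown above and produces a one-line summary. The function name `summarize_health` is hypothetical, and the sample string simply repeats values from the response in this article.

```python
import json


def summarize_health(raw_json):
    """Return a one-line summary of a _cluster/health response."""
    health = json.loads(raw_json)
    return (
        f"{health['cluster_name']}: {health['status']}, "
        f"{health['number_of_nodes']} nodes "
        f"({health['number_of_data_nodes']} data), "
        f"{health['active_primary_shards']} primary shards, "
        f"{health['unassigned_shards']} unassigned"
    )


sample_response = json.dumps({
    "cluster_name": "wupx",
    "status": "green",
    "number_of_nodes": 2,
    "number_of_data_nodes": 2,
    "active_primary_shards": 10,
    "unassigned_shards": 0,
})
summary = summarize_health(sample_response)
print(summary)
```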

That concludes the introduction to the REST API; you can explore the rest on your own.

Observant readers may wonder how Kibana ended up with a Chinese interface. Since version 7.0, Kibana ships with official Chinese resource files (located under node_modules/x-pack/plugins/translations/translations/ in the Kibana directory). Edit the kibana.yml file in the config directory, add the configuration item i18n.locale: "zh-CN", and restart Kibana to switch the interface to Chinese.

This article covered the concepts of document, index, cluster, and node, and showed that each node in a cluster can play different roles. It also explained what primary shards and replica shards are and what role they play in a distributed system, used an analogy with relational databases to make the terms easier to understand, and introduced the REST API. Finally, the ES terms are summarized in a mind map; the source file can be obtained by replying "es" to the official account Wu Peixuan.
