Elasticsearch from Basic Concepts to Production Use

This article introduces Elasticsearch from basic concepts through to production use. Many people run into the same questions in day-to-day operation, so the notes below collect simple, practical guidance; I hope they help answer your doubts.
Basic concepts
If you are new to Elasticsearch (ES), there are some basic concepts to learn first.
The Elasticsearch project grew out of Apache Lucene, the excellent Java search engine library. Lucene also spawned another well-known search project, Solr. Whether you pick Elasticsearch or Solr, the underlying data and search engine is Lucene. ES is a distributed, real-time search engine built on the Lucene core, and it particularly excels at distributed clustering and horizontal scaling: it can easily run and manage thousands of Lucene instances.
The highest-level unit in the ES architecture is the Cluster. A cluster is a collection of ES nodes and indexes.
A node (Node) is an instance of ES. It can be a standalone server or just an ES process running on a server. Note that a server is not the same thing as a node: a VM or physical server can host many ES processes, each of which is a node. A node belongs to exactly one cluster. There are different types of nodes; the two most important are the data node (Data Node) and the master-eligible node (Master-Eligible Node), and a node can have several roles at once. Data nodes run all data operations, that is, storing, indexing, and searching data. Master-eligible nodes vote to elect the master node, which runs cluster and index management.
An index (Index) is a high-level abstraction over data. The index itself does not store data; it is just another abstraction over the data actually stored. Any action performed on data, such as insertion, deletion, indexing, and search, goes through an index. An index belongs to exactly one cluster and consists of shards.
A shard (Shard) is an instance of Apache Lucene. A shard can hold many documents, and it is where data is actually stored, indexed, and searched. A shard belongs to exactly one node and one index. There are two types of shards: primary shards and replica shards. The two are essentially the same: they hold the same data, and searches run across all shards in parallel. Of all the shards holding the same data, one is the primary shard, and it is the only one that can accept indexing requests. If the node holding the primary shard dies, a replica shard automatically takes over as primary, and ES then creates a new replica and copies the data over.
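To make this concrete, here is a minimal sketch (assuming a local node at localhost:9200 and a hypothetical index name my-index) that creates an index with three primary shards and one replica per primary, using plain Java 11 HTTP:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateIndex {
    public static void main(String[] args) throws Exception {
        // Index settings: 3 primary shards, each with 1 replica.
        String settings = "{\"settings\":{\"number_of_shards\":3,\"number_of_replicas\":1}}";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/my-index"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settings))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // expect {"acknowledged":true,...}
    }
}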
In-depth understanding
If you want to run a system, you first need to understand it. Now that the basic concepts are in place, let's look at how the various parts of Elasticsearch actually work.
Quorum
It is important to understand that Elasticsearch is governed democratically: nodes vote to decide who becomes the master node, the boss of the cluster. The master node runs many cluster-management processes and has the final say in the cluster. Elections are restricted: only master-eligible nodes can take part in the election and become master. A node is master-eligible if its configuration sets:
node.master: true
When the cluster starts, or when the master node leaves the cluster, all master-eligible nodes elect a new master. You therefore want 2n + 1 master-eligible nodes; otherwise a 50/50 vote is possible, which can split the cluster's brain.
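As a sketch (node names are hypothetical), a dedicated master-eligible node in a three-master ES 7.x cluster might carry the following settings in its elasticsearch.yml; cluster.initial_master_nodes is only needed the very first time the cluster bootstraps, and newer versions replace the boolean flags with node.roles:

node.name: master-1
node.master: true
node.data: false
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]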
Node joining
When an ES process starts, it is alone and on its own. How does it know which cluster it belongs to? There are different ways to accomplish this, but the usual one is a mechanism called seed hosts (Seed Hosts).
Elasticsearch nodes constantly talk to every other node they have seen, so at first a node only needs to contact a few other nodes to learn about the whole cluster. This gossiping does not go on forever: as long as a node is not yet part of a cluster, it shares information about the other nodes it discovers; once it has joined a cluster, it stops doing so and relies on the elected master to distribute information about changes. This saves a lot of unnecessary network gossip. In ES 7.x, nodes only gossip about nodes they consider master-eligible; the discovery process ignores everything else.
Take a three-node cluster as an example:
Initial state:
Nodes A and C only know about B. B is the seed host. Seed hosts are either provided to ES in a separate configuration file or written directly into elasticsearch.yml.
Node A connects with B and exchanges information:
Once node A connects to B, B knows that A exists. For A, nothing changes.
Node C connects and shares information with B
Now C connects and talks to B. B tells C that A exists. C and B now both know all the nodes in the cluster. The next time A reconnects to B, it will learn that C exists too.
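In elasticsearch.yml, the setup above might look like this sketch on nodes A and C (the host name is hypothetical; 9300 is the default transport port):

discovery.seed_hosts: ["node-b.example.com:9300"]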
Segment merging
As we said earlier, data is stored in shards, and shards persist their data as files in the file system. In Lucene and Elasticsearch, these files are called segments (Segments). A shard can have anywhere from one to thousands of segments.
Segments are real, physical files that you can see in the data directory of your ES installation, so working with them has file-system overhead: to read a segment you must locate and open its file, and opening a great many files is expensive. Segments in Lucene are immutable: they can be written once and never changed. Every document put into ES creates a segment that contains only that single document. So would a cluster with a billion documents have a billion segment files?
Actually, no. In the background, Lucene performs segment merging. Merging does not modify existing segments; instead, it takes two smaller segments, writes their combined data into a new segment, and then cleans up the two merged segments. Lucene merges continuously, keeping the number of segments from growing too large.
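You can inspect the segments of your indexes through the _cat/segments API. A minimal sketch, again assuming a node at localhost:9200:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ListSegments {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One row per segment: index, shard, segment name, doc count, size on disk, ...
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/_cat/segments?v")).GET().build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}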
Request routing
In Elasticsearch, you can run any command against any node in the cluster and get the same result. At the lowest level, though, a document lives in exactly one primary shard and its replicas, and ES keeps no mapping of which particular document lives in which particular shard.
For a search, the ES node receiving the request broadcasts it to all shards of the index, and each shard checks its segments for matching documents. For an insert, the receiving node routes the document to one primary shard, chosen by hashing the document's routing value (its ID by default), and the document is written to that primary shard and all of its replicas.
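As an illustration of the routing step: the target shard is the hash of the routing value modulo the number of primary shards. The sketch below uses Java's String.hashCode purely for demonstration; real Elasticsearch uses a Murmur3 hash of the _routing value.

public class Routing {
    // Illustrative only: pick the primary shard for a document ID.
    static int shardFor(String routingValue, int numberOfPrimaryShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(routingValue.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("doc-42", 3)); // always the same shard for the same ID
    }
}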
Production practice
This last part covers how to deploy and manage Elasticsearch in production.
One of the most common questions in Elasticsearch practice is how to estimate the size of the cluster you need, including the number of nodes and the hardware resources.
Memory
First of all, consider memory usage, which limits all other resources.
Java heap
ES is developed in Java, and Java uses a heap, which you can think of as memory reserved by the JVM. A complete treatment of the heap would triple the length of this article, so here is the single most important rule:
Use as much heap as possible, but no more than 30 GB.
There is a secret about the heap that many people don't know: every object on the heap needs a unique address, an object pointer. The length of an address is fixed, which limits the number of objects that can be addressed. In short, beyond roughly 32 GB Java can no longer use compressed object pointers; pointers become twice as large, wasting memory and CPU-cache bandwidth, and everything gets noticeably slower. Do not cross this threshold (about 32 GB).
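In practice, the heap is set in Elasticsearch's config/jvm.options file (or via the ES_JAVA_OPTS environment variable), with minimum and maximum kept equal and below the compressed-pointer threshold, for example:

-Xms30g
-Xmx30g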
Community benchmarks of Elasticsearch across different file systems, heap sizes, and combinations of FS and BIOS settings bear this out (the original post includes the benchmark charts). Starting at a heap size of 32 GB, the 50th-percentile access latency (smaller is better) suddenly deteriorates, and the throughput results (bigger is better) look similar.
In short: use a 29 to 30 GB heap, use XFS, and enable hardware prefetch and LLC prefetch whenever possible.
File caching
Most people run Elasticsearch on Linux, and Linux uses spare RAM as a file system cache. A common recommendation is 64 GB of RAM for ES servers: half for the FS cache and half for the heap. Large ES clusters, such as logging clusters, can benefit greatly from a big FS cache; if all your indexes fit in the heap, you don't need nearly as much.
Elasticsearch 7.x also requires a certain amount of direct memory outside the heap and has other overhead, which is why the recommendation is to keep the heap at no more than 50% of physical memory. That is an upper bound: a 32 GB heap on a 64 GB host may not leave enough room for the file system cache. The FS cache is key to Elasticsearch/Lucene performance, and a smaller heap sometimes performs better (it leaves more room for the FS cache and makes GC cheaper).
CPU
This depends on what the cluster is doing. If you do a lot of indexing, you need more and faster CPUs than for a plain logging workload. Generally speaking, 8 cores are more than sufficient for logging, but usage varies widely in practice.
Disk
If your indexes fit in memory, the disk only matters when a node is cold, for example right after a restart. How much data you can actually store depends on the index layout: each shard is a Lucene instance with its own memory requirements, which bounds the number of shards a given heap can support. Typically you can put all data disks into a RAID 0: replication happens at the Elasticsearch level, so losing a node does not matter. Do not use LVM across multiple disks; LVM can only write to one disk at a time and gives you none of the benefits of multiple disks.
About file system and RAID settings:
Scheduler: cfq and deadline are better than noop; Kyber might be good if you have NVMe (not rigorously tested)
Queue depth: as high as possible
Readahead: yes, use it
RAID block size: no effect
FS block size: no effect
FS type: XFS is better than ext4
Index layout
This depends largely on the use case. Let's take a logging cluster as an example.
Shards
In short:

For write-heavy workloads: primary shards = number of nodes
For read-heavy workloads: primary shards * replication factor = number of nodes
More replicas = higher search performance
Write throughput can be estimated with a simple formula:

node throughput * number of primary shards

The reasoning is simple: with only one primary shard, you can write data only as fast as a single node can, because a shard lives on exactly one node. If you really want to optimize write performance, make sure each node holds exactly one shard (primary or replica), because a replica receives the same writes as its primary and writes are largely bound by disk I/O.
Note: with many indexes this estimate may not hold, and the bottleneck may be something else entirely.
If instead you want to optimize search performance, search throughput can be estimated with:

node throughput * (number of primary shards + number of replica shards)
For search, primary and replica shards are essentially the same, so to improve search performance, simply increase the number of replicas.
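A back-of-envelope sketch of both formulas (the per-node throughput number is a made-up placeholder; measure your own):

public class ThroughputEstimate {
    public static void main(String[] args) {
        double nodeDocsPerSecond = 10_000; // hypothetical per-node indexing throughput
        int primaryShards = 3;
        int replicasPerPrimary = 1;

        // Write throughput scales with primaries (assuming at most one shard per node).
        double writeEstimate = nodeDocsPerSecond * primaryShards;
        // Search throughput scales with all shard copies, primaries and replicas alike.
        double searchEstimate = nodeDocsPerSecond * (primaryShards + primaryShards * replicasPerPrimary);

        System.out.printf("write ~ %.0f docs/s, search ~ %.0f units%n", writeEstimate, searchEstimate);
    }
}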
Scale and size
How big an index should be is a question that comes up a lot. A rule of thumb from experience:
30 GB heap = at most 140 shards per node
With more than 140 shards per node, the Elasticsearch process may crash with out-of-memory errors, because each shard is a Lucene instance and every instance needs a certain amount of memory, so there is a limit to how many shards a node can hold.

Given the number of nodes, the per-node shard limit, and the index layout, you can work out how many indexes you can hold:

number of indexes = (140 * number of nodes) / (number of primary shards * replication factor)

From that you can derive the maximum index size:

index size = (number of nodes * disk size per node) / number of indexes
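Plugging hypothetical numbers into these formulas, say 10 nodes with 2 TB of disk each and indexes with 4 primary shards at a replication factor of 2 (one primary plus one replica):

public class SizingEstimate {
    public static void main(String[] args) {
        int nodes = 10;
        double diskPerNodeTb = 2.0;
        int primaryShardsPerIndex = 4;
        int replicationFactor = 2; // primary + 1 replica

        // 140 shards per node is the rule of thumb for a 30 GB heap.
        int maxIndexes = (140 * nodes) / (primaryShardsPerIndex * replicationFactor);
        double maxIndexSizeTb = (nodes * diskPerNodeTb) / maxIndexes;

        System.out.println("max indexes: " + maxIndexes);        // 175
        System.out.println("max index size (TB): " + maxIndexSizeTb); // ~0.114
    }
}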
Note that larger indexes are also relatively slow. For logging that is acceptable up to a point, but for genuinely search-heavy applications you should size indexes to the amount of memory you have.
Segment merging
Remember that every segment is an actual file on the file system. A search query essentially goes to every shard of the index, and from there to every segment in each shard. Too many segment files can drive the cluster's read IOPS up until it becomes unusable, so it is best to keep the segment count as low as possible.
There is a force_merge API that merges segments down to a given number, such as 1. If you rotate indexes, for example because you use Elasticsearch for logging, it is a good idea to run a force merge during off-peak hours when the cluster is not in active use.
Force merging consumes a lot of resources and noticeably slows the cluster, but if you have many indexes it is still a must.
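A sketch of calling the force-merge API on a hypothetical rotated logging index, merging each of its shards down to a single segment:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ForceMerge {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // POST /<index>/_forcemerge?max_num_segments=1 merges each shard down to one segment.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/logs-2025.04.03/_forcemerge?max_num_segments=1"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}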
Cluster layout
For anything beyond a minimal setup, it is best to use dedicated master-eligible nodes. Keep 2n + 1 master-eligible nodes to guarantee a quorum. Data nodes, on the other hand, you just want to be able to add at any time without further thought. And you do not want high load on a data node to affect the master nodes.
Finally, master nodes are ideal candidates for seed hosts. Remember, seed hosts are the easiest way to do node discovery in Elasticsearch. Because master nodes rarely change, they are the best choice, and they already know all the other nodes in the cluster.
Master nodes can be quite small: one core and perhaps 4 GB of memory meet the needs of most clusters. As always, watch actual usage and adjust accordingly.
Monitoring
Monitoring is a good thing, and that holds for Elasticsearch too. ES exposes a large number of metrics, conveniently served as JSON, which makes it very easy to feed them into a monitoring tool. Some useful metrics to watch:

segment count, heap utilization, heap GC time, average search / indexing / merge times, IOPS, and disk utilization.
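Most of these metrics can be pulled from the stats APIs as JSON; here is a minimal sketch polling node-level JVM and indexing stats from a node assumed to run on localhost:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NodeStats {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // JVM (heap, GC) and indices (search, indexing, merges, segments) stats per node.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/_nodes/stats/jvm,indices")).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // feed this JSON into your monitoring tool
    }
}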
This concludes our study of Elasticsearch from basic concepts to production use. Pairing theory with practice is the best way to learn, so go and try it out!