Authoritative Guide to Elasticsearch Search Tuning (Part 1 of 3)

Updated 2025-04-28. From: SLTechnology News&Howtos


Original English text: https://qbox.io/blog/elasticsearch-search-tuning-5-0-ultimate-guide

Author: Adam Vanderbush

Translator: Yang Zhentao

Contents

1. Document modeling
2. Global ordinals and latency
3. Multi-generational relationships
4. Allocate memory for the file system cache

The authoritative guide to Elasticsearch search tuning is a series of articles published by Qbox on its blog. This article, the first in the series, shares experience in optimizing query performance from the angles of document modeling, memory allocation, file system caching, GC, and hardware.

Elasticsearch 5.0.0 is a major release following 2.x and brings a lot of new things. Elasticsearch is now part of the Elastic Stack, with version numbers aligned across the whole stack: Kibana, Logstash, Beats, and Elasticsearch are all at version 5.0.

This version of Elasticsearch is by far the fastest, safest, most flexible, and easiest to use, and brings a lot of improvements and new features.

We have already covered some basic experience and methods of performance tuning in the "Authoritative Guide to Elasticsearch Performance Tuning" series, which explains the most critical system settings and metrics at each step. That series has three parts:

- The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)
- The Authoritative Guide to Elasticsearch Performance Tuning (Part 2)
- The Authoritative Guide to Elasticsearch Performance Tuning (Part 3)

Indexing decisions matter too, because they have a big impact on how the data can be searched. If a field is a string, does it need to be analyzed (tokenized) or normalized? If so, how? If it is a numeric field, what precision is required? Many other types, such as date-time, geo shapes, and parent-child relationships, need even more careful consideration.

We also covered index performance in the "How to Maximize Elasticsearch Indexing Performance" tutorials, which introduce general techniques for maximizing indexing throughput and reducing the monitoring and management load. That tutorial has three parts:

- How to Maximize Elasticsearch Indexing Performance (Part 1)
- How to Maximize Elasticsearch Indexing Performance (Part 2)
- How to Maximize Elasticsearch Indexing Performance (Part 3)

The purpose of this article is to recommend search tuning techniques, strategies, and features for Elasticsearch 5.0 and above.

1. Document modeling

Arrays of inner objects do not work the way you might expect. Lucene has no concept of inner objects, so Elasticsearch flattens the object hierarchy into a simple list of field names and values. Take the following document as an example:

curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '
{
  "group": "fans",
  "user": [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"}
  ]
}'

The request is internally converted to the following document form:

{"group": "fans", "user.first": ["alice", "john"], "user.last": ["smith", "white"]}
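The flattening step can be sketched in a few lines of Python. This is only an illustrative model, not Elasticsearch code: `flatten_object_array` and `matches_all` are hypothetical helpers, and lowercasing stands in for the analyzer. It shows why a search for a user with first name Alice and last name Smith matches this document even though no such user exists.

```python
from collections import defaultdict


def flatten_object_array(doc, field):
    """Model how an array of inner objects is flattened into multi-valued fields.

    Illustrative only: lowercasing stands in for the standard analyzer.
    """
    flat = {k: v for k, v in doc.items() if k != field}
    values = defaultdict(set)
    for obj in doc.get(field, []):
        for key, value in obj.items():
            values[f"{field}.{key}"].add(value.lower())
    # Sort so the result is deterministic and easy to compare.
    flat.update({k: sorted(v) for k, v in values.items()})
    return flat


def matches_all(flat_doc, conditions):
    """Return True if every named field contains the wanted term (a bool/must of matches)."""
    return all(term.lower() in flat_doc.get(name, []) for name, term in conditions.items())


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
flat = flatten_object_array(doc, "user")
print(flat)
# Both terms are present in the flattened fields, so the condition holds
# although they come from two different users.
print(matches_all(flat, {"user.first": "Alice", "user.last": "Smith"}))
```

The per-object pairing of first and last name is gone after flattening, which is exactly the information a cross-field query would need.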

In this flattened form the association between alice and white is lost, so a search for a user named Alice Smith would still match the document. If you need to index an array of objects and keep each object in the array independent, use the nested datatype instead of the object datatype. Internally, nested objects index each object in the array as a separate hidden document, which means each nested object can be queried independently of the others with a nested query:

# Map the user field as nested
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {"type": "nested"}
      }
    }
  }
}'

# Index a document with two users
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '
{
  "group": "fans",
  "user": [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"}
  ]
}'

# No match: Alice and Smith are not in the same nested object
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            {"match": {"user.first": "Alice"}},
            {"match": {"user.last": "Smith"}}
          ]
        }
      }
    }
  }
}'

# Match: Alice and White are in the same nested object;
# inner_hits highlights the matching nested document
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            {"match": {"user.first": "Alice"}},
            {"match": {"user.last": "White"}}
          ]
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {"user.first": {}}
        }
      }
    }
  }
}'
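To see why a query for Alice and Smith finds nothing while a query for Alice and White hits, the following Python sketch models nested matching: each object in the array is checked on its own, the way a hidden nested document would be. `nested_match` is a hypothetical helper written for illustration, with exact case-insensitive comparison standing in for the match query.

```python
def nested_match(doc, path, conditions):
    """Return the inner objects under `path` that satisfy every condition.

    Models a nested query with a bool/must clause: all conditions must hold
    within the same inner object. Illustrative only, not Elasticsearch code.
    """
    hits = []
    for obj in doc.get(path, []):
        if all(str(obj.get(key, "")).lower() == want.lower() for key, want in conditions.items()):
            hits.append(obj)
    return hits


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

# No single user is both "Alice" and "Smith", so nothing matches.
print(nested_match(doc, "user", {"first": "Alice", "last": "Smith"}))
# Alice White is one object, so it matches; the returned list plays the role of inner_hits.
print(nested_match(doc, "user", {"first": "Alice", "last": "White"}))
```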

Nested objects are useful when there is one main entity, such as a blog post, with a limited number of closely related but less important entities, such as comments. It is useful to be able to find blog posts based on the content of their comments, and the nested query and filter provide fast query-time joins.

The disadvantages of the nested model are as follows:

To add, change, or delete a nested document, the whole document must be reindexed; the more nested documents there are, the greater this cost.

A search request returns the entire document rather than only the matching nested documents. To see which nested documents matched, you have to ask for them explicitly with inner_hits, as in the last query above.

Sometimes you need a complete separation between the main document and its associated entities. The parent-child relationship provides this separation.

You establish a parent-child relationship between documents in the same index by making one mapping type the parent of another:

# Declare my_parent as the parent type of my_child
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {"type": "my_parent"}
    }
  }
}'

# Index a parent document
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_parent/1?pretty' -H 'Content-Type: application/json' -d '
{"text": "This is a parent document"}'

# Index a child document. The end of this URL was lost in the source;
# the ID 2 and parent=1 are assumed by analogy with the next request.
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/2?parent=1&pretty' -H 'Content-Type: application/json' -d '
{"text": "This is a child document"}'

# Index another child and refresh so that both children are searchable
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/3?parent=1&refresh=true&pretty' -H 'Content-Type: application/json' -d '
{"text": "This is another child document"}'

# Find parents that have children matching the query
curl -XGET 'ES_HOST:ES_PORT/my_index/my_parent/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": {
        "match": {"text": "child document"}
      }
    }
  }
}'

Parent-child joins are a useful way to manage relationships between entities, especially when index-time performance matters more than search-time performance, but they come at a significant cost: parent-child queries can be 5 to 10 times slower than the equivalent nested query.

2. Global ordinals and latency

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt whenever the index changes.

The more parents there are in a shard, the longer global ordinals take to build. Parent-child is best suited to situations where each parent has many children, rather than many parents with few children.

Global ordinals are built lazily by default: the first parent-child query or aggregation after a refresh triggers the build. This can introduce a significant latency spike for your users. You can use eager_global_ordinals to shift the cost of building global ordinals from query time to refresh time by mapping the _parent field as follows:

curl -XPUT 'ES_HOST:ES_PORT/company' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch",
        "fielddata": {"loading": "eager_global_ordinals"}
      }
    }
  }
}'

With this mapping, global ordinals for the _parent field are built before a new segment becomes visible to search.

With many parents, global ordinals can take several seconds to build. In that case, increase refresh_interval so that refreshes happen less often and the global ordinals stay valid for longer. This significantly reduces the CPU cost of rebuilding global ordinals every second.
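As a rough sketch of that adjustment, the Python below builds an update-settings request that raises refresh_interval. The index name company comes from the mapping above; the 30s value and the helper name `refresh_interval_request` are assumptions chosen for illustration, and the function only builds the request rather than sending it.

```python
import json


def refresh_interval_request(index, interval):
    """Build (method, path, body) for an index settings update.

    The request targets the index settings endpoint and changes only
    refresh_interval. Nothing is sent; pass the parts to any HTTP client.
    """
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


# "30s" is an assumed value; tune it to how stale search results may be.
method, path, body = refresh_interval_request("company", "30s")
print(method, path, body)
```

A longer interval trades freshness of search results for fewer global ordinal rebuilds, so the right value depends on how quickly new documents must become searchable.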

3. Multi-generational relationships

The ability to join multiple generations of data (such as grandparents and grandchildren) sounds attractive, but you need to consider the costs involved:

- The more joins you have, the worse performance will be.
- Each generation of parents needs to keep its string _id fields in memory, which can consume a lot of RAM.

When you consider your relationship schemes and whether parent-child is right for you, keep the following advice in mind:

- Use parent-child relationships sparingly, and only when there are many more children than parents.
- Avoid using multiple parent-child joins in a single query.
- Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
- Keep parent IDs as short as possible so that they compress better in doc values and use less memory when transiently loaded.

4. Allocate memory for file system cache

When running Elasticsearch, memory is one of the key resources to monitor closely. Elasticsearch and Lucene use memory in two ways: the JVM heap and the file system cache. Because Elasticsearch runs in the Java virtual machine (JVM), the duration and frequency of JVM garbage collection (GC) also need to be monitored.

JVM heap memory

For Elasticsearch, a "just right" JVM heap size is very important: it should be neither too large nor too small, for reasons described below. As a rule of thumb, allocate less than 50% of the available RAM to the JVM heap, and never more than 32 GB.

Giving Elasticsearch a smaller heap leaves more memory for Lucene, which relies heavily on the file system cache to serve requests quickly. However, the heap must not be too small either, because the application may then run into out-of-memory errors or reduced throughput from the constant short pauses caused by frequent GC.
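The rule of thumb above amounts to simple arithmetic, sketched here in Python. `suggested_heap_gb` is a hypothetical helper, and its result should be read as an upper bound to start from rather than a tuned value, since the text recommends staying below half of RAM.

```python
def suggested_heap_gb(total_ram_gb, fraction=0.5, cap_gb=32.0):
    """Suggest an upper bound for the JVM heap in GB from the rule of thumb.

    Takes at most `fraction` of RAM and never more than `cap_gb`; whatever
    remains is left to the operating system for the file system cache.
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb * fraction, cap_gb)


for ram in (16, 64, 128):
    heap = suggested_heap_gb(ram)
    print(f"{ram} GB RAM -> heap up to {heap:g} GB, at least {ram - heap:g} GB for the file system cache")
```

On a large machine the cap dominates, so most of the RAM ends up serving the file system cache.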

The default JVM heap size that Elasticsearch installs with is small for most workloads (1 GB in versions before 5.0, 2 GB as of 5.0). In versions before 5.0 you can export the desired heap size as an environment variable and restart Elasticsearch:

export ES_HEAP_SIZE=10g

Alternatively, pass the variable on the command line each time you start Elasticsearch. Either way, the minimum and maximum heap sizes are set to the same value, which prevents the heap from being resized:

ES_HEAP_SIZE="10g" ./bin/elasticsearch

Elasticsearch 5.0 no longer accepts ES_HEAP_SIZE. There, set -Xms and -Xmx to the same value in config/jvm.options, or pass them through ES_JAVA_OPTS:

ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch

Each of these examples sets the heap size to 10 GB. To verify that the heap size was set successfully, run:

curl -XGET 'http://ES_HOST:9200/_cat/nodes?h=heap.max'

The returned output shows that the maximum heap memory has been updated correctly.

Garbage collection

Elasticsearch relies on GC to free heap memory. Because GC itself consumes resources (in order to free up resources!), you should keep an eye on its frequency and duration to see whether the heap size needs adjusting. Setting the heap too large leads to long GC times; these excessive pauses are dangerous because the cluster may mistakenly conclude that the node has dropped off the network.

Elasticsearch relies heavily on the file system cache to make searches fast. In general, make sure that at least half of the available memory goes to the file system cache so that Elasticsearch can keep the hot regions of the index in physical memory.
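One way to watch GC frequency and duration is to read the per-collector counters in the JVM section of the nodes stats API. The Python sketch below summarises those counters for a single node. `gc_summary` is a hypothetical helper and the sample numbers are made up; only the field names (`collection_count`, `collection_time_in_millis`) are taken from the API response.

```python
def gc_summary(node_stats):
    """Summarise GC activity per collector from one node's JVM stats.

    Expects the `jvm` section of a node entry in the nodes stats response,
    which reports collection_count and collection_time_in_millis per collector.
    """
    summary = {}
    collectors = node_stats["jvm"]["gc"]["collectors"]
    for name, stats in collectors.items():
        count = stats["collection_count"]
        total_ms = stats["collection_time_in_millis"]
        summary[name] = {
            "count": count,
            "total_ms": total_ms,
            # Average time per collection; 0.0 if the collector never ran.
            "avg_ms": total_ms / count if count else 0.0,
        }
    return summary


# Made-up sample numbers in the shape of GET /_nodes/stats/jvm for one node.
sample = {
    "jvm": {
        "gc": {
            "collectors": {
                "young": {"collection_count": 200, "collection_time_in_millis": 4000},
                "old": {"collection_count": 4, "collection_time_in_millis": 6000},
            }
        }
    }
}
for name, s in gc_summary(sample).items():
    print(f"{name}: {s['count']} collections, average {s['avg_ms']:.0f} ms each")
```

The counters are cumulative since node start, so comparing two snapshots taken some time apart gives a better picture of current GC behaviour than a single reading.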

Use faster hardware

If your searches are I/O-bound, consider giving more memory to the file system cache (see above) or buying faster drives. In particular, SSDs are known to perform much better than spinning disks. Use local storage whenever possible, avoid remote or network file systems such as NFS or SMB, and be wary of virtualized storage such as Amazon EBS.

Elasticsearch works fine with virtualized storage, which is appealing because it is quick and simple to set up, but it is unfortunately inherently slower than dedicated local storage. If you put an index on EBS, be sure to use provisioned IOPS, otherwise operations can quickly be throttled.

If your search is limited by CPU, you should consider buying a faster CPU.
