How to understand the internal data structure of Elasticsearch

Updated: 2025-04-06. Source: SLTechnology News&Howtos (Shulou.com), Development section. Report dated 06/02.

This article explains how to understand the internal data structures of Elasticsearch: the inverted index, Doc Values, fielddata, the _source field, and stored fields. It draws on the official documentation and other materials and pairs each structure with a short example.

0. Background: how Elasticsearch stores data

As the official Elastic documentation says:

One of the features of Elasticsearch is distributed document storage.

Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents.

When there are multiple Elasticsearch nodes in the cluster, the stored documents are distributed throughout the cluster and can be accessed immediately from any node.

When a document is stored, it is indexed and becomes fully searchable in near real time, within about 1 second (the default refresh interval is 1s).

How to achieve fast indexing and full-text retrieval?

Elasticsearch uses a data structure called an inverted index, which supports very fast full-text search.

The inverted index lists each unique word that appears in any document and identifies all documents in which each word appears.

An index can be thought of as an optimized collection of documents, and each document is a collection of fields, which are the key-value pairs that contain the data.

By default, Elasticsearch indexes all data in every field, and each indexed field has a dedicated, optimized data structure.

For example, text fields are stored in an inverted index, and numeric and geographic fields are stored in a BKD tree.

Data type                 Data structure
text / keyword            inverted index
numeric / geolocation     BKD tree

Each field type has its own optimized data structure, and using these per-field structures to assemble and return search results is what makes Elasticsearch fast.
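
To give a feel for why numeric fields get a different structure from text fields, here is a much-simplified, one-dimensional stand-in written in Python. It is not a BKD tree and not how Lucene implements one: it only sorts values into fixed-size leaf blocks and skips whole blocks during a range query, which is the basic idea a block-based tree builds on. The function names, the block size, and the price data are all hypothetical.

```python
def build_blocks(points, block_size=4):
    """Sort (value, doc_id) pairs and cut them into fixed-size leaf blocks."""
    ordered = sorted(points)
    blocks = []
    for start in range(0, len(ordered), block_size):
        leaf = ordered[start:start + block_size]
        # Remember each block's value range so queries can skip it cheaply.
        blocks.append({"min": leaf[0][0], "max": leaf[-1][0], "points": leaf})
    return blocks


def range_query(blocks, low, high):
    """Return doc IDs whose value lies in [low, high], skipping blocks that cannot match."""
    hits = []
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # the whole block lies outside the requested range
        hits.extend(doc_id for value, doc_id in block["points"] if low <= value <= high)
    return sorted(hits)


# Hypothetical (price, doc_id) pairs.
prices = [(30, 1), (5, 2), (12, 3), (99, 4), (45, 5), (18, 6), (7, 7), (60, 8)]
blocks = build_blocks(prices)
print(range_query(blocks, 10, 50))
```

An inverted index would need one lookup per distinct value to answer the same range query, which is why range-heavy numeric and geo data benefit from a structure organized by value ranges.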

1. Inverted index

1.1 Inverted index definition

When facing a massive amount of content, the inverted index is the key to quickly finding the content that contains a user's query terms.

An inverted index is an efficient way to implement the mapping between terms and the documents that contain them.

The index at the back of a book is a familiar example: it maps core keywords to the page numbers on which they appear.

Imagine how slow it would be to search the whole book for a keyword without that index page, and the value of an index becomes obvious.

1.2 Inverted index example

Take an example from the official documentation:

Suppose we have two documents, and the content field of each document contains the following:

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

Before indexing, the text goes through analysis, that is, tokenization and normalization.

What ends up in the index therefore depends on the choice of analyzer.

Splitting each content field into terms and listing the documents in which each term appears gives the inverted index below. The terms are shown before normalization; the default standard analyzer would also lowercase them, merging Quick with quick and The with the:

Term      Doc_1   Doc_2
Quick             X
The       X
brown     X       X
dog       X
dogs              X
fox       X
foxes             X
in                X
jumped    X
lazy      X       X
leap              X
over      X       X
quick     X
summer            X
the       X

As shown above, each term that occurs in the documents is associated with the list of documents that contain it.
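
The construction can be sketched in a few lines of Python. This is a toy model rather than Lucene's implementation: the two analyzer functions are hypothetical stand-ins, one splitting on whitespace only and one that also lowercases, which is roughly what the standard analyzer does for this particular input.

```python
from collections import defaultdict

# The two example documents, keyed by document ID.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents, analyze):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    # Sort terms so the result reads like a term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def whitespace_analyzer(text):
    """Split on whitespace only, with no normalization."""
    return text.split()


def lowercase_analyzer(text):
    """Split on whitespace and lowercase every token."""
    return text.lower().split()


raw_index = build_inverted_index(docs, whitespace_analyzer)
lowercase_index = build_inverted_index(docs, lowercase_analyzer)

print(raw_index["quick"])        # documents containing the exact term "quick"
print(lowercase_index["quick"])  # documents containing "quick" after lowercasing
```

With whitespace splitting alone, a search for "quick" misses the second document because it was indexed as "Quick"; lowercasing at index time makes both documents match, which is why the analyzer choice matters.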

1.3 Inverted index features

Created at index time

Serialized to disk

Very fast for full-text search

Not suitable for sorting

Enabled by default

1.4 Inverted index applicable scenarios

Queries

Full-text search

2. Doc Values (forward index)

2.1 Doc Values definition

In Elasticsearch, Doc Values are a columnar storage structure. They are enabled by default for every field type except text, and they are built at index time: when a field is indexed, Elasticsearch adds its values to the inverted index for fast search and, at the same time, stores the field's Doc Values.

In contrast to the inverted index, Doc Values are described as a "forward index".

2.2 Doc Values example

Using the documents from section 1.2 again, the Doc Values structure looks like this (for illustration only):

Doc       Terms
Doc_1     brown, dog, fox, jumped, lazy, over, quick, the
Doc_2     brown, dogs, foxes, in, lazy, leap, over, quick, summer

By transposing the term-to-document relationship, Doc Values avoid the inefficiency and poor scalability of running aggregations over an inverted index.

As the comparison shows, the inverted index maps terms to the documents that contain them, while Doc Values map documents to the terms they contain.
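
A small Python sketch shows why the document-to-terms direction suits aggregations. It assumes a hypothetical forward index shaped like the illustrative table, and a made-up terms_aggregation helper; real Doc Values are an on-disk columnar format, not a Python dictionary.

```python
from collections import Counter

# Hypothetical forward index (doc ID -> terms), mirroring the illustrative table.
doc_values = {
    1: ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    2: ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
}


def terms_aggregation(hit_ids, column):
    """Count terms across the matching documents with one lookup per document."""
    counts = Counter()
    for doc_id in hit_ids:
        counts.update(column[doc_id])
    return counts


# Suppose a query (for example, a match on "brown") returned both documents.
counts = terms_aggregation([1, 2], doc_values)
shared = sorted(term for term, n in counts.items() if n == 2)
print(shared)
```

The cost here grows with the number of matching documents. Answering the same question from an inverted index alone would mean walking every term in the index and checking whether its postings include each hit.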

2.3 Doc Values features

Created at index time

Serialized to disk

Suitable for sorting operations

Stores all values of a single field together in a single data column

Enabled by default for all field types except text

2.4 Doc Values applicable scenarios

Doc Values in Elasticsearch are commonly used in the following scenarios:

Sorting on a field

Aggregating on a field

Certain filters, such as geolocation filtering

Certain scripts that refer to field values

Note:

Because Doc Values are serialized to disk, we can rely on the operating system to keep access fast.

When the working set is much smaller than the node's available memory, the operating system naturally keeps all Doc Values in its page cache, so access is very fast.

When the working set is much larger than the available memory, the operating system pages Doc Values in and out of the page cache as needed. This is slower, but because the data lives outside the JVM heap, it avoids heap out-of-memory errors.

2.5 Notes on using Doc Values

For fields that will never be used for sorting, aggregation, script calculation, or geolocation filtering, consider disabling Doc Values to save storage.

# Disable Doc Values on a keyword field that is only searched, never sorted or aggregated
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}

3. Fielddata

3.1 fielddata definition

As covered in sections 1 and 2:

Search needs to answer the question "Which documents contain this term?", which is what the inverted index is for. Sorting and aggregations need to answer a different question: "What is the value of this field for this document?", which is what the forward index is for.

text fields do not support Doc Values. Instead, they rely on fielddata, a query-time in-memory data structure.

Fielddata is built on demand the first time a text field is used for aggregations, sorting, or in a script.

Implementation: it is built by reading the entire inverted index of each segment from disk, inverting the term ↔ document relationship, and storing the result in memory, in the JVM heap.
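
The inversion step can be pictured with the following Python sketch. The segment contents and the build_fielddata name are hypothetical, and real fielddata is built from Lucene segment files rather than dictionaries; the point is only that every term of every segment has to be visited before the first aggregation can run.

```python
from collections import defaultdict

# Hypothetical per-segment inverted indexes (term -> doc IDs), terms already lowercased.
segments = [
    {"brown": [1], "dog": [1], "fox": [1], "quick": [1]},
    {"brown": [2], "dogs": [2], "foxes": [2], "quick": [2]},
]


def build_fielddata(segment_indexes):
    """Walk every term of every segment and invert term -> docs into doc -> terms."""
    fielddata = defaultdict(list)
    for inverted_index in segment_indexes:
        for term, doc_ids in inverted_index.items():
            for doc_id in doc_ids:
                fielddata[doc_id].append(term)
    # The result lives entirely in memory; nothing is written back to disk.
    return {doc_id: sorted(terms) for doc_id, terms in fielddata.items()}


fielddata = build_fielddata(segments)
print(fielddata[1])
```

Both the build time and the memory footprint grow with the number of distinct terms and documents, which helps explain why fielddata is treated as expensive.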

3.2 fielddata example

Strictly speaking, the illustration in section 2.2 fits fielddata better, because it lists the analyzed terms of a text field.

# Start from a clean index
DELETE test_001

# Enable fielddata on a text field so that it can be aggregated on
PUT test_001
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "standard",
        "fielddata": true
      }
    }
  }
}

POST test_001/_bulk
{"index": {"_id": 1}}
{"body": "The quick brown fox jumped over the lazy dog"}
{"index": {"_id": 2}}
{"body": "Quick brown foxes leap over lazy dogs in summer"}

# Terms aggregation on the text field, limited to documents matching "brown"
GET test_001/_search
{
  "size": 0,
  "query": {
    "match": {
      "body": "brown"
    }
  },
  "aggs": {
    "popular_terms": {
      "terms": {
        "field": "body"
      }
    }
  }
}

3.3 fielddata features

Suitable for per-document operations (aggregation, sorting, scripts), but only for text fields

An in-memory data structure created at query time

Not serialized to disk

Disabled by default (it is expensive to build and is held in the heap)

3.4 fielddata applicable scenarios

Word-frequency statistics over full text

Generating word clouds from full text

Aggregation, sorting, and script calculation on text fields

3.5 fielddata usage considerations

Before enabling fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually does not make sense to do so, because fielddata is very memory-intensive.

If you only need full-text search, you do not need to enable fielddata.

4. The _source field explained

4.1 _source definition

The _source field contains the original JSON document body that was passed at index time.

The _source field itself is not indexed (and is therefore not searchable), but it is stored so that it can be returned when executing fetch requests such as get or search.

4.2 _source usage precautions

First: although it is very convenient, the _source field does incur storage overhead within the index. For this reason, it can be disabled.

# Disable _source for the whole index
PUT my-index-000001
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}

Second: weigh the following before disabling it. With _source disabled, these features are no longer available:

The update, update_by_query, and reindex APIs

Highlighting

Therefore, weigh the storage savings against the needs of your business scenario.

5. The store field explained

5.1 store definition

By default, field values are indexed to make them searchable (the inverted index from section 1), but they are not stored.

This means that the field can be queried, but the original field value cannot be retrieved.

Usually this does not matter: the field value is already part of the _source field, which is stored by default.

If you only want to retrieve the value of a single field or of a few fields, rather than the whole _source, this can usually be achieved with source filtering.

In certain situations, however, it makes sense to store a field separately, and this is where store comes in handy.

5.2 store example

DELETE news-000001

# Disable _source and store only the title and date fields
PUT news-000001
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "date": {
        "type": "date",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}

PUT news-000001/_doc/1
{
  "title": "Some short title",
  "date": "2021-01-01",
  "content": "A very long content field..."
}

# Plain search: hits carry no _source because it is disabled
GET news-000001/_search

# Request the stored fields explicitly
GET news-000001/_search
{
  "stored_fields": ["title", "date"]
}

5.3 store applicable scenarios

As in the example in 5.2, it can make sense to store a field in some cases. For example, suppose the collected news data consists of documents with a title, a date, and a very large content field.

You may want to retrieve only the title and date without having to extract them from the much larger _source field.
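
The trade-off can be modeled with a toy Python sketch. The layout below is illustrative only and is not Lucene's on-disk format; both fetch functions are hypothetical. It contrasts parsing the whole serialized _source to extract two small fields with reading those fields from separately stored values.

```python
import json

# Toy model of one indexed document.
document = {
    "title": "Some short title",
    "date": "2021-01-01",
    "content": "A very long content field...",
}

# _source keeps the whole original JSON body as one serialized value.
source_blob = json.dumps(document)

# Fields mapped with "store": true are additionally kept as separate values.
stored_fields = {name: document[name] for name in ("title", "date")}


def fetch_with_source_filtering(blob, wanted):
    """Parse the full _source, then keep only the requested fields."""
    parsed = json.loads(blob)
    return {name: parsed[name] for name in wanted}


def fetch_stored_fields(stored, wanted):
    """Read the requested fields directly, without touching _source."""
    return {name: stored[name] for name in wanted}


wanted = ["title", "date"]
print(fetch_with_source_filtering(source_blob, wanted))
print(fetch_stored_fields(stored_fields, wanted))
```

Both calls return the same two fields, but only the first has to load and parse the large content value along the way. In real search responses, stored field values come back as arrays under the fields key rather than as plain scalars.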

That concludes this look at the internal data structures of Elasticsearch. Theory sticks best when combined with practice, so try the examples yourself.
