Doc Values in Elasticsearch

2025-01-16 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report

What is Doc Values?

Most NoSQL stores take a similar approach when they maintain multiple indexes: they store the same data a second time in a different layout so that another kind of query becomes efficient. An inverted index is good at finding the documents that contain a given term, but it is poorly suited to the opposite lookup that sorting and aggregation need, namely finding the terms that a given document contains. Doc Values solve this problem by transposing the relationship between the two. Inverted indexes map terms to the documents that contain them, and Doc Values map documents to the terms they contain:

Doc   | Terms
------+--------------------------------------------------------
Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
Doc_3 | dog, dogs, fox, jumped, over, quick, the

Once the data is transposed, collecting all the terms of a document is simply a matter of reading its row. Search therefore uses the inverted index to find documents, and aggregations collect and aggregate values from Doc Values. Elasticsearch relies on the two structures working together.
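The two views can be sketched in a few lines of Python. This is only a toy model of the idea, not how Elasticsearch or Lucene store data; the names `build_inverted_index` and `terms_aggregation` are made up for this illustration, and the corpus is the three-document table above.

```python
from collections import defaultdict

# Toy corpus: document -> terms (the "doc values" direction).
docs = {
    "Doc_1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc_2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
    "Doc_3": ["dog", "dogs", "fox", "jumped", "over", "quick", "the"],
}


def build_inverted_index(doc_terms):
    """Transpose doc -> terms into term -> sorted list of documents."""
    inverted = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


def terms_aggregation(doc_terms, inverted, query_term):
    """Find matching docs via the inverted index, then count their terms
    by reading each matching document's row in the doc -> terms view."""
    counts = defaultdict(int)
    for doc_id in inverted.get(query_term, []):
        for term in doc_terms[doc_id]:
            counts[term] += 1
    return dict(counts)


inverted_index = build_inverted_index(docs)
matching = inverted_index["brown"]                      # search step
agg = terms_aggregation(docs, inverted_index, "brown")  # aggregation step
```

Here the search for "brown" touches only the inverted index, while the term counts are gathered from the per-document rows, which mirrors the division of labor described above.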

Deep understanding of ElasticSearch Doc Values

Doc Values are generated at index time, alongside the inverted index. Like the inverted index, they are built per segment and are immutable. They are also serialized to disk in the same way, which helps both performance and scalability.

Because Doc Values persist their data structure to disk through serialization, they can rely on the operating system's memory (the file system cache) instead of the JVM heap. When the working set is much smaller than the memory available to the system, the operating system keeps the Doc Values resident in memory, so access is very fast. When the working set is much larger than the available memory, the operating system pages Doc Values in from disk and out of memory as needed. This is clearly slower than serving everything from memory, but the size of the data is no longer limited by the server's memory. If the same data were held on the JVM heap, exceeding the heap would crash the program with an OutOfMemory error.
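One common way for a program to let the operating system cache file data outside its own heap is memory mapping. The sketch below illustrates that general idea only; the article does not say how Doc Values are read, and the fixed-width 8-byte layout, the file suffix, and the helper names `write_column` and `read_value` are assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

VALUE_SIZE = 8  # bytes per value; fixed-width 64-bit integers (an assumption)


def write_column(values):
    """Serialize one value per document to a temporary file; return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".col") as f:
        f.write(struct.pack(f"<{len(values)}q", *values))
        return f.name


def read_value(path, doc_id):
    """Read the value for doc_id through a read-only memory map.

    The mapped pages are cached by the operating system, not copied
    wholesale into the process's own heap.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return struct.unpack_from("<q", mapped, doc_id * VALUE_SIZE)[0]


column_path = write_column([100, 1000, 1500, 1200, 300, 1900, 4200])
fourth_value = read_value(column_path, 3)  # value stored for the fourth document
os.remove(column_path)
```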

Doc Values compression

Broadly speaking, Doc Values are a serialized column store, a structure that is well suited to aggregation, sorting, scripting, and similar operations. A column-oriented layout is also easy to compress, especially for numeric types. Compression reduces disk space and speeds up access. Consider the following set of numeric Doc Values:

Doc   | Terms
------+------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200

Notice that every number here is a multiple of 100. Doc Values examine all the values in a segment and use the greatest common divisor to compress the data further. Dividing each number by 100 gives [1, 10, 15, 12, 3, 19, 42]. These numbers are smaller, so each one needs fewer bits to store, which reduces the size on disk.
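The arithmetic of this step can be checked with a short Python sketch. It only demonstrates the divisor idea on the seven values above; it is not the actual on-disk encoding, and `gcd_encode`, `gcd_decode`, and `bits_needed` are hypothetical helper names.

```python
from functools import reduce
from math import gcd


def gcd_encode(values):
    """Return (divisor, quotients) for a non-empty list of integers."""
    divisor = reduce(gcd, values)
    if divisor == 0:  # every value is 0; avoid dividing by zero
        divisor = 1
    return divisor, [v // divisor for v in values]


def gcd_decode(divisor, quotients):
    """Recover the original values from the divisor and the quotients."""
    return [q * divisor for q in quotients]


def bits_needed(values):
    """Bits required to store the largest non-negative value in the list."""
    return max(1, max(values).bit_length())


values = [100, 1000, 1500, 1200, 300, 1900, 4200]
divisor, quotients = gcd_encode(values)

bits_before = bits_needed(values)     # 4200 needs 13 bits
bits_after = bits_needed(quotients)   # 42 needs 6 bits
```

Storing the divisor once and the quotients per document is lossless, since `gcd_decode(divisor, quotients)` returns the original list.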

Doc Values apply the following techniques during compression, checking for each encoding scheme in order:

If all values are identical (or missing), set a flag and record the value

If there are fewer than 256 distinct values, use a simple table encoding

If there are more than 256 distinct values, check whether they share a greatest common divisor

If there is no common divisor, encode every value as an offset from the smallest value
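The order of these checks can be sketched as a small decision function. This is a simplified reading of the list, not the real implementation: it takes the first case to mean a column whose present values are all the same, represents missing values as `None`, and the function name `choose_encoding` and the returned labels are invented for this example.

```python
from functools import reduce
from math import gcd


def choose_encoding(values):
    """Pick an encoding label for a column of integers (None = missing)."""
    present = [v for v in values if v is not None]
    distinct = set(present)

    # Case 1: a single repeated value, or nothing but missing values.
    if len(distinct) <= 1:
        return ("constant", present[0] if present else None)

    # Case 2: few distinct values, so store a lookup table of them.
    if len(distinct) < 256:
        return ("table", sorted(distinct))

    # Case 3: many distinct values that share a common divisor.
    divisor = reduce(gcd, present)
    if divisor > 1:
        return ("gcd", divisor)

    # Case 4: fall back to offsets from the smallest value.
    return ("offset", min(present))


constant_case = choose_encoding([7, 7, None, 7])
table_case = choose_encoding([3, 1, 2, 1])
gcd_case = choose_encoding([i * 100 for i in range(1000)])
offset_case = choose_encoding(list(range(1000, 2000)))
```

Note that under this ordering a small column such as the seven values in the earlier example would fall into the table case before the divisor check is ever reached; the divisor only matters once there are many distinct values.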

String fields are handled in a similar way: each distinct string is assigned a number through an ordinal table, and the resulting numbers are then stored as numeric Doc Values.
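A minimal sketch of that string-to-number step, assuming the table is simply the sorted list of distinct strings and each document stores the position of its string in that list. The helper name `ordinal_encode` is hypothetical.

```python
def ordinal_encode(strings):
    """Return (table, ordinals): the sorted distinct strings and,
    for each input string, its index in that table."""
    table = sorted(set(strings))
    ordinal_of = {term: i for i, term in enumerate(table)}
    return table, [ordinal_of[s] for s in strings]


table, ordinals = ordinal_encode(["fox", "dog", "fox", "brown"])
decoded = [table[o] for o in ordinals]  # look the strings back up
```

The per-document column now contains only small integers, so the numeric compression steps described above can be applied to it.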

Disable Doc Values

Doc Values are enabled by default for all fields except analyzed strings. This means that all numeric, geo, date, IP, and not_analyzed string fields have them turned on by default.

Analyzed strings cannot use Doc Values at present, because analysis produces a large number of tokens per document, which would have a significant impact on performance.

Doc Values are very convenient, but if you know you will never need them for a given field, you can disable them. This saves disk space and may also speed up indexing.

To disable Doc Values, set doc_values: false in the field mapping. For example, the following request creates a new index in which the field "session_id" has Doc Values disabled:

PUT my_index
{"mappings": {"properties": {"session_id": {"type": "string", "index": "not_analyzed", "doc_values": false}}}}

With doc_values: false, this field can no longer be used for aggregation, sorting, or scripting.

Conversely, you can disable the inverted index for a field so that it cannot be searched but can still be sorted on, for example:

PUT my_index
{"mappings": {"properties": {"customer_token": {"type": "string", "doc_values": true, "index": "no"}}}}

With doc_values: true and index: no, we get a field that can only be used for aggregation, sorting, and scripting.
