ES Learning Notes: The Origin of fielddata

2025-01-18 Update From: SLTechnology News&Howtos

The official ES documentation puts the relationship between search and sorting particularly well:

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?"

In other words, search maps a term to the documents that contain it, while sorting and aggregations map a document to the value of one of its fields.

As before, start from the requirement: "sort search results by time" is a very common need in product search and log analysis systems. As is well known, Lucene solves the retrieval problem with an inverted index, so how does it handle the sorting problem?
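The two questions can be sketched side by side. This is a toy model under stated assumptions, not Lucene code: `TwoLookups`, `invert`, and `sortByValue` are hypothetical names, and the "index" is a plain in-memory map. It only shows that filtering by a term needs a term-to-documents lookup, while ordering the hits needs a document-to-value lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; Lucene's real structures are compressed and segment-based.
class TwoLookups {
    // Inverted index: term -> IDs of the documents containing it ("search").
    static Map<String, List<Integer>> invert(String[] docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split(" ")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    // Forward lookup: valueByDoc[docId] is the field value of that document ("sorting").
    static List<Integer> sortByValue(List<Integer> hits, long[] valueByDoc) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Long.compare(valueByDoc[a], valueByDoc[b]));
        return sorted;
    }

    public static void main(String[] args) {
        String[] docs = {"error disk full", "login ok", "error timeout"};
        long[] timestamp = {300L, 100L, 200L};
        List<Integer> hits = invert(docs).get("error");    // documents 0 and 2
        System.out.println(sortByValue(hits, timestamp));  // [2, 0]
    }
}
```

The inverted map alone cannot order the hits by time; the second array is what the rest of this article is about.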

At first, Lucene met this requirement with FieldCache, which builds the mapping from docId to value. But FieldCache has two fatal problems: heap memory consumption and first-load time. If the index is updated frequently, the GC pressure and timeout-induced instability that result from these two problems are a programmer's nightmare.
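A rough sketch of where those two costs come from, assuming a simplified model in which the cache is filled by walking the inverted view of a field on first access. `ToyFieldCache` and its methods are hypothetical and are not Lucene's FieldCache API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a lazily built, heap-resident docId -> value cache.
class ToyFieldCache {
    private final Map<String, long[]> cache = new HashMap<>();
    int loads = 0; // how many times the expensive build ran

    // postings: value -> docIds holding that value (the inverted view of one field).
    long[] get(String field, Map<Long, int[]> postings, int maxDoc) {
        return cache.computeIfAbsent(field, f -> {
            loads++;                            // first-load cost: a full pass over the postings
            long[] values = new long[maxDoc];   // heap cost: one slot per document
            for (Map.Entry<Long, int[]> e : postings.entrySet()) {
                for (int docId : e.getValue()) {
                    values[docId] = e.getKey();
                }
            }
            return values;
        });
    }

    // Models an index update: cached arrays are dropped and must be rebuilt.
    void invalidate() {
        cache.clear();
    }

    public static void main(String[] args) {
        Map<Long, int[]> postings = new HashMap<>();
        postings.put(300L, new int[] {0});
        postings.put(100L, new int[] {1});
        postings.put(200L, new int[] {2});
        ToyFieldCache cache = new ToyFieldCache();
        System.out.println(Arrays.toString(cache.get("timestamp", postings, 3))); // [300, 100, 200]
        cache.invalidate();
        cache.get("timestamp", postings, 3);
        System.out.println(cache.loads); // 2
    }
}
```

Every invalidation discards a large array and triggers another full rebuild, which is the pattern behind the GC and latency complaints.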

Lucene 4.0 introduced a new component, IndexDocValues, which is what we usually call doc_value.

It has two highlights:

1. The doc-to-value mapping is built when the data is indexed. (Note: the inverted index builds the value-to-doc mapping.)
2. Column-oriented storage.

This is essentially the classic "trade space for time" approach combined with on-demand loading. Moreover, column storage is practically standard in efficient NoSQL systems; it shows up in both HBase and Hive.
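The two points can be sketched with a toy column store. This is an illustration under stated assumptions, not how Lucene lays out doc values on disk: `ToyColumnStore` is a hypothetical class, everything lives in memory, and every document is assumed to supply every field so that a position in a column equals the docId.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory column store: one list of values per field.
class ToyColumnStore {
    private final Map<String, List<Long>> columns = new HashMap<>();
    private int docCount = 0;

    // Called at index time: the doc -> value mapping is written as documents arrive.
    // Assumes every document carries every field (no gaps in a column).
    int addDocument(Map<String, Long> fields) {
        for (Map.Entry<String, Long> e : fields.entrySet()) {
            columns.computeIfAbsent(e.getKey(), f -> new ArrayList<>()).add(e.getValue());
        }
        return docCount++;
    }

    // Reading one field touches only that field's column.
    long value(String field, int docId) {
        return columns.get(field).get(docId);
    }

    public static void main(String[] args) {
        ToyColumnStore store = new ToyColumnStore();
        store.addDocument(Map.of("price", 30L, "timestamp", 300L));
        store.addDocument(Map.of("price", 10L, "timestamp", 100L));
        System.out.println(store.value("price", 1));     // 10
        System.out.println(store.value("timestamp", 0)); // 300
    }
}
```

Nothing has to be rebuilt at query time, and a sort on `timestamp` never reads the `price` column.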

Like FieldCache, IndexDocValues answers the question "what is the value for this doc_id?", and it also fixes FieldCache's two problems.

ES builds fielddata on top of doc_value to support sorting and aggregations. Put bluntly, doc_value is the cornerstone of ES aggregations.

So how is fielddata used in ES? Taking the binary type as an example, see org.elasticsearch.index.fielddata.BinaryDVFieldDataTests.

S1: The mapping needs special handling when it is built

    // Every startObject() needs a matching endObject().
    String mapping = XContentFactory.jsonBuilder()
        .startObject()
        .startObject("test")
        .startObject("properties")
        .startObject("field")
        .field("type", "binary")
        .startObject("fielddata").field("format", "doc_values").endObject()
        .endObject()
        .endObject()
        .endObject()
        .endObject()
        .string();

S2: Load the doc_values through the LeafReader

    LeafReaderContext reader = refreshReader();
    IndexFieldData indexFieldData = getForField("field");
    AtomicFieldData fieldData = indexFieldData.load(reader);
    SortedBinaryDocValues bytesValues = fieldData.getBytesValues();

S3: Position on the target document with the setDocument() method.

    /**
     * A list of per-document binary values, sorted
     * according to {@link BytesRef#getUTF8SortedAsUnicodeComparator()}.
     * There might be dups however.
     */
    public abstract class SortedBinaryDocValues {

        /** Positions to the specified document. */
        public abstract void setDocument(int docId);

        /** Return the number of values of the current document. */
        public abstract int count();

        /**
         * Retrieve the value for the current document at the specified index.
         * An index ranges from {@code 0} to {@code count()-1}.
         * Note that the returned {@link BytesRef} might be reused across invocations.
         */
        public abstract BytesRef valueAt(int index);
    }

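Note that if the reader is composite, that is, it has multiple leaves, the index-wide document ID is docBase + the docId within the leaf. The values loaded in S2 belong to a single leaf, so mixing up the two kinds of ID is an easy trap to fall into.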
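The docBase arithmetic can be sketched in plain Java. `DocBaseMath` and its methods are hypothetical helpers written for this note, and the `docBases` array stands in for the `docBase` of each leaf in order; Lucene itself ships a comparable lookup in `ReaderUtil.subIndex`.

```java
// Hypothetical helper: converts between index-wide and leaf-local document IDs.
class DocBaseMath {
    // docBases[i] is the index-wide ID of the first document in leaf i.
    // Assumes the array is sorted ascending and starts at 0.
    static int leafIndex(int globalDocId, int[] docBases) {
        int lo = 0;
        int hi = docBases.length - 1;
        // Binary search for the last leaf whose docBase is <= globalDocId.
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (docBases[mid] <= globalDocId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    static int toLocal(int globalDocId, int[] docBases) {
        return globalDocId - docBases[leafIndex(globalDocId, docBases)];
    }

    static int toGlobal(int leaf, int localDocId, int[] docBases) {
        return docBases[leaf] + localDocId;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 5, 12}; // three leaves holding 5, 7, and some further documents
        System.out.println(leafIndex(7, docBases));   // 1
        System.out.println(toLocal(7, docBases));     // 2
        System.out.println(toGlobal(1, 2, docBases)); // 7
    }
}
```

With a single leaf the docBase is 0 and both IDs coincide, which is why the bug tends to appear only once an index has more than one segment.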

S4: Get the document's values for the field with the valueAt() method.
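Put together, S3 and S4 form a small read loop: position on a document, ask how many values it has, then fetch each one. The sketch below assumes a stand-in class, `ToySortedValues`, that mirrors the three methods of SortedBinaryDocValues but returns `String` instead of `BytesRef` so that it runs without Lucene or ES on the classpath. All class names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ReadValues {
    // Collects all values of one document; docId is local to the leaf.
    static List<String> valuesOf(ToySortedValues values, int localDocId) {
        values.setDocument(localDocId);           // S3: position on the document
        List<String> out = new ArrayList<>();
        for (int i = 0; i < values.count(); i++) {
            out.add(values.valueAt(i));           // S4: read each value
        }
        return out;
    }

    public static void main(String[] args) {
        ToySortedValues values = new ArrayBackedValues(new String[][] {{"a", "b"}, {}, {"c"}});
        System.out.println(valuesOf(values, 0)); // [a, b]
        System.out.println(valuesOf(values, 1)); // []
    }
}

// Hypothetical stand-in with the same three-method shape as SortedBinaryDocValues.
abstract class ToySortedValues {
    abstract void setDocument(int docId);
    abstract int count();
    abstract String valueAt(int index);
}

// Simple array-backed implementation used only for the demonstration.
class ArrayBackedValues extends ToySortedValues {
    private final String[][] valuesByDoc;
    private String[] current = new String[0];

    ArrayBackedValues(String[][] valuesByDoc) {
        this.valuesByDoc = valuesByDoc;
    }

    @Override
    void setDocument(int docId) {
        current = valuesByDoc[docId];
    }

    @Override
    int count() {
        return current.length;
    }

    @Override
    String valueAt(int index) {
        return current[index];
    }
}
```

With the real class, the Javadoc warns that the returned BytesRef may be reused across invocations, so a value that must outlive the loop should be copied first, for example with `BytesRef.deepCopyOf`.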

To sum up, this article briefly described the relationship between Lucene's doc_value and ES's fielddata, outlined the basic idea behind doc_value, and showed the basic way to use fielddata in ES. This is useful when developing your own plugin.
