What is the concept and usage of Term Vector in Java

2025-04-02 Update From: SLTechnology News&Howtos


This article explains what a term vector is and how to use it, working through Elasticsearch's term_vector mapping parameter and its _termvectors API.

What is a term vector?

Whenever a document is indexed, Elasticsearch stores its forward and inverted index data. If a field's mapping sets the term_vector parameter, Elasticsearch also computes and stores the analysis details of that field: the terms the field was split into, each term's document frequency (DF) and total term frequency (TTF), and the position and offset of each term. This information is collectively called the term vector. The term_vector parameter accepts these values:

no: does not store term vector information (the default)

yes: stores only the terms of the field, without position or offset information

with_positions: stores terms and position information

with_offsets: stores terms and offset information

with_positions_offsets: stores the complete term vector information, including terms, positions, and offsets
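As a minimal sketch of how these values might be guarded in client code, the helper below builds a mapping fragment for a text field and rejects anything outside the list above. The function name text_field_mapping is hypothetical and not part of any Elasticsearch client.

```python
# The term_vector values listed in this article.
TERM_VECTOR_VALUES = (
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
)


def text_field_mapping(term_vector="no"):
    """Return a mapping fragment for a text field (hypothetical helper)."""
    if term_vector not in TERM_VECTOR_VALUES:
        raise ValueError(f"unsupported term_vector value: {term_vector!r}")
    mapping = {"type": "text"}
    if term_vector != "no":
        # "no" is the default, so it can be left out of the mapping.
        mapping["term_vector"] = term_vector
    return mapping


print(text_field_mapping("with_positions_offsets"))
```

The values are case-sensitive identifiers, so a capitalized spelling such as "With_offsets" is rejected by this check.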

Term vector information can be generated in two ways: index-time and query-time. Index-time generates the term vector information when the document is indexed; query-time generates it on the fly while the request is being handled. The former trades space for time, and the latter trades time for space.

What is the purpose of term vectors?

A term vector is essentially a data exploration tool (think of it as a debugging aid). It records the details of the terms produced when a field of a document is analyzed: how many terms the field was split into, where each term sits in the field, and each term's DF and TTF values. It is typically used to troubleshoot suspected data problems, for example when sorting or search results differ from what you expect and you need to find the root cause. With this tool you can inspect the data by hand to help pin down the cause.

Read term vector information

Let's look at a complete term vector response:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 3
      },
      "terms": {
        "elasticsearch": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 24
            }
          ]
        },
        "hello": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "java": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}

This is a complete term vector response. Term vectors are computed per field, and each field's entry consists of three parts:

Field statistics

Term statistics

Term information

Field statistics

Field statistics describe all documents under the given index and type, not just the current one: they aggregate every term of this field across those documents.

sum_doc_freq (sum of document frequencies): the sum of the document frequencies of all terms in this field.

doc_count (document count): how many documents contain this field; some documents may not have it.

sum_ttf (sum of total term frequencies): the sum of the total term frequencies of all terms in this field.
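To make the three field statistics concrete, here is a small sketch that computes them over an in-memory list of documents. It assumes a plain lowercase whitespace split in place of a real analyzer, and the function name field_statistics is only illustrative.

```python
from collections import Counter


def field_statistics(docs, field):
    """Compute sum_doc_freq, doc_count and sum_ttf for one field."""
    doc_count = 0
    doc_freq = Counter()  # term -> number of documents containing it
    ttf = Counter()       # term -> occurrences across all documents
    for doc in docs:
        if field not in doc:
            continue      # documents without the field are not counted
        doc_count += 1
        tokens = doc[field].lower().split()
        ttf.update(tokens)
        doc_freq.update(set(tokens))
    return {
        "sum_doc_freq": sum(doc_freq.values()),
        "doc_count": doc_count,
        "sum_ttf": sum(ttf.values()),
    }


single = [{"text": "hello java elasticsearch"}]
print(field_statistics(single, "text"))

several = single + [{"text": "hello hello world"}, {"title": "no text field"}]
print(field_statistics(several, "text"))
```

With the single document, the result matches the field_statistics of the example response (3, 1, 3). With the second list, doc_count is 2 because the third document has no text field, sum_doc_freq is 5, and sum_ttf is 6.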

Term statistics

In the example, hello is one of the terms produced by analyzing the text field of the current document. Term statistics are returned only when the request sets term_statistics to true.

doc_freq (document frequency): how many documents contain this term.

ttf (total term frequency): how many times this term appears across all documents.

term_freq (term frequency in the field): how many times this term appears in this field of the current document.
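The difference between the three per-term numbers is easiest to see on a tiny corpus. The sketch below is an illustration only: it tokenizes with a lowercase whitespace split rather than a real analyzer, and term_statistics is a hypothetical name.

```python
from collections import Counter


def term_statistics(current_text, all_texts):
    """Return doc_freq, ttf and term_freq for each term of current_text.

    all_texts is expected to include current_text itself.
    """
    doc_freq = Counter()
    ttf = Counter()
    for text in all_texts:
        tokens = text.lower().split()
        ttf.update(tokens)            # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    term_freq = Counter(current_text.lower().split())
    return {
        term: {"doc_freq": doc_freq[term], "ttf": ttf[term], "term_freq": freq}
        for term, freq in term_freq.items()
    }


texts = ["hello java elasticsearch", "hello hello world"]
stats = term_statistics(texts[1], texts)
print(stats["hello"])
```

For the second text, hello has doc_freq 2 (it occurs in both documents), ttf 3 (three occurrences overall), and term_freq 2 (two occurrences in the current document).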

Term information

Term information is the content of tokens in the example; tokens is an array.

position: the position of this term in the field's sequence of tokens. If the same term occurs more than once, tokens contains one entry per occurrence.

start_offset: the character offset in the field at which this term starts.

end_offset: the character offset in the field at which this term ends (one past its last character, as hello spanning 0 to 5 shows).
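The positions and offsets in the example response can be reproduced with a few lines of code. This is a rough sketch that splits on word characters with a regular expression, which only approximates what the standard analyzer does; token_info is a hypothetical helper.

```python
import re


def token_info(text):
    """Map each lowercased term to its list of position/offset records."""
    terms = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        terms.setdefault(term, []).append({
            "position": position,           # index in the token sequence
            "start_offset": match.start(),  # first character of the term
            "end_offset": match.end(),      # one past the last character
        })
    return terms


info = token_info("hello java elasticsearch")
print(info["elasticsearch"])
# [{'position': 2, 'start_offset': 11, 'end_offset': 24}]
```

A term that occurs twice, as in "hello hello", gets two records under the same key, which mirrors the multiple entries in tokens described above.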

Term vector use case

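Create an index named music with a type named children. The content field stores its term vectors at index time; the fullname field has no term_vector setting, so its term vectors are computed at query time.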

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "store": true,
          "analyzer": "standard"
        },
        "fullname": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
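For readers who prefer to script this step, the following sketch builds the same PUT request with Python's standard library. It only constructs the request and does not send it. The address http://localhost:9200 is an assumption about a local test node, and build_request is a hypothetical helper.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local test node


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for Elasticsearch."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


mapping = {
    "mappings": {
        "children": {
            "properties": {
                "content": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "store": True,
                    "analyzer": "standard",
                },
                "fullname": {"type": "text", "analyzer": "standard"},
            }
        }
    }
}

request = build_request("PUT", "/music", mapping)
print(request.get_method(), request.full_url)
```

Against a running node, the request could then be sent with urllib.request.urlopen(request). The typed mapping (children) follows the article's example and the Elasticsearch version it was written for.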

Add three sample documents

PUT /music/children/1
{
  "fullname": "Jean Ritchie",
  "content": "Love Somebody"
}

PUT /music/children/2
{
  "fullname": "John Smith",
  "content": "wake me, shark me..."
}

PUT /music/children/3
{
  "fullname": "Peter Raffi",
  "content": "brush your teeth"
}

Request the term vectors of the document whose id is 1

GET /music/children/1/_termvectors
{
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

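The response has the same structure as the term vector example above, with content as the field name. Because field_statistics describes the field across documents rather than a single document, requests for each of the three document ids return the same field_statistics, provided the documents live on the same shard (these statistics are computed per shard).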
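When reading such a response by hand, it often helps to list the tokens in the order they appeared in the field. The sketch below does that for a trimmed copy of the example response from earlier in the article; summarize_tokens is a hypothetical helper.

```python
def summarize_tokens(response, field):
    """Return (term, position, start_offset, end_offset) in position order."""
    terms = response["term_vectors"][field]["terms"]
    rows = [
        (term, token["position"], token["start_offset"], token["end_offset"])
        for term, info in terms.items()
        for token in info.get("tokens", [])
    ]
    return sorted(rows, key=lambda row: row[1])


# Trimmed copy of the example response shown earlier.
response = {
    "term_vectors": {
        "text": {
            "terms": {
                "elasticsearch": {
                    "tokens": [{"position": 2, "start_offset": 11, "end_offset": 24}]
                },
                "hello": {
                    "tokens": [{"position": 0, "start_offset": 0, "end_offset": 5}]
                },
                "java": {
                    "tokens": [{"position": 1, "start_offset": 6, "end_offset": 10}]
                },
            }
        }
    }
}

rows = summarize_tokens(response, "text")
for row in rows:
    print(row)
```

The terms object is keyed by term, so the response itself lists terms alphabetically; sorting by position recovers the original token order hello, java, elasticsearch.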

Common usage of term vector

In addition to the standard request in the previous section, several parameters extend what a term vector query can do.

doc parameter

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Peter Raffi",
    "content": "brush your teeth"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

This request runs term vector analysis on the document supplied in doc instead of on a stored document. The content of doc can be anything you like, which makes this particularly useful.

per_field_analyzer parameter

Specifies the analyzer to use for each field being explored.

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Jimmie Davis",
    "content": "you are my sunshine"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "content": "standard"
  }
}
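The keys of per_field_analyzer are field names, so an override keyed by a name that does not occur in the supplied document cannot apply to any of its fields. The sketch below builds the request body and rejects such keys up front. The helper name termvectors_body and the check itself are illustrative assumptions, not part of Elasticsearch.

```python
def termvectors_body(doc, fields, per_field_analyzer=None):
    """Build a _termvectors request body for an artificial document."""
    body = {
        "doc": doc,
        "fields": list(fields),
        "offsets": True,
        "positions": True,
        "term_statistics": True,
        "field_statistics": True,
    }
    if per_field_analyzer:
        # Reject analyzer overrides for fields the document does not have.
        unknown = sorted(set(per_field_analyzer) - set(doc))
        if unknown:
            raise ValueError(f"per_field_analyzer names unknown fields: {unknown}")
        body["per_field_analyzer"] = dict(per_field_analyzer)
    return body


doc = {"fullname": "Jimmie Davis", "content": "you are my sunshine"}
body = termvectors_body(doc, ["content"], {"content": "standard"})
print(body["per_field_analyzer"])
```

Passing {"text": "standard"} with this document raises ValueError, because the document has no field named text.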

filter parameter

Filters the terms returned in the term vector response.

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Jimmie Davis",
    "content": "you are my sunshine"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

The filter selects which terms to return based on their term statistics. This is useful when exploring data: for example, you can filter out terms whose frequency is too low.

docs parameter
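The effect of the three thresholds can be approximated locally on a dictionary of term statistics. This is a simplified sketch, not Elasticsearch's implementation: Elasticsearch selects terms by tf-idf score, whereas this stand-in ranks by term_freq and breaks ties alphabetically. The name apply_filter and the sample numbers are made up for illustration.

```python
def apply_filter(terms, max_num_terms=None, min_term_freq=1, min_doc_freq=1):
    """Keep terms that meet the minimums, then cap the number returned."""
    kept = {
        term: stats
        for term, stats in terms.items()
        if stats["term_freq"] >= min_term_freq and stats["doc_freq"] >= min_doc_freq
    }
    # Stand-in ranking: highest term_freq first, then alphabetical.
    ranked = sorted(kept, key=lambda term: (-kept[term]["term_freq"], term))
    if max_num_terms is not None:
        ranked = ranked[:max_num_terms]
    return {term: kept[term] for term in ranked}


# Made-up statistics for the terms of "you are my sunshine".
terms = {
    "you": {"term_freq": 1, "doc_freq": 5},
    "are": {"term_freq": 1, "doc_freq": 4},
    "my": {"term_freq": 1, "doc_freq": 3},
    "sunshine": {"term_freq": 2, "doc_freq": 1},
}

print(list(apply_filter(terms, max_num_terms=3)))
print(list(apply_filter(terms, min_doc_freq=4)))
```

The first call keeps three terms (sunshine, are, my); the second drops every term that appears in fewer than four documents, leaving are and you.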

The _mtermvectors endpoint takes a docs array, so you can explore several documents in one request. Each entry can carry its own options.

GET /_mtermvectors
{
  "docs": [
    {
      "_index": "music",
      "_type": "children",
      "_id": "2",
      "term_statistics": true
    },
    {
      "_index": "music",
      "_type": "children",
      "_id": "1",
      "fields": [
        "content"
      ]
    }
  ]
}
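When the list of ids comes from code, the docs array can be generated instead of being written out by hand. A minimal sketch, assuming every entry shares the same index, type, and options; mtermvectors_body is a hypothetical helper.

```python
def mtermvectors_body(index, doc_type, ids, **options):
    """Build a _mtermvectors body with one docs entry per document id."""
    return {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": str(doc_id), **options}
            for doc_id in ids
        ]
    }


multi = mtermvectors_body("music", "children", [2, 1], term_statistics=True)
for entry in multi["docs"]:
    print(entry)
```

Unlike the hand-written request above, this sketch applies the same options to every entry; per-document options would need a list of dictionaries instead of keyword arguments.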

Recommendations for using term vector

There are two ways to obtain term vector information: generate it at index time, as in the use case above, or generate it at query time.

Index-time: configured in the mapping, so the term and field statistics are generated while indexing. Setting term_vector to with_positions_offsets roughly doubles the size of the field's index compared with not storing term vectors.

Query-time: no term vector information is stored in advance. When you request the term vectors, Elasticsearch computes the statistics on the fly and returns them.

Which approach to choose depends on how you expect to use term vectors. Query-time is more common: the tool mainly serves troubleshooting, so computing on the fly is fine.

At this point you should have a clearer understanding of what term vectors are and how to use them. The best way to consolidate it is to try these requests yourself.
