This article shows how to debug TF-IDF scoring in Elasticsearch 7.x. It is quite practical, and I hope you get something out of it after reading.
Brief introduction
TF-IDF is the classic scoring mechanism in Elasticsearch (inherited from Lucene). The idea it captures is:
The more often a term appears in a document, the more relevant that document is; but the more documents a term appears in, the less discriminative, and therefore less relevant, that term becomes.
TF (term frequency): the number of times a term appears in a document.
For example, take the two documents below. When searching for Elasticsearch, document 2 should score higher: the term frequency of Elasticsearch is 1 in document 1 and 2 in document 2.
1. We will discuss Elasticsearch at the next Big Data group
2. Tuesday the Elasticsearch team will gather to answer questions about Elasticsearch
IDF (inverse document frequency): how many documents in the index contain the term.
For example, take the three documents below; the word "the" appears in every one of them. If we search for "the score" without IDF, the first document returned may not be the most relevant, because the common word "the" would count as much as "score". IDF balances out the relevance of common words.
1. We use Elasticsearch to power the search for our website
2. The developers like Elasticsearch so far
3. The scoring of documents is calculated by the scoring formula
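For reference, a minimal sketch of the classic Lucene TF-IDF weights (field-length norms and boosts omitted; N is the number of documents in the index and df(t) is the number of documents containing term t):

$$ \mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad \mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t) + 1} $$

A term's contribution to a document's _score grows with both factors, so terms that are frequent in the document but rare across the index dominate the ranking.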
How to debug

When the query results do not match your expectations, you can add "explain": true to the search request to see how each score was computed.
GET /full_text_test123/_search
{
  "query": {
    "match": { "content": "Beijing" }
  },
  "explain": true
}
If you know the document ID and want to see why a particular document was or was not matched, you can use the _explain API.
GET /full_text_test123/_explain/1
{
  "query": {
    "match": { "content": "Yunnan Province" }
  }
}

Set the score when querying
Use the boost field to set a scoring factor. boost is most meaningful inside a compound query such as bool (and/or/not-style combinations).
Specify the score in the DSL
The bool query below gives "Beijing" a boost of 10 and "Yunnan Province" a boost of 1.
GET /full_text_test123/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": { "query": "Beijing", "boost": 10 } } },
        { "match": { "content": { "query": "Yunnan Province", "boost": 1 } } }
      ]
    }
  }
}
According to the results of the query, we can see that several towns in Beijing are ranked at the top.
"hits": [{"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "1", "_ score": 7.315132, "_ source": {"title": "Huilongguan Town" "content": "Huilongguan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.0764332591116.3429765651", "clicknum": 102,102, "date": "2019-01-01"}}, {"_ index": "full_text_test123", "_ type": "_ doc" "_ id": "3", "_ score": 7.315132, "_ source": {"title": "Xiaotangshan Town", "content": "Xiaotangshan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.1809900000116.3915700000", "clicknum": 202 "date": "2019-03-03"}}, {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "2", "_ score": 7.156682, "_ source": {"title": "Shahe Town" "content": "Shahe Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.1481760748116.2889957428", "clicknum": 92, "date": "2019-02-02"}}, {"_ index": "full_text_test123", "_ type": "_ doc" "_ id": "6", "_ score": 1.4350727, "_ source": {"title": "Menglun Town", "content": "Menglun Town, Mengla County, Xishuangbanna Autonomous Prefecture, people's Republic of China", "geolocation": "21.9481406582100.4479980469", "clicknum": 330 "date": "2019-10-05"}]
If we use a multi_match query, how do we specify the boost? We can append ^ to each field name. In the example below, the content field has a factor of 3 and title has 4, so documents matching on title rank ahead of those matching only on content.
GET /full_text_test123/_search
{
  "query": {
    "multi_match": {
      "query": "Beijing",
      "fields": ["content^3", "title^4"]
    }
  }
}
The two approaches above are relatively simple. Let's look at the more advanced mechanisms Elasticsearch provides.
Set the score with function_score
Function_score provides five ways to modify the score:
Filter (with weight), which assigns a specified weight to documents that match the filter criteria
Field_value_factor, which uses a field value from the document as a factor in the score; for example, the more comments a post has, the higher its score
Script_score, which uses a script to calculate the score; the script can access a document field with doc['fieldname'], for example Math.log(doc['attendees'].values.size()) * myweight, where myweight is a parameter supplied in the params field at query time
Random_score, which assigns a random score; useful in scenarios where you want results to come back in a different order each time
Decay functions: linear (linear curve), gauss (Gaussian curve), exp (exponential curve)
Filter
Search for documents matching the keywords "Beijing Changping District", and use functions to set the score so that specific documents are prioritized as required.
The difference between weight here and the boost used in the earlier example is that boost is normalized as part of the relevance calculation, while weight simply multiplies the score by a constant.
GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing Changping District" }
      },
      "functions": [
        { "weight": 2, "filter": { "match": { "content": "Shahe Town" } } },
        { "weight": 3, "filter": { "match": { "content": "Huilongguan Town" } } }
      ],
      "score_mode": "max",
      "boost_mode": "replace"
    }
  }
}
Result
"hits": [{"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "1", "_ score": 3.0, "_ source": {"title": "Huilongguan Town" "content": "Huilongguan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.0764332591116.3429765651", "clicknum": 102,102, "date": "2019-01-01"}}, {"_ index": "full_text_test123", "_ type": "_ doc" "_ id": "2", "_ score": 3.0, "_ source": {"title": "Shahe Town", "content": "Shahe Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.1481760748116.2889957428", "clicknum": 92 "date": "2019-02-02"}}, {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "3", "_ score": 1.0, "_ source": {"title": "Xiaotangshan Town" "content": "Xiaotangshan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.1809900000116.3915700000", "clicknum": 202, "date": "2019-03-03"}}]
Score_mode controls how the scores from the individual functions are combined, with the following settings:
Multiply, default
Sum
Avg
First
Max
Min
Boost_mode controls how the score produced by the functions is combined with the original query score (here, the score from the content: "Beijing Changping District" match), with the following settings (a sketch follows the list):
Sum, the sum of the query _score and the function score
Max, the larger of the query _score and the function score
Min, the smaller of the query _score and the function score
Replace, the function score replaces the original query _score
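As a minimal sketch based on the earlier weight/filter query (not one of the article's own examples), switching both modes to sum means the weights of the matching functions are added together, and that total is then added to the original query _score:

GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing Changping District" }
      },
      "functions": [
        { "weight": 2, "filter": { "match": { "content": "Shahe Town" } } },
        { "weight": 3, "filter": { "match": { "content": "Huilongguan Town" } } }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}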
Field_value_factor
Field_value_factor uses a field value from the document as a factor that affects the score.
Field: specifies which field name in the document participates in the calculation
Factor: the multiplier applied to the field value (here, the click count)
Modifier: the function applied to the field value when computing the score; the default is none, and the available options are log, log1p, log2p, ln, ln1p, ln2p, square, sqrt and reciprocal
The higher the click rate, the higher the priority.
GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing" }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "clicknum",
            "factor": 2.2,
            "modifier": "log1p"
          }
        }
      ],
      "score_mode": "max",
      "boost_mode": "replace"
    }
  }
}
Result
[{"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "3", "_ score": 2.6487503, "_ source": {"title": "Xiaotangshan Town", "content": "Xiaotangshan Town, Changping District, Beijing, people's Republic of China" "geolocation": "40.1809900000116.3915700000", "clicknum": 202, "date": "2019-03-03"}}, {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "1", "_ score": 2.352954 "_ source": {"title": "Huilongguan Town", "content": "Huilongguan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.0764332591116.3429765651", "clicknum": 102, "date": "2019-01-01"}} {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "2", "_ score": 2.308351, "_ source": {"title": "Shahe Town", "content": "Shahe Town, Changping District, Beijing, people's Republic of China" "geolocation": "40.1481760748116.2889957428", "clicknum": 92, "date": "2019-02-02"}}] script_score
Script_score uses a script to calculate the score. The script can access the value of a document field with doc['fieldname'], for example Math.log(doc['attendees'].values.size()) * myweight, where myweight is a parameter supplied in the params field at query time.
Source: the script body (in 7.x the script is specified under source, a slight difference from earlier versions)
Params: an object holding our custom parameters
GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing" }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "doc['clicknum'].value * params.myweight",
              "params": { "myweight": 3 }
            }
          }
        }
      ],
      "score_mode": "max",
      "boost_mode": "replace"
    }
  }
}
Result
[{"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "3", "_ score": 606.0, "_ source": {"title": "Xiaotangshan Town", "content": "Xiaotangshan Town, Changping District, Beijing, people's Republic of China" "geolocation": "40.1809900000116.3915700000", "clicknum": 202, "date": "2019-03-03"}}, {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "1", "_ score": 306.0 "_ source": {"title": "Huilongguan Town", "content": "Huilongguan Town, Changping District, Beijing, people's Republic of China", "geolocation": "40.0764332591116.3429765651", "clicknum": 102, "date": "2019-01-01"}} {"_ index": "full_text_test123", "_ type": "_ doc", "_ id": "2", "_ score": 276.0, "_ source": {"title": "Shahe Town", "content": "Shahe Town, Changping District, Beijing, people's Republic of China" "geolocation": "40.1481760748116.2889957428", "clicknum": 92, "date": "2019-02-02"}}] random_score
Random_score randomly assigns a score. It is useful in scenarios where you want the results to come back in a different order each time.
Seed: the seed for the random number generator. If two queries use the same seed (and field), they return the same ordering.
The official documentation explains seed and field as follows:
The random_score generates scores that are uniformly distributed from 0 up to but not including 1. By default, it uses the internal Lucene doc ids as a source of randomness, which is very efficient but unfortunately not reproducible since documents might be renumbered by merges.
In case you want scores to be reproducible, it is possible to provide a seed and field. The final score will then be computed based on this seed, the minimum value of field for the considered document and a salt that is computed based on the index name and shard id so that documents that have the same value but are stored in different indexes get different scores. Note that documents that are within the same shard and have the same value for field will however get the same score, so it is usually desirable to use a field that has unique values for all documents. A good default choice might be to use the _seq_no field, whose only drawback is that scores will change if the document is updated since update operations also update the value of the _seq_no field.
Note that you can set seed without setting field, but this is not recommended because it requires loading fielddata on the _id field, which consumes a lot of memory.
GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing" }
      },
      "functions": [
        {
          "random_score": {
            "seed": 314159265359,
            "field": "_seq_no"
          }
        }
      ],
      "score_mode": "max",
      "boost_mode": "replace"
    }
  }
}

Decay functions
Decay (attenuation) functions score a document according to how far the value of a numeric field is from an origin given by the user, similar in spirit to a range query.
Decay functions provide three kinds of curves (linear for a linear curve, gauss for a Gaussian curve, exp for an exponential curve). All three decay types accept the same four configuration parameters.
Origin, the center point of the curve, i.e. the ideal value of the field. A document whose field value falls on the origin gets the full _score, the highest score the user wants. In the example of calculating the distance between "Xiaoming" and a library, the origin is Xiaoming's current position; in the example of measuring how far a date is from today, the origin is today.
Offset, a non-zero offset around the origin so that a whole range (origin ± offset), not just the single origin value, receives the full _score. The default is 0.
Scale: the decay rate, i.e. how quickly the _score drops as a document's field value moves away from the origin.
Decay: the score a document receives when its field value has moved a distance of scale away from the origin (plus offset).
In the diagram of the three curves, the origin (center point) is 40 and the offset is 5, so values in the range 40 ± 5 receive the full score; a query sketch follows.
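As a minimal sketch (assuming the date field of the example index is mapped as a date type; the origin and ranges below are illustrative, not from the article), a gauss decay that gives the full score within 7 days of the origin date and drops to 0.5 at a further 30 days could look like this:

GET full_text_test123/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "Beijing" }
      },
      "functions": [
        {
          "gauss": {
            "date": {
              "origin": "2019-03-03",
              "offset": "7d",
              "scale": "30d",
              "decay": 0.5
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}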