Case Analysis of Elasticsearch's Scoring Mechanism

Updated: 2025-01-19. Source: SLTechnology News&Howtos, Shulou (Shulou.com), 05/31.

This article walks through a case analysis of the Elasticsearch (es) scoring mechanism. Most people know too little about this topic, so the article is shared here for reference. Let's take a look.

1. An example

Now for a true story!

The story has to begin in the voice of teacher Zhao Zhongxiang: the rainy season has arrived, and it is mating season for the animals once again.

Do you still remember when the writer Liu Liu complained about xx? Pictures or it didn't happen, so here is the screenshot: [screenshot of the search results]

As onlookers, let us analyze it from a professional point of view and stick to the facts.

In terms of the search results themselves, xx returned correct results (yes, it has since been adjusted, and the search is fine now!), because the returned results do contain the search keywords. The complaint, "what on earth is this pile of ads?!", comes from the user's point of view. Clearly the returned results, especially the first few and sometimes even the first few pages, were far from what we wanted.

Furthermore, it can make sense to treat the match between a document and a query as purely binary, and that is what the Baidu search engine returned: yes, found it, or no, did not find it. Results come back, and they do include the ones we want, even if it is not easy to pick them out from a pile of advertisements. It is like a TV drama interspersed among the commercials: you just get used to it! From its own point of view, xx gives extra weight to advertising entries. As for the real results, well, nobody paid for those.

Search engines like xx, which require xx to access, pay more attention to document relevance on top of correctly returning binary matches. If document A is more relevant to the query than document B, then A should rank above B in the results. Combined with other optimizations, all matching results are still returned, but the result the user most expects is likely to come first. Isn't that beautiful?

The process of determining how relevant a document is to a query is called scoring.

2. How document scoring works: TF-IDF

The scoring mechanism of Lucene and es is a formula. It takes the query as input, uses several factors to assess each document, and combines those factors through the formula into the document's final score. Weighing these factors together is how the most relevant documents end up being returned first. In Lucene and es, this relevance is called the score.

Before computing a score, es looks at how often the search term occurs and how common it is. These influence the score in two ways:

The more often a term appears in a document, the more relevant that document is.

The more documents a term appears in, the less relevant the term is!

This is TF-IDF: TF is the term frequency and IDF is the inverse document frequency.

2.1 Term frequency: TF

The first factor in a document's score is the number of times a term appears in the document. For example, in an article about es scoring, the related words are bound to appear many times. When we query for them, we consider that document a better match, so its score is higher.

If you have time to kill, press Ctrl+F and search this page for related keywords (es, score, scoring, and so on).

2.2 Inverse document frequency: IDF

Inverse document frequency is slightly more complex than term frequency: the more documents in the index a term appears in, the less important the term is.

Let's take an example. Suppose the es index contains the following three documents:

    The rules-which require employees to work from 9 am to 9 pm
    In the weeks that followed the creation of 996.ICU in March
    The 996.ICU page was soon blocked on multiple platforms including the messaging tool WeChat and the UC Browser.

The document frequency of the term ICU is 2, because it appears in 2 documents. The inverse document frequency comes from multiplying the score by 1/DF, where DF is the document frequency of the term. This means that the higher a term's document frequency, the lower its weight.

The document frequency of the term the is 3, because it appears in all three documents. Note that although the appears more than once in the last two documents, its document frequency is still 3, because document frequency only checks whether a term appears in a document, not how many times it appears there; counting occurrences is the job of term frequency.
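The counts in this example can be checked with a few lines of Python. This is a minimal sketch: the `tokenize` helper is a hypothetical stand-in for an es analyzer (it lowercases the text and splits on anything that is not a letter or digit), so it only approximates what a real analyzer would produce.

```python
import re
from collections import Counter

DOCS = [
    "The rules-which require employees to work from 9 am to 9 pm",
    "In the weeks that followed the creation of 996.ICU in March",
    "The 996.ICU page was soon blocked on multiple platforms "
    "including the messaging tool WeChat and the UC Browser.",
]


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumerics, a rough
    # approximation of an analyzer, not the real es implementation.
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text):
    # Term frequency: how many times each term occurs in ONE document.
    return Counter(tokenize(text))


def document_frequency(term, docs):
    # Document frequency: in how many documents the term occurs at all.
    return sum(1 for doc in docs if term in set(tokenize(doc)))


print("DF(icu) =", document_frequency("icu", DOCS))
print("DF(the) =", document_frequency("the", DOCS))
print("TF(the) per document =", [term_frequencies(d)["the"] for d in DOCS])
```

The per-document counts of the differ, while its document frequency only records that it occurs in each of the three documents.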

Inverse document frequency is an important factor that balances term frequency. For example, suppose we search for the 996.ICU. The word the appears in almost every document (like "de" in Chinese), and without this balancing its frequency would completely drown out 996.ICU. Inverse document frequency therefore offsets the influence of the common word the, so that the final relevance score more accurately reflects the terms in the query.

Once term frequency and inverse document frequency have been calculated, the TF-IDF formula can be used to compute the document's score.
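To see the balancing effect in numbers, here is a simplified sketch. It assumes the tf and idf definitions of Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))) and deliberately leaves out the coordination factor, query normalization, boosts and length norms, so the absolute values will not match what es reports. The `tokenize` helper is again a hypothetical stand-in for an analyzer.

```python
import math
import re
from collections import Counter

DOCS = [
    "The rules-which require employees to work from 9 am to 9 pm",
    "In the weeks that followed the creation of 996.ICU in March",
    "The 996.ICU page was soon blocked on multiple platforms "
    "including the messaging tool WeChat and the UC Browser.",
]


def tokenize(text):
    # Assumption: a rough approximation of an analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def classic_tf(freq):
    # Term frequency factor: grows with freq, but sub-linearly.
    return math.sqrt(freq)


def classic_idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def simple_score(query, doc, docs):
    doc_terms = Counter(tokenize(doc))
    total = 0.0
    for term in set(tokenize(query)):
        freq = doc_terms[term]
        if freq == 0:
            continue  # term not in this document, contributes nothing
        doc_freq = sum(1 for d in docs if term in tokenize(d))
        # idf is squared because it enters once through the query
        # weight and once through the document weight.
        total += classic_tf(freq) * classic_idf(doc_freq, len(docs)) ** 2
    return total


scores = [simple_score("the 996.ICU", d, DOCS) for d in DOCS]
for doc, score in zip(DOCS, scores):
    print(f"{score:.3f}  {doc[:40]}")
```

The first document contains only the common word the and ends up with a much lower score than the two documents that also contain 996 and ICU.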

3. The Lucene scoring formula

The default Lucene scoring formula discussed above is called TF-IDF, a formula based on term frequency and inverse document frequency. Lucene's practical scoring formula is as follows:
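The formula itself is not reproduced here. As a hedged reconstruction based on the documentation of Lucene's TFIDFSimilarity class, the practical scoring function is usually written as below, where q is the query, d a document and t a query term. The tf and idf definitions on the second line are those of the classic default implementation and are an assumption about which variant is meant:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
  \mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)

\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}
```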

Did you think I was going to walk through this formula in detail?!

All I will say is this: the higher the term frequency, the higher the score; likewise, the rarer the term is in the index, the higher its inverse document frequency. The formula also includes a coordination factor and query normalization. The coordination factor takes into account how many of the query terms were found in a document.

Query normalization is an attempt to make the results of different queries comparable, which is obviously hard.

We call this default scoring method a combination of TF-IDF and the vector space model.

4. Other scoring methods

The practical scoring model that combines TF-IDF with the vector space model is the most mainstream scoring mechanism in es and Lucene, but it is not the only one. Besides TF-IDF, other models include:

Okapi BM25.

Divergence from randomness, that is, DFR similarity.

LM Dirichlet similarity.

LM Jelinek Mercer similarity.

Here is a brief introduction to the main settings of BM25, namely k1, b and discount_overlaps:

k1 and b are numeric settings that adjust how the score is calculated.

k1 controls how important term frequency (TF) is to the score.

b is a value between 0 and 1 that controls how much document length affects the score.

By default, k1 is set to 1.2 and b is set to 0.75.

The discount_overlaps setting tells es whether tokens that occupy the same position in a field should be ignored when normalizing by length. The default value is true.
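The effect of k1 and b is easier to see with numbers. The sketch below uses the tfNorm expression that es prints in its explain output, (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). The function name and the sample field lengths are illustrative assumptions, not part of any es API.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (the tfNorm value in explain output)."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# k1: the contribution of term frequency saturates and never exceeds k1 + 1.
for freq in (1, 2, 5, 50):
    print("freq", freq, "->", round(bm25_tf_norm(freq, 10, 10), 4))

# b: with the default 0.75 a longer field scores lower for the same freq;
# with b=0 the field length has no effect at all.
print("short field:", round(bm25_tf_norm(1, 5, 10), 4))
print("long field: ", round(bm25_tf_norm(1, 50, 10), 4))
print("b=0, long:  ", round(bm25_tf_norm(1, 50, 10, b=0), 4))
```

A larger k1 lets repeated occurrences keep adding to the score for longer, while a b closer to 1 penalizes long fields more strongly.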

5. Configuring the scoring model

5.1 A brief configuration of the BM25 scoring model

BM25 (sounds a bit like PM2.5, doesn't it?!) is a probability-based scoring framework. Here is a brief configuration:

    PUT w2
    {
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type": "text",
              "similarity": "BM25"
            }
          }
        }
      }
    }

    PUT w2/doc/1
    {
      "title": "The rules-which require employees to work from 9 am to 9 pm"
    }

    PUT w2/doc/2
    {
      "title": "In the weeks that followed the creation of 996.ICU in March"
    }

    PUT w2/doc/3
    {
      "title": "The 996.ICU page was soon blocked on multiple platforms including the messaging tool WeChat and the UC Browser."
    }

    GET w2/doc/_search
    {
      "query": {
        "match": {
          "title": "the 996"
        }
      }
    }

The example above specifies the scoring model through the similarity parameter. As for querying, the differences are easier to spot when the data set is fairly large and you try several queries.

5.2 Configuring advanced settings

    PUT w3
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": "ik_smart"
          }
        },
        "similarity": {
          "my_custom_similarity": {
            "type": "BM25",
            "k1": 1.2,
            "b": 0.75,
            "discount_overlaps": false
          }
        }
      },
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type": "text",
              "similarity": "my_custom_similarity"
            }
          }
        }
      }
    }

    PUT w3/doc/1
    {
      "title": "The rules-which require employees to work from 9 am to 9 pm"
    }

    PUT w3/doc/2
    {
      "title": "In the weeks that followed the creation of 996.ICU in March"
    }

    PUT w3/doc/3
    {
      "title": "The 996.ICU page was soon blocked on multiple platforms including the messaging tool WeChat and the UC Browser."
    }

    GET w3/doc/_search
    {
      "query": {
        "match": {
          "title": "the 996"
        }
      }
    }

5.3 Configuring the global scoring model

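If we want to use a particular scoring model and apply it globally, add it to the elasticsearch.yml configuration file:

    index.similarity.default.type: BM25

Note that Elasticsearch 5.0 and later no longer accept index-level settings such as this one in elasticsearch.yml; on those versions the default similarity has to be set in the index settings instead.

6. Boosting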

Boosting is a process used to modify the relevance of documents. There are two types of boosting:

At index time, for example when we define the mappings.

At query time.

Both methods can raise a document's score. Note that a boost applied at index time is stored in the index, so changing it means re-indexing the documents.

6.1 Boosting at index time

Enough talk, here is the code:

    PUT w4
    {
      "mappings": {
        "doc": {
          "properties": {
            "name": {
              "boost": 2.0,
              "type": "text"
            },
            "age": {
              "type": "long"
            }
          }
        }
      }
    }

This is a set-it-once-and-forget-it approach, but it is generally not recommended.

One reason is that once the mapping is established, every name field automatically carries the boost value. To change this value, you must re-index the documents.

Another reason is that the boost value is stored with reduced precision in Lucene's internal index structure. Only one byte is used to store the floating-point value, so precision may be lost when calculating a document's final score.

Finally, the boost is applied per term. If more than one term matches in the boosted field, the boost is applied multiple times, which further increases the weight of the field and may distort the final document score.

Now let's look at the other way.

6.2 Boosting at query time

In es, almost all query types support boost, including, as you might expect, match, multi_match, and so on.

As an example, here is boosting with match queries at query time:

    PUT w5
    {
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "ik_max_word"
            },
            "content": {
              "type": "text",
              "analyzer": "ik_max_word"
            }
          }
        }
      }
    }

    PUT w5/doc/1
    {
      "title": "Lucene is cool",
      "content": "Lucene is cool"
    }

    PUT w5/doc/2
    {
      "title": "Elasticsearch builds on top of lucene",
      "content": "Elasticsearch builds on top of lucene"
    }

    PUT w5/doc/3
    {
      "title": "Elasticsearch rocks",
      "content": "Elasticsearch rocks"
    }

Then run the query:

    GET w5/doc/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": {
                  "query": "elasticsearch rocks",
                  "boost": 2.5
                }
              }
            },
            {
              "match": {
                "content": "elasticsearch rocks"
              }
            }
          ]
        }
      }
    }

As far as the final score is concerned, the boosted query on title has more influence than the query on content. Boost is mainly meaningful when several queries are combined, as in a bool query.
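When such requests are generated from application code, the body can be assembled as a plain dictionary. The following is a minimal sketch: `boosted_should_query` is a hypothetical helper, not part of any es client, and it only builds the JSON body without sending it.

```python
import json


def boosted_should_query(text, field_boosts):
    """Build a bool/should body with one boosted match clause per field."""
    clauses = [
        {"match": {field: {"query": text, "boost": boost}}}
        for field, boost in field_boosts.items()
    ]
    return {"query": {"bool": {"should": clauses}}}


# title counts 2.5 times as much as content (boost 1.0 is the neutral value).
body = boosted_should_query("elasticsearch rocks", {"title": 2.5, "content": 1.0})
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET w5/doc/_search or passed as the request body by whichever client you use.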

6.3 Queries that span multiple fields

Boost can also be used with multi_match queries:

    GET w5/doc/_search
    {
      "query": {
        "multi_match": {
          "query": "elasticsearch rocks",
          "fields": ["title", "content"],
          "boost": 2.5
        }
      }
    }

In addition, a special syntax lets us boost only a specific field: append a ^ symbol and the boost value to the field name to tell es to boost only that field:

    GET w5/doc/_search
    {
      "query": {
        "multi_match": {
          "query": "elasticsearch rocks",
          "fields": ["title^3", "content"]
        }
      }
    }

In the example above, the title field is given a boost of 3.
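The field^boost notation is easy to generate programmatically. This is a small sketch; `multi_match_query` is a hypothetical helper that only builds the request body, and treating a boost of 1 as "no suffix" is a choice made here for readability.

```python
import json


def multi_match_query(text, field_boosts):
    """Build a multi_match body, writing boosts in the field^boost notation."""
    fields = [
        name if boost == 1 else f"{name}^{boost:g}"
        for name, boost in field_boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = multi_match_query("elasticsearch rocks", {"title": 3, "content": 1})
print(json.dumps(body, indent=2))
```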

Note that when using boost, both fields and terms are boosted by a relative value, not by an absolute multiplier.

If boost has the same value for all the terms being searched, it is as if there were no boost at all (obviously: it is like everyone growing one meter taller at the same time!), because Lucene normalizes boost values.

If you boost a field by 4, it does not mean that the score for that field is multiplied by 4. So do not worry if your scores do not follow strict multiplication.

7. Using explain to understand how documents are scored

Nothing is what you think it is! In es, the reason one document matches a query better than another may well differ from what we expected.

In this section, let's look at the internal formulas es and Lucene use to calculate scores.

We set explain to true to ask es to explain why the score is what it is. What shady deal is going on behind the scenes?

For example, let's run a query:

    GET py1/doc/_search
    {
      "query": {
        "match": {
          "title": "Beijing"
        }
      },
      "explain": true,
      "_source": "title",
      "size": 1
    }

Because the result is long, we trim it here: "size": 1 returns a single document, and "_source": "title" returns only the title field.

Look at the results:

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 24,
        "max_score": 4.9223156,
        "hits": [
          {
            "_shard": "[py1][1]",
            "_node": "NRwiP9PLRFCTJA7w3H9eqA",
            "_index": "py1",
            "_type": "doc",
            "_id": "NIjS1mkBuoj17MYtV-dX",
            "_score": 4.9223156,
            "_source": {
              "title": "Why is the embarrassing interruption of uppercase not popular in Beijing?"
            },
            "_explanation": {
              "value": 4.9223156,
              "description": "weight(title:Beijing in 36) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 4.9223156,
                  "description": "score(doc=36,freq=1.0 = termFreq=1.0\n), product of:",
                  "details": [
                    {
                      "value": 4.562031,
                      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details": [
                        {"value": 4.0, "description": "docFreq", "details": []},
                        {"value": 430.0, "description": "docCount", "details": []}
                      ]
                    },
                    {
                      "value": 1.0789746,
                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                        {"value": 1.0, "description": "termFreq=1.0", "details": []},
                        {"value": 1.2, "description": "parameter k1", "details": []},
                        {"value": 0.75, "description": "parameter b", "details": []},
                        {"value": 12.1790695, "description": "avgFieldLength", "details": []},
                        {"value": 10.0, "description": "fieldLength", "details": []}
                      ]
                    }
                  ]
                }
              ]
            }
          }
        ]
      }
    }

In the new _explanation field, you can see that value is 4.9223156. So how is it calculated?

To analyze: the token "Beijing" appears once in the title field, so the TF component is calculated according to "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))", which gives 1.0789746.

What about the inverse document frequency? According to "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))", it comes to 4.562031.

So the final score is:

1.0789746 * 4.562031 = 4.9223155734126

The result is 4.9223156 when rounded.
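The same numbers can be reproduced outside es. The sketch below plugs the values from the explain output (docCount 430, docFreq 4, termFreq 1, fieldLength 10, avgFieldLength 12.1790695, k1 1.2, b 0.75) into the two formulas quoted in the description fields. The function names are illustrative, and small differences in the last digits are expected because es computes with single-precision floats.

```python
import math


def bm25_idf(doc_count, doc_freq):
    # idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm, computed as
    # (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


idf = bm25_idf(doc_count=430, doc_freq=4)
tf_norm = bm25_tf_norm(freq=1.0, field_length=10.0, avg_field_length=12.1790695)
score = idf * tf_norm

print(f"idf     = {idf:.6f}")
print(f"tfNorm  = {tf_norm:.7f}")
print(f"score   = {score:.7f}")
```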

Note that explain adds extra performance overhead to es. It should therefore be used for debugging only and avoided in production environments.
