Shulou (Shulou.com), SLTechnology News & Howtos, 2025-01-28 Update
This article explains the ElasticSearch relevance scoring mechanism, which many users may not know in detail. I hope you get something useful out of it.
ElasticSearch 2.3 uses TF-IDF relevance scoring by default for full-text search. In practice, we use multi_match to set weights for each field, use should clauses to boost specific documents, or use the more advanced function_score to compute scores. With Elasticsearch's explain feature, we can study the mechanism in depth.
Create an index:

PUT /gino_test
{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "type_as_payload"]
        }
      }
    }
  }
}
Insert test data:
Simple case: single-field match scoring

POST gino_test/_search
{
  "explain": true,
  "query": {
    "match": {
      "text": "my cup"
    }
  }
}
Query result: score_simple.json
Scoring analysis:
The default relevance score used by ElasticSearch is Lucene's TF-IDF scoring.
Let's take a closer look at this formula:
score(q, d) = queryNorm(q) · coord(q, d) · ∑_{t in q} ( tf(t, d) · idf(t)² · t.getBoost() · norm(t, d) )
score(q, d) is the relevance score between the query input q and the current document d.
queryNorm(q) is the normalization factor of the query input; its purpose is to keep the final score from growing too large, so that scores are comparable to some extent.
coord(q, d) is the coordination factor, indicating the proportion of the input terms that the document matches.
tf(t, d) is the frequency with which an input term appears in the document: the higher the frequency, the higher the score.
idf(t) is the inverse document frequency of an input term. Its calculation has nothing to do with the current document; it depends only on how often the term appears in the index. The rarer the term, the higher the score.
t.getBoost() is the weight specified in the query.
norm(t, d) is a weight based on the number of terms in the current document's field. It is calculated at index time, and because of how it is stored, its final value is a multiple of 0.125.
Note: during the calculation, the statistics involved are taken from the shard the document lives in, not from the entire index.
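The formula above can be sketched in a few lines of Python. This is a minimal, illustrative model of Lucene's classic TF-IDF scoring, not Elasticsearch code: the function names, the toy corpus statistics, and the simplified norm handling (no lossy one-byte encoding) are all assumptions for the sake of demonstration.

```python
import math

def idf(doc_freq, doc_count):
    # Lucene classic idf: 1 + ln(doc_count / (doc_freq + 1))
    return 1.0 + math.log(doc_count / (doc_freq + 1.0))

def tf(term_freq):
    # Lucene classic tf: square root of the raw term frequency
    return math.sqrt(term_freq)

def field_norm(num_terms):
    # lengthNorm = 1 / sqrt(number of terms in the field); real Lucene
    # stores this in one byte, which is why explain output shows lossy values
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, field_terms, doc_count, doc_freqs, boost=1.0):
    matched = [t for t in query_terms if t in field_terms]
    coord = len(matched) / len(query_terms)        # coordination factor
    # queryNorm = 1 / sqrt(sum over query terms of (idf * boost)^2)
    sum_sq = sum((idf(doc_freqs[t], doc_count) * boost) ** 2
                 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq else 0.0
    total = sum(tf(field_terms.count(t))
                * idf(doc_freqs[t], doc_count) ** 2
                * boost
                * field_norm(len(field_terms))
                for t in matched)
    return query_norm * coord * total

# Toy shard: 3 documents; "my" appears in 2 of them, "cup" in 1
doc_freqs = {"my": 2, "cup": 1}
s = score(["my", "cup"], ["my", "blue", "cup"], doc_count=3,
          doc_freqs=doc_freqs)
print(round(s, 4))
```

Running this shows the effect of each factor: the rarer term "cup" contributes more via idf², and a document matching only one of the two query terms is penalized by coord.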
score(q, d) = _score(q, d.f)                                                              ... ①
            = queryNorm(q) · coord(q, d) · ∑( tf(t, d) · idf(t)² · t.getBoost() · norm(t, d) )
            = coord(q, d.f) · ∑ _score(q.ti, d.f)   [ti in q]                             ... ②
            = coord(q, d.f) · ( _score(q.t1, d.f) + _score(q.t2, d.f) )
① The relevance score is actually computed between the query and one field of a document, not the document as a whole.
② Expanding the formula, the score becomes the sum of the relevance of each query term against the field; terms that do not match are accounted for by the coord factor.
Multi_match multi-field scoring (best_fields mode):

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "fields": ["text^8", "fullname^5"]
    }
  }
}
Query result: score_bestfields.json
Scoring analysis:
score(q, d) = max( _score(q, d.fi) )
            = max( _score(q, d.f1), _score(q, d.f2) )
            = max( coord(q, d.f1) · ( _score(q.t1, d.f1) + _score(q.t2, d.f1) ),
                   coord(q, d.f2) · ( _score(q.t1, d.f2) + _score(q.t2, d.f2) ) )
For the multi-field best_fields mode, the query is scored separately against each field, and the max is taken as the final score.
When calculating the query weight, the field boost is multiplied in; the field boost is also applied when calculating fieldNorm.
The default operator is or; with and, the scoring mechanism is the same, but the search results differ.
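The best_fields combination can be sketched as follows. This is an illustrative model only: the per-term scores are made-up numbers standing in for the _score values seen in explain output, and the helper names are assumptions.

```python
# best_fields: score each field against the whole query (coord times the
# sum of per-term scores), then take the highest field score.

def field_score(per_term_scores, n_query_terms):
    coord = len(per_term_scores) / n_query_terms   # fraction of terms matched
    return coord * sum(per_term_scores)

def best_fields(per_field, n_query_terms=2):
    # per_field: {field: [score of each matched query term on that field]}
    return max(field_score(terms, n_query_terms)
               for terms in per_field.values())

best = best_fields({
    "text":     [0.9, 0.3],   # both "gino" and "cup" match text
    "fullname": [0.4],        # only "gino" matches fullname
})
print(best)
```

Here the text field wins with 1.0 · (0.9 + 0.3) = 1.2, beating fullname's 0.5 · 0.4 = 0.2.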
Multi_match multi-field scoring (cross_fields mode):

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "type": "cross_fields",
      "fields": ["text^8", "fullname^5"]
    }
  }
}
Query result: score_crossfields.json
Scoring analysis:
score(q, d) = ∑ _score(q.ti, d.f)
            = _score(q.t1, d.f) + _score(q.t2, d.f)
            = max( coord(q.t1, d.f) · _score(q.t1, d.f1), coord(q.t1, d.f) · _score(q.t1, d.f2) )
            + max( coord(q.t2, d.f) · _score(q.t2, d.f1), coord(q.t2, d.f) · _score(q.t2, d.f2) )
In cross_fields mode, coord(q.t1, d.f) indicates the proportion of fields that match the given term (e.g. gino); in best_fields mode, coord(q, d.f1) indicates the proportion of query terms (e.g. gino and cup) matched by a specific field (e.g. the text field).
For the multi-field cross_fields mode, each query term is scored individually (each term gets a best_fields-style score, i.e. whichever field matches it best wins), and the per-term results are then summed.
The default operator is or; with and, the scoring mechanism is the same, but the search results differ. Here is a result using operator or: score_crossfields_or.json
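The cross_fields combination above can be sketched in the same style: per term, take the best field; then sum over terms. The scores are illustrative stand-ins for explain output, not real values.

```python
# cross_fields: for each query term take the max score across fields,
# then sum over the query terms.

def cross_fields(per_term_per_field):
    # per_term_per_field: {term: {field: score of that term on that field}}
    return sum(max(field_scores.values())
               for field_scores in per_term_per_field.values())

total = cross_fields({
    "gino": {"text": 0.3, "fullname": 0.8},  # fullname wins for "gino"
    "cup":  {"text": 0.6, "fullname": 0.0},  # text wins for "cup"
})
print(total)
```

Note the contrast with best_fields: there the max is taken over whole-query field scores; here the max is taken per term, so different terms may be "served" by different fields.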
Boosting the score with should clauses
To test this, add a tags field to gino_test/tweet.
PUT /gino_test/_mapping/tweet
{
  "properties": {
    "tags": {
      "type": "string",
      "analyzer": "fulltext_analyzer"
    }
  }
}
Add tags values to the test documents, then query:
POST /gino_test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": {
            "multi_match": {
              "query": "gino cup",
              "fields": ["text^8", "fullname^5"],
              "type": "best_fields",
              "operator": "or"
            }
          },
          "should": [
            { "term": { "tags": { "value": "goods", "boost": 6 } } },
            { "term": { "tags": { "value": "hobby", "boost": 3 } } }
          ]
        }
      }
    }
  }
}
Query result: score_should.json
Scoring analysis:
After adding the should boosts, each matching should clause simply contributes one more scoring component; the scoring otherwise proceeds as in the calculations above.
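The effect of the should clauses can be sketched as simple addition. The numbers below are illustrative, not real explain output.

```python
# bool scoring sketch: matching should clauses add their (boosted) term
# scores on top of the must score; non-matching should clauses add nothing.

def bool_score(must_score, matched_should_scores):
    return must_score + sum(matched_should_scores)

# Suppose the must (multi_match) clause scored 1.2, tags:goods matched
# and contributed 0.9 after its boost of 6, and tags:hobby did not match.
print(bool_score(1.2, [0.9]))
```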
Function_score advanced scoring mechanism
DSL format:
{
  "function_score": {
    "query": {},
    "boost": "boost for the whole query",
    "functions": [
      { "filter": {}, "FUNCTION": {}, "weight": number },
      { "FUNCTION": {} },
      { "filter": {}, "weight": number }
    ],
    "max_boost": number,
    "score_mode": "(multiply | max | ...)",
    "boost_mode": "(multiply | replace | ...)",
    "min_score": number
  }
}
Five types of FUNCTION are supported:
script_score: a custom, advanced scoring mechanism. The fields involved can only be numeric.
weight: a weight score, generally used together with filter to multiply the score of documents that meet certain conditions.
random_score: generates a random score, e.g. to randomly shuffle the sort order per uid.
field_value_factor: scores according to the value of a field in the index, such as sales volume (the fields involved can only be numeric).
decay functions: score with a decay function, e.g. the closer to the city center, the higher the score.
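The field_value_factor computation used in the experiment below can be sketched directly: modifier "sqrt" with a factor gives sqrt(factor · field_value), and "missing" substitutes for documents that lack the field. Parameter defaults here mirror the query below; everything else is illustrative.

```python
import math

def field_value_factor(value, factor=1.2, modifier=math.sqrt, missing=1):
    # "missing" is used when the document has no value for the field
    v = value if value is not None else missing
    return modifier(factor * v)

print(round(field_value_factor(56), 4))    # a document with views = 56
print(round(field_value_factor(None), 4))  # a document without views
```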
Let's run an experiment. First, add a views field to the index:
PUT /gino_test/_mapping/tweet
{
  "properties": {
    "views": {
      "type": "long",
      "doc_values": true,
      "fielddata": { "format": "doc_values" }
    }
  }
}
Add the value of the number of views to the three pieces of data:
POST gino_test/tweet/1/_update {"doc": {"views": 56}}
What the final data looks like:
Execute a query:
{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "gino cup",
          "type": "cross_fields",
          "fields": ["text^8", "fullname^5"]
        }
      },
      "boost": 2,
      "functions": [
        {
          "field_value_factor": {
            "field": "views",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "filter": { "term": { "tags": { "value": "goods" } } },
          "weight": 4
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}
Query result: score_function.json
Scoring analysis:
score(q, d) = score_query(q, d) × ( score_fvf(`views`) × score_filter(`tags: goods`) )
score_mode specifies how the scores of multiple FUNCTIONs are combined. Note that the magnitudes of different FUNCTION scores may differ greatly.
boost_mode specifies how the function score and the query score are combined; again, pay attention to the magnitude of the scores.
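The two-stage combination can be sketched as follows. Only a few modes are modeled, and the input scores are illustrative; this is not how Elasticsearch implements it internally.

```python
import math

def combine(scores, mode):
    # fold a list of scores according to a score_mode/boost_mode keyword
    if mode == "multiply":
        return math.prod(scores)
    if mode == "sum":
        return sum(scores)
    if mode == "max":
        return max(scores)
    raise ValueError(f"unsupported mode: {mode}")

def function_score(query_score, function_scores,
                   score_mode="multiply", boost_mode="multiply"):
    # 1) score_mode folds the individual FUNCTION scores together
    # 2) boost_mode combines that result with the query score
    fs = combine(function_scores, score_mode)
    return combine([query_score, fs], boost_mode)

# query scored 1.5; field_value_factor gave ~8.2; tags:goods weight is 4
print(round(function_score(1.5, [8.2, 4.0]), 2))
```

This makes the magnitude warning concrete: with multiply/multiply, a large field_value_factor output dominates the final score unless tamed by a modifier or max_boost.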
Rescore: the re-scoring mechanism
Introduction to ES official website: Rescoring | Elasticsearch Reference [2.3] | Elastic
The rescoring mechanism is not applied to all documents. For example, to return the top 10 results, each shard first retrieves its top documents according to the default rules, then applies the rescore rules to re-score them before returning them to the coordinating node for final sorting.
Rescore supports multiple rule calculations, as well as combining with the original scores (weighted sums, etc.).
Rescore should perform better because fewer documents are scored, but since it affects global ordering, you should consider the actual scenario.
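The idea can be sketched as follows: only a top window of first-pass hits is re-scored, and the first-pass and rescore scores are combined with weights. The hit data and second-pass scorer are made up for illustration.

```python
# Rescoring sketch: re-score only the top `window_size` hits, combining
# first-pass and second-pass scores with weights, then re-sort the window.

def rescore(hits, window_size, rescore_fn,
            query_weight=1.0, rescore_weight=1.0):
    # hits: list of (doc_id, first_pass_score), sorted descending
    window = hits[:window_size]
    rescored = [(doc, query_weight * s + rescore_weight * rescore_fn(doc))
                for doc, s in window]
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored + hits[window_size:]          # tail keeps first-pass order

hits = [("d1", 3.0), ("d2", 2.5), ("d3", 1.0)]
# expensive second-pass score computed only for the top 2 hits
print(rescore(hits, 2, lambda d: {"d1": 0.1, "d2": 1.0}[d]))
```

Note how d3, outside the window, keeps its first-pass score: this is exactly why rescore results must be interpreted carefully when global ordering matters.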