What does an Elasticsearch query problem analysis look like?

Updated 2025-07-02. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/01 report.

This article works through two Elasticsearch query problems in which the results look wrong at first glance but turn out to be exactly what the query asked for.

A text matching problem that looks irrelevant but is actually quite reasonable

Why does a search for the mobile phone number "15858593403" return a seemingly unrelated result first, with the expected result only in second place?

The result list (a screenshot in the original) does show an unrelated mobile phone number in first place, and its matching score really is higher than that of the expected result in second place. Why does this happen?

In the query results, the matching score _score is not a round number, which suggests that it was produced by a scored match query. The corresponding request (the original showed only its core part) bears this out.

First, a review of the match and match_phrase query conditions. Both begin by analyzing the query string into terms. match then retrieves the document id set (id_set) of each term and combines them, taking the intersection when every term is required. match_phrase additionally computes the position gap (position_gap) between the terms to decide whether they appear in the same order as in the query.
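The difference between the two conditions can be sketched in a few lines of Python. This is only a simplified model, not how Lucene implements either query: it assumes whitespace tokenization, assumes every term is required for match, and assumes no slop for match_phrase. The function names are hypothetical.

```python
def tokens(text):
    """Split text into lowercase whitespace tokens (a stand-in for an analyzer)."""
    return text.lower().split()


def match_all_terms(query, doc):
    """match-style check: every query token occurs somewhere in the document."""
    return set(tokens(query)) <= set(tokens(doc))


def match_phrase(query, doc):
    """match_phrase-style check: query tokens occur consecutively, in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "fox brown quick"
print(match_all_terms("quick brown fox", doc))  # True: order is ignored
print(match_phrase("quick brown fox", doc))     # False: order differs
print(match_phrase("brown quick", doc))         # True: adjacent and in order
```

The same document can therefore satisfy a match condition while failing a match_phrase condition on the same terms.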

The mobile condition requires the digits to appear in order, and the first result clearly does not satisfy it, so the hit most likely came from the name field. The condition on that field requires that when the analyzed query has 13 terms or fewer, all of them must match, and when it has more than 13 terms, 90% of them must match (in minimum_should_match syntax this would be written 13<90%).

A brief aside on the minimum_should_match parameter. A match condition behaves much like a bool query: by default, each term produced by analysis is treated as a clause in a should condition, so the matching degree expressed by minimum_should_match is actually converted into a number of clauses that must match. The conversion is roughly max(1, floor(minimum_should_match * num_of_clauses)).
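As a rough illustration, the conversion can be written as a small Python function. It follows the article's approximate formula rather than Elasticsearch's actual implementation, and it handles only positive percentages and the conditional form such as 13<90%. The function name required_clauses is hypothetical.

```python
def required_clauses(spec, num_clauses):
    """Number of should-clauses that must match, per the article's approximation.

    spec is a percentage such as "90%" or a conditional such as "13<90%"
    (13 clauses or fewer: all are required; more than 13: apply 90%).
    Assumption: only positive percentages are supported in this sketch.
    """
    if "<" in spec:
        bound, rest = spec.split("<", 1)
        if num_clauses <= int(bound):
            return num_clauses
        return required_clauses(rest, num_clauses)
    percent = int(spec.rstrip("%"))
    # Integer arithmetic gives floor(percent / 100 * num_clauses) without float error.
    return max(1, (percent * num_clauses) // 100)


print(required_clauses("13<90%", 11))  # 11: within the 13-clause range, all required
print(required_clauses("13<90%", 14))  # 12: floor(0.9 * 14)
print(required_clauses("13<90%", 20))  # 18: floor(0.9 * 20)
```

An 11-digit phone number split into 11 clauses falls in the first branch, so all 11 clauses must match.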

Now take a look at the analyzer configured on the name field:

The name field is tokenized one character at a time. 18688034559 therefore becomes [1, 8, 6, 8, 8, 0, 3, 4, 5, 5, 9], or [0, 1, 3, 4, 5, 6, 8, 9] after removing duplicates, and the query string 15858593403 becomes [1, 5, 8, 5, 8, 5, 9, 3, 4, 0, 3], or [0, 1, 3, 4, 5, 8, 9] after removing duplicates. The mobile phone number has 11 digits, which is within the 13-term range, so every term must match. The tokens above clearly meet that requirement: every digit of the query also appears in 18688034559.
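The token comparison is easy to verify by hand or with a few lines of Python. The sketch below assumes that single-character tokenization is equivalent to splitting the string into its characters. The variable names are illustrative only.

```python
query = "15858593403"      # the number being searched for
first_hit = "18688034559"  # the number found in the first result's name field

query_tokens = list(query)  # one token per character

print(len(query_tokens))                    # 11 clauses, within the 13-clause range
print(sorted(set(query_tokens)))            # ['0', '1', '3', '4', '5', '8', '9']
print(sorted(set(first_hit)))               # ['0', '1', '3', '4', '5', '6', '8', '9']
print(set(query_tokens) <= set(first_hit))  # True: every query token is present
print(query in first_hit)                   # False: the digits are not in phrase order
```

Every query token is present in the first result, so the all-terms-must-match rule is satisfied even though the two numbers are different.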

That settles it. The first result looks irrelevant, but the search engine faithfully executed our query and returned a perfectly reasonable result. This kind of text matching problem is easy to run into in real search applications, so it is worth understanding.

Note: the mobile phone numbers in this case are all anonymized fake numbers.

A statistical query problem that looks unreasonable but is actually quite logical

Why are there two records with the same yyyy_id, yet that yyyy_id does not show up in the statistical query?

Go straight to the query request:

It is a relatively simple terms aggregation: select the records where xxxx_id is 8 and yyyy_id > 0, aggregate them (group by) on yyyy_id, and return the buckets in descending order of count.

By that logic, if two records share a yyyy_id, its bucket should appear at the top of the aggregation results. In practice, the expected yyyy_id does not appear in the results at all, and it only shows up after the gt bound on yyyy_id in the range condition is narrowed (gt: 0 => gt: 1224545). What is going on?

First, some "hidden rules" of the terms aggregation. By default it returns only the top 10 buckets (size=10). You can widen the top N by setting the size parameter, and setting it explicitly is recommended anyway, so that anyone reading the query knows the limit and does not fall into the same trap.
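A request matching that description, with size spelled out, might look like the following Python dict. This is a reconstruction from the prose, not the original request: the aggregation name by_yyyy_id and the top-level "size": 0 are assumptions, and only the field names and conditions come from the text.

```python
import json

request_body = {
    "size": 0,  # assumption: return no hits, only aggregation results
    "query": {
        "bool": {
            "filter": [
                {"term": {"xxxx_id": 8}},
                {"range": {"yyyy_id": {"gt": 0}}},
            ]
        }
    },
    "aggs": {
        "by_yyyy_id": {  # hypothetical aggregation name
            "terms": {
                "field": "yyyy_id",
                "size": 10,  # stated explicitly so the top-N limit is visible
                "order": {"_count": "desc"},
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```

Writing size out, even at its default value, documents the limit for the next reader of the query.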

Even so, the count for two records with the same yyyy_id should still make the top 10, which brings up the second point. The test_10 index has 10 shards. A terms aggregation first computes a top-N list on each shard in parallel, and then merges the per-shard lists into the final top 10 (the length of the per-shard list is controlled by shard_size, which in recent versions defaults to somewhat more than size, but the principle is the same). This can be confusing, so the original illustrated it with a simple diagram:

Suppose the index has three shards and we run a top-5 aggregation to find records with a duplicated value, where most values are unique. If the records sharing a value are spread across different shards and sort near the bottom on each of them, they may not make even a single shard's top 5, so the expected value L naturally never appears in the final result. Narrowing the gt condition recalls the expected result mainly because it shrinks the set of buckets, which gives L a chance to enter the per-shard top 5.
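The three-shard scenario can be reproduced with a small Python simulation. It is a rough model under stated assumptions: each shard returns exactly n buckets, ties on count are broken by ascending value, and the data (including L = 900) is made up for illustration. The function names are hypothetical.

```python
from collections import Counter


def shard_top_n(values, n):
    """Top-n (value, count) pairs on one shard; ties broken by ascending value."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def terms_agg(shards, n, gt=0):
    """Merge per-shard top-n lists into a global top-n (simplified model)."""
    merged = Counter()
    for values in shards:
        for value, count in shard_top_n([v for v in values if v > gt], n):
            merged[value] += count
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


L = 900  # the only value stored twice, once on each of two shards
shards = [
    [101, 102, 103, 104, 105, 106, L],
    [201, 202, 203, 204, 205, 206, L],
    [301, 302, 303, 304, 305, 306],
]

print(terms_agg(shards, n=5))          # L is absent: it missed every shard's top 5
print(terms_agg(shards, n=5, gt=800))  # [(900, 2)]: the narrower range lets L in
```

In this model L has the highest true count, but on each shard it is tied with many count-1 values and sorts last, so it is cut before the merge step ever sees it.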

