How to use inverted index to improve the efficiency of string search 07/01 Update SLTechnology News&Howtos

How to use inverted index to improve the efficiency of string search

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

This article mainly explains "how to use inverted index to improve the efficiency of string search". The content of the explanation in this article is simple and clear, and it is easy to learn and understand. let's study and learn how to use inverted index to improve string search efficiency.

In Python, to determine whether a string is in another string, we can use the in keyword, for example:

A ='do you think I should buy an Apple computer or a windows computer?' If 'Apple' in a:. Print ('the word Apple is in the a string'). The word Apple is in the a string.

If you have multiple sentences and multiple keywords, you can use a for loop to do this:

Sentences = ['do you think I should buy Apple computer or windows computer?' Life is too short, I use Python', your TM just shows off all the time, no, I don't mean you, I mean everyone here is rubbish. 'I cnm you big sb'] keywords = [' garbage', 'cnm',' sb', 'TM'] for sentence in sentences: for keyword in keywords: if keyword in sentence: print (sentence: [{sentence}] contains dirty words: [{keyword}]')

The running effect is shown in the following figure:

Now if you have 100000000 sentences and 1000 keywords, you need to compare them 100 billion times to complete the query. This time is too expensive, if Python can run 5 million queries per second (actually not so fast), then 100 billion queries will take 20000 seconds, nearly 6 hours. Moreover, because the time complexity of the in keyword is O (n), if there are a large number of long sentences, the query time will be longer.

For example, we need to look for cnm in the following sentence.

If you use the conventional method, then what we do is:

Hongmeng official Strategic Cooperation to build HarmonyOS Technology Community

Cnm, do you think I should buy Apple computer or windows computer? Did you hit it? No!

Cnm, do I use Python when life is too short? No!

……

Cnm is my cnm, are you a big sb? Yes!

Cnm in the class, Good Morning! Is it? No!

The word CMN is on the Internet. Is it Network in English? No!

Cnm, I don't want to hear anyone say cnm! Is it? Yes!

So you know that cnm is in these two sentences with subscript 4 and 7 in the sentences list.

Next, let's change to a more stupid-looking approach:

To find out which sentences cnm is in, you can turn into: look for the letters C, N, and M in which sentences. Then, find the sentence with these three letters at the same time:

Hongmeng official Strategic Cooperation to build HarmonyOS Technology Community

C in 4, 7 sentences

N in 4, 6, 7 sentences.

M in 2, 4, 5, 5, 7 sentences

Therefore, {4,7} intersects {4,6,7} and {4,5,7} to get {4,7}, which means that the word cnm is likely to be in the fourth and seventh sentences.

Why is it possible? Because if you add one more sentence: today we will learn three words: Cat, Network, Morning. This sentence would also be thought to contain the word cnm, but in fact it only contains the letters C, N and M.

Next, someone will ask: when you directly query cnm, you only need to query it 8 times. Now you have to query C N M 24 times. Did you fix the bug where the query time is too short?

Before we answer this question, let's look at another question.

In Python, when I want to judge whether the letter C is in a sentence, I don't want to hear someone say cnm! How does Python work inside?

In fact, how it works can be written as follows:

Sentence ='I don't want to hear anyone say CNMM' For char in sentence: if char = = 'compressed: print (' C is in this string') break

If you want to judge whether C, N, M are all in this string, I don't want to hear anyone say cnm! The same string is traversed 3 times. Is there any way to reduce this seemingly redundant traversal operation?

If we put, I don't want to hear someone say cnm! What happens when this sentence is translated into a dictionary:

Sentence ='I don't want to hear anyone say CNMM' Sentence_dict = {char: 1 for char in sentence} for letter in 'cnm': if letter in sentence_dict: print (f' letter {letter} in the sentence!')

In this way, you only need to traverse the sentence once when generating the dictionary, reducing the number of redundant traverses for 2 times. Moreover, it is very fast to judge whether an element is in the dictionary with a time complexity of O (1).

I don't want to hear anyone say cnm! The generated dictionary is {'I': 1,'no': 1, 'think': 1, 'listen': 1,'to': 1, 'you': 1, 'people': 1, 'say': 1, 'Che: 1,' Nice: 1, 'MOS: 1,'!': 1}. So if you want to treat all the sentences in the list in this way, how do you store them? At this point, the dictionary's Key is each character, and Value can be the index of each sentence in the original list:

Sentences = ['do you think I should buy Apple computer or windows computer?' Life is too short, I use Python', your TM just shows off all the time, no, I don't mean you, I mean everyone here is rubbish. ,'I cnm, you big sb', 'fellow students, Good Morning The word "network" is called Network', in English. "I don't want to hear anyone say CNMM"] Index_dict = {} for index, line in enumerate (sentences): print (index, line) for char in line: if not char.strip (): continue if char in index_dict: index_ requests [char] .add (index) else: index_ certificates [char] = {index}

The generated dictionary is:

{4}, {4}, {4, 7}, {5}, {2, 4, 5, 7}, {4, 6, 7}, {4, 6, 7}, {1}, {1}, {4}, {2}, {0,5}, {6}, {5}, {6}, {5}, {6}, {6}, {5} 'hints: {1},' iTunes: {0,5}, 'kits: {6},' nails: {0,1,5}, 'oaths: {0,1,5,6},' rations: {5,6}, 'slots: {0},' tweets: {1,6}, 'winters: {0,6},' yearly: {1},'. : {3}, 'one': {2},'no': {3,7},'a': {4,6}, 'for': {6}, 'buy': {0}, 'people': {1,7}, 'bit': {3,5}, 'you': {0,2,3,4},'to': {2,7}, 'single': {6}, 'only': {2} 'ge': {3,5}, 'Tong': {5}, 'listen': {7}, 'where': {0},'in': {3}, 'rooms': {3}, 'days': {3}, 'big': {4}, 'days': {2}, 'learn': {5},'it': {6}, 'seat': {3},'de': {2} Think: {7},'I': {0, 1, 3, 4, 7}, 'Wen': {6}, 'Yes': {0,3}, 'late': {2}, 'You': {7}, 'Guo': {0},'se': {2}, 'sheng': {1}, 'Yong': {1}, 'electricity: {0},' of: {3,6} 'Zhi': {2}, 'short': {1}, 'Luo': {6}, 'net': {6}, 'brain': {0}, 'bitter': {1}, 'Ying': {6}, 'Ping': {0},'ci': {6}, 'Shuo': {0,3,7}, 'also': {0}, 'this': {6} 'Dao': {2},'du': {3},'!' : {5,7},',': {0,3,5,6},'?' : {0}}

The generated dictionary is so long that it looks terrible. But don't panic, after all, it's not your human flesh search. For Python, no matter how many Key there are in the dictionary, the query time is the same.

Now, we are looking for C, N, M, so the code can be written as follows:

Index_list = [] for letter in 'cnm': index_list.append (index_dict.get (letter, {})) common_index = set.intersection (* index_list) # intersects the index containing each letter to get the sentence print which contains three letters at the same time (f' these sentences contain `C`, `N`, `M`: {common_index}') for index in common_index: print

The running results are as follows:

So, for a set of keywords that need to be queried, you can also search like this:

Keywords = ['garbage', 'cnm',' sb', 'TM'] for word in keywords: index_list = [] for letter in word: index_list.append (index_dict.get (letter, {})) common_index = set.intersection (* index_list) print (f' > > these sentences also contain: {word}') for index in common_index: print

The running effect is shown in the following figure:

Thank you for reading, the above is the content of "how to use inverted index to improve the efficiency of string search". After the study of this article, I believe you have a deeper understanding of how to use inverted index to improve the efficiency of string search, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.